Make Tree an AllocatorAwareContainer [REQUEST]

BenFrantzDale commented 3 years ago

Is your feature request related to a problem? Please describe.

https://en.cppreference.com/w/cpp/named_req/AllocatorAwareContainer

As far as I can tell, OpenVDB does all of its memory management with new and delete as in child = new ChildT(xyz, mNodes[n].getValue(), mValueMask.isOn(n)); and delete mNodes[n].getChild();. C++17 in particular provided stateful allocators designed for use with pmr allocators. Things like memory pools and monotonic_buffer_resource seem ideal for some usage patterns of OpenVDB.

Describe the solution you'd like

I think this would involve adding a template parameter to InternalNode:

template<typename _ChildNodeType, Index Log2Dim, typename Allocator = AllocatorForChildNodeT<_ChildNodeType>>
class InternalNode 
{
public:
    using allocator_type = Allocator;

where AllocatorForChildNodeT would be a metafunction so that if _ChildNodeType is an InternalNode it rebinds that allocator and if it's LeafNode<...> it is std::allocator<LeafNode<...>> and adding const Allocator& alloc = Allocator() arguments to InternalNode's constructor.

I think that then gets propagated up the tree so RootNode would do the same thing and. Likewise Tree (which has its RootNode by value) would just have using allocator_type = typename RootNodeType::allocator_type; and would add const allocator_type& alloc = allocator_type() arguments to its c'tors.

We'd have to add the member allocators and change the child allocation to use the allocators and change things like

template<typename T, Index N1=5, Index N2=4, Index N3=3, typename Allocator=std::allocator<T>>
struct Tree4 {
    using Type = Tree<RootNode<InternalNode<InternalNode<LeafNode<T, N3>, N2, std::allocator_traits<Allocator>::template rebind_alloc<LeafNode<T, N3>>, N1>>>;
};

Describe alternatives you've considered

I haven't experimented with this, and so far for my purposes I haven't found need of this... yet. But everything I've read/watched about allocators makes me think it's relevant to OpenVDB: From custom allocators for fixed-sized blocks to freeing large trees all at once.

Additional context

One particularly interesting trick with allocators is what John Lakos calls "winking out", where a large tree structure (sound familiar?) can be build with a memory resource and then when done, rather than freeing each node of the structure, the entire thing can be freed in one big free.

Idclip commented 3 years ago

Hi @BenFrantzDale thanks for submitting this! We discussed this in our TSC meeting last week (see notes under Forum here: https://raw.githubusercontent.com/AcademySoftwareFoundation/openvdb/master/tsc/meetings/2021-03-16.md). This is an interesting concept which some have experimented with in the past but, unfortunately, without the success what would warrant such a huge change to the library. I think everyone would be very interested should an implementation which demos a significant performance boost be presented, but it would have to be weighed against how much complexity it introduced.

I'm sure @kmuseth, @danrbailey, @jmlait will add to this should you have any more thoughts.

BenFrantzDale commented 3 years ago

That all makes sense. I haven’t tried it. The “wink out” optimization feels potentially interesting but again, I haven’t tried it.

BenFrantzDale commented 3 years ago

In particular, I'd be curious if people have profiled the destruction of enormous trees. I see that ~Tree() calls Tree::clear() which calls stealNodes in serial then

    tbb::parallel_for(tbb::blocked_range<size_t>(0, leafnodes.size()),
        DeallocateNodes<LeafNodeType>(leafnodes));

then same for internal nodes. Given someone put the tbb::parallel_for in there, I'm guessing that means it's perf-sensitive. It'd be pretty cool if in some cases all that work could just go away, leaving just a few frees of some big contiguous blocks of memory.

danrbailey commented 3 years ago

In particular, I'd be curious if people have profiled the destruction of enormous trees. I see that ~Tree() calls Tree::clear() which calls stealNodes in serial then
    tbb::parallel_for(tbb::blocked_range<size_t>(0, leafnodes.size()),
        DeallocateNodes<LeafNodeType>(leafnodes));
then same for internal nodes. Given someone put the tbb::parallel_for in there, I'm guessing that means it's perf-sensitive. It'd be pretty cool if in some cases all that work could just go away, leaving just a few frees of some big contiguous blocks of memory.

I have profiled this and it is comparatively expensive. A single contiguous block of memory is much cheaper to allocate and deallocate. However, it's no longer possible to dynamically change the topology of the Tree (or it becomes challenging to manage). Perhaps there's a middle-ground where larger blocks of memory could be allocated and freed(?) but I expect that it's still hard to do this efficiently especially when you typically need to allocate the memory long before you have any idea of what the usage pattern will be. In some very common patterns, I wouldn't be surprised to find that we had slowed down performance and increased peak memory footprint by attempting to implement this kind of optimization.

The only part of this that I think is potentially feasible is a "read-only" Tree where the buffers reference a single contiguous piece of memory. However, in this case the majority of the heap-allocated data lives at the LeafNode, so I think I would prefer a custom LeafNode implementation than adding a new allocator template argument to all levels of the Tree. Having said that, we have something similar in NanoVDB so not even sure if that makes sense in VDB too.

YoshuaNava commented 2 years ago

Hi, I'm testing OpenVDB for representing data with 3D localization & mapping as a direct application.

Out-of-the-box VDB works really well as a data storage mechanism, but it's leaf-allocation strategy means that in order to form a local tree, to e.g. perform nearest-neighbor search on the structure (which is a fairly common 3D localization task usually done with a KD-tree), there is a significant overhead. Thus, there is a shifts in resource utilization from the act of building a KD-tree, to that of allocating and deallocating a VDB with local environmental data.

The read-only tree proposed in https://github.com/AcademySoftwareFoundation/openvdb/issues/1002#issuecomment-809723426 would be an enabler to form trees from arbitrarily sampled data.

@danrbailey do you think this could be an interesting feature for OpenVDB? I would be glad to try and implement if you see value in it :slightly_smiling_face:

Idclip commented 2 years ago

@YoshuaNava out of interest what OS/allocator combination are you using? Have you tried profiling with jemalloc?

YoshuaNava commented 2 years ago

@Idclip

out of interest what OS/allocator combination are you using?

Ubuntu 20.04. In terms of allocation, I'm running the default OpenVDB tree allocator, no modification. Do note that my tests rely on a custom built VoxelContainer.

Have you tried profiling with jemalloc?

I have profiled Hotspots with Intel VTune, which lead to the following call-stack:

I'm building a relatively sparse grid 40-60 times per second, that resembles this sampling pattern (top left, figure a), where most of the occupancy happens at the leaf level, for the points, and the space in between the sensor frame and the captured points is not explicitly represented as allocated free-space. The grid has 7k to 10k points.

I was unaware of jemalloc and its profiling capabilities, but I could run a test. Should I use any specific configuration / setup?

Idclip commented 2 years ago

Have you tried profiling with jemalloc?

I meant have you tried using jemalloc as your memory allocator and profiling with that. jemalloc is a memory allocator that tends to be the go to choice for fast concurrent heap management. It may not be feasible for you to use it if you're using VDB in a wider system, but if you're not already using a scalable concurrent allocator I would strongly recommend trying jemalloc out:

https://packages.ubuntu.com/focal/libjemalloc-dev

VDB libraries will NOT link against a specific memory allocator (in other words, you don't need to rebuild libopenvdb). You'll need to either link this into your executabe or use preloading to make sure your application uses it.

dokipen3d commented 2 years ago

I'm not sure if this is even remotely close to what you want, but I've experimented with creating a vdb grid that is simply a view on existing allocated data...

https://github.com/AcademySoftwareFoundation/openvdb/discussions/1157

I wouldn't recommend using this in production as it relies on an internal friend class used for testing. it's a complete hack. although I wouldn't mind official support for setting a leaf nodes data pointer directly.

I'm using it to serialise another tiled voxel data structure. but it would allow you to create the vdb topology that just points to already allocated data that can be as contiguous as you want. I'm actually using plf::colony (current being proposed for the standard library as std::hive) which uses chunks of elements which gives a lot of cache friendly benefits as well as fast insertion/deletion with pool like memory reuse.

AcademySoftwareFoundation / openvdb