octree creation consumes too much memory

samuele-bortolato commented 1 year ago

Thanks again for the amazing library!

I've been using it with other students for a project for my master degree. The problem is that as it's usual for students, we don't have super powerfull machines, and we had to use laptops with 2 or 4GB of VRAM.

I know these kinf of machines don't offer the best experience possible and even simple runs will take a lot of times but they should at least work, at least lowering a lot the quality.

The problem is that most of the grids in the library require octrees, and the kaolin implementation for creating the octrees is highly inefficient in terms of space.

Digging a bit deeper I think I think the cause is in _ops/conversions/mesh_to_spc/mesh_tospc.cpp from kaolin

at::Tensor points_to_octree(
    at::Tensor points,
    uint32_t level) {
#ifdef WITH_CUDA
    uint32_t psize = points.size(0);
    at::Tensor morton = at::zeros({KAOLIN_SPC_MAX_POINTS}, points.options().dtype(at::kLong));
    at::Tensor info = at::zeros({KAOLIN_SPC_MAX_POINTS}, points.options().dtype(at::kInt));
    at::Tensor psum = at::zeros({KAOLIN_SPC_MAX_POINTS}, points.options().dtype(at::kInt));
    at::Tensor octree = at::zeros({KAOLIN_SPC_MAX_OCTREE}, points.options().dtype(at::kByte));
    at::Tensor pyramid = at::zeros({2, level+2}, at::device(at::kCPU).dtype(at::kInt));

    point_data* d_points = reinterpret_cast<point_data*>(points.data_ptr<short>());
    morton_code* d_morton = reinterpret_cast<morton_code*>(morton.data_ptr<int64_t>());
    uint32_t*  d_info = reinterpret_cast<uint32_t*>(info.data_ptr<int>());
    uint32_t*  d_psum = reinterpret_cast<uint32_t*>(psum.data_ptr<int>());
    uchar* d_octree = octree.data_ptr<uchar>();
    int*  h_pyramid = pyramid.data_ptr<int>();
    void* d_temp_storage = NULL;
    uint64_t temp_storage_bytes = GetStorageBytes(d_temp_storage, d_morton, d_morton, KAOLIN_SPC_MAX_POINTS);
    at::Tensor temp_storage = at::zeros({(int64_t)temp_storage_bytes}, points.options().dtype(at::kByte));
    d_temp_storage = (void*)temp_storage.data_ptr<uchar>();

    uint32_t osize = PointToOctree(d_points, d_morton, d_info, d_psum, d_temp_storage, temp_storage_bytes,
            d_octree, h_pyramid, psize, level);

    return octree.index({Slice(KAOLIN_SPC_MAX_OCTREE - osize, None)});
#else
  AT_ERROR("points_to_octree not built with CUDA");
#endif
}

that initializes the buffer tensors at the maximum size regardless of the size of the input and of the available size, making our low memory gpus crash OOM (it triees to initialize a few GB of memory).

In order to make it work on our laptops we made two changes in _wisp/accelstructs/octreeas.py, :

modify the make_dense function (in order to be able to create the octree at the beginning), that was calling points_to_octree without any reason since we already know the size of the octree and it's filled with 255.

@classmethod
def make_dense(cls, level) -> OctreeAS:
    """ Builds the acceleration structure and initializes full occupancy of all cells.

    Args:
        level (int): The depth of the octree.
    """
    t=0
    for i in range(level):
        t+=(2**i)**3
    octree = torch.ones(t, dtype=torch.uint8, device='cuda')*255
    #octree = wisp_spc_ops.create_dense_octree(level)

    return OctreeAS(octree)

modify from_quantized_points (in oder to be able to prune the structure), here we just plainly had to rewrite the function in a way that didn't crash the gpu


import numpy as np
def quantized_to_octree(quantized_points, level):

if quantized_points.device!='cpu':
    quantized_points=quantized_points.detach().cpu()

pts=np.array(torch.unique(quantized_points,dim=0), dtype=np.uint8)[:,None]
bits = np.unpackbits(pts, 1)[...,[0,1,2]]
bits = bits.reshape(len(bits),-1)

m_idx=bits@np.power(2,np.arange(24)[ : :-1])

oct=np.zeros((2**level)**3, dtype=bool)
oct[m_idx]=1
octree=[]

for _ in range(level):
    octL = np.packbits(oct.reshape(-1,8),-1,bitorder='little')
    oct = octL>0
    octree.insert(0,octL[oct])

return torch.tensor(np.concatenate(octree), dtype=torch.uint8, device='cuda')

...

def from_quantized_points(cls, quantized_points, level) -> OctreeAS: """ Builds the acceleration structure from quantized (integer) point coordinates.

    Args:
        quantized_points (torch.LongTensor): 3D coordinates of shape [num_coords, 3] in
                                             integer coordinate space [0, 2**level]
        level (int): The depth of the octree.
    """
    #octree = spc_ops.unbatched_points_to_octree(quantized_points, level, sorted=False)
    octree = quantized_to_octree(quantized_points, level)
    return OctreeAS(octree


we had to use numpy because torch doesn't have an equivalent for _packbits_ and _unpackbits_.

The implementation of make_dense should be pretty close to optimal (unless there is a closed form formula to compute the number of elementr of the octree, but the improvement is marginal).
The implementation of from_quantized_points is definitely not optimal, a custom cuda kernel would do the job way better, but I don't have the time at the moment to write one, and I probably don't have enough experience in cuda to optimize it properly.

I'm not sure if the implementation done in that way was done to achive maximum speed at the expense of memory, but there should be an option to use a low memory 
algorithm when there is not enough VRAM. 
I don't know if you are also in touch with the kaolin team, and also to them the problem, but would be a nice to have if there was a flag to turn on low memory octree computation.

Hope I helped sombody

Also, I'm not really used to github yet, while we were working of the project we realized there were several small things we would have changed in order to make the library more flexible or more efficient. Should I make a different post for each one of them or can I just dump them in a single issue? or should I use pull requests?

Thanks again for your time

samuele-bortolato commented 1 year ago

Actually, there was no need to compute the unique,

pts=np.array(torch.unique(quantized_points,dim=0), dtype=np.uint8)[:,None]

can be modified in

pts=np.array(quantized_points, dtype=np.uint8)[:,None]

orperel commented 1 year ago

Hi @samuele-bortolato , thanks for the useful feedback! Your issue is relevant to kaolin, but kaolin's maintainers overlap with kaolin-wisp so I just forwarded your message.

Feel free to post any other suggestions or issues on kaolin's github: https://github.com/NVIDIAGameWorks/kaolin

NVIDIAGameWorks / kaolin-wisp

octree creation consumes too much memory #125