NVIDIAGameWorks / kaolin-wisp

NVIDIA Kaolin Wisp is a PyTorch library powered by NVIDIA Kaolin Core to work with neural fields (including NeRFs, NGLOD, instant-ngp and VQAD).

octree creation consumes too much memory #125

Open samuele-bortolato opened 1 year ago

samuele-bortolato commented 1 year ago

Thanks again for the amazing library!

I've been using it with other students for a project for my master's degree. The problem is that, as is usual for students, we don't have super powerful machines, and we had to use laptops with 2 or 4 GB of VRAM.

I know these kinds of machines don't offer the best experience possible, and even simple runs take a long time, but they should at least work, even if the quality has to be lowered a lot.

The problem is that most of the grids in the library require octrees, and the kaolin implementation for creating the octrees is highly inefficient in terms of space.

Digging a bit deeper, I think the cause is in ops/conversions/mesh_to_spc/mesh_to_spc.cpp from kaolin:

at::Tensor points_to_octree(
    at::Tensor points,
    uint32_t level) {
#ifdef WITH_CUDA
    uint32_t psize = points.size(0);
    // These buffers are sized to the compile-time maxima, independent of psize:
    at::Tensor morton = at::zeros({KAOLIN_SPC_MAX_POINTS}, points.options().dtype(at::kLong));
    at::Tensor info = at::zeros({KAOLIN_SPC_MAX_POINTS}, points.options().dtype(at::kInt));
    at::Tensor psum = at::zeros({KAOLIN_SPC_MAX_POINTS}, points.options().dtype(at::kInt));
    at::Tensor octree = at::zeros({KAOLIN_SPC_MAX_OCTREE}, points.options().dtype(at::kByte));
    at::Tensor pyramid = at::zeros({2, level+2}, at::device(at::kCPU).dtype(at::kInt));

    point_data* d_points = reinterpret_cast<point_data*>(points.data_ptr<short>());
    morton_code* d_morton = reinterpret_cast<morton_code*>(morton.data_ptr<int64_t>());
    uint32_t*  d_info = reinterpret_cast<uint32_t*>(info.data_ptr<int>());
    uint32_t*  d_psum = reinterpret_cast<uint32_t*>(psum.data_ptr<int>());
    uchar* d_octree = octree.data_ptr<uchar>();
    int*  h_pyramid = pyramid.data_ptr<int>();
    void* d_temp_storage = NULL;
    uint64_t temp_storage_bytes = GetStorageBytes(d_temp_storage, d_morton, d_morton, KAOLIN_SPC_MAX_POINTS);
    at::Tensor temp_storage = at::zeros({(int64_t)temp_storage_bytes}, points.options().dtype(at::kByte));
    d_temp_storage = (void*)temp_storage.data_ptr<uchar>();

    uint32_t osize = PointToOctree(d_points, d_morton, d_info, d_psum, d_temp_storage, temp_storage_bytes,
            d_octree, h_pyramid, psize, level);

    return octree.index({Slice(KAOLIN_SPC_MAX_OCTREE - osize, None)});
#else
  AT_ERROR("points_to_octree not built with CUDA");
#endif
}

which initializes the buffer tensors at the maximum size regardless of the size of the input and of the available memory, making our low-memory GPUs crash with OOM (it tries to allocate a few GB of memory).
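To put a number on it (a back-of-envelope only: the exact KAOLIN_SPC_MAX_POINTS value depends on the kaolin build, so 2**27 below is purely illustrative):

# Rough size of the fixed buffers above, assuming an illustrative
# KAOLIN_SPC_MAX_POINTS = 2**27 (check kaolin's headers for the real value).
MAX_POINTS = 2 ** 27
morton_bytes = MAX_POINTS * 8  # int64 morton codes
info_bytes = MAX_POINTS * 4    # int32 info flags
psum_bytes = MAX_POINTS * 4    # int32 prefix sums
print((morton_bytes + info_bytes + psum_bytes) / 2 ** 30, "GiB")  # 2.0 GiB

Even before the octree buffer itself, that is already most of the VRAM on our laptops.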

In order to make it work on our laptops we made two changes in wisp/accelstructs/octree_as.py:

...

def from_quantized_points(cls, quantized_points, level) -> OctreeAS:
    """ Builds the acceleration structure from quantized (integer) point coordinates.

    Args:
        quantized_points (torch.LongTensor): 3D coordinates of shape [num_coords, 3] in
                                             integer coordinate space [0, 2**level]
        level (int): The depth of the octree.
    """
    # octree = spc_ops.unbatched_points_to_octree(quantized_points, level, sorted=False)
    octree = quantized_to_octree(quantized_points, level)
    return OctreeAS(octree)

We had to use numpy because torch doesn't have an equivalent of packbits and unpackbits.
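For reference, this is roughly the shape of our quantized_to_octree (a simplified sketch from memory, not the exact code we committed; in particular the morton axis order and the LSB-first child-bit convention are assumptions that should be checked against kaolin's SPC format):

import numpy as np
import torch

def quantized_to_octree(quantized_points, level):
    # Deduplicate (optional if the caller already passes unique points) and move to CPU.
    pts = torch.unique(quantized_points, dim=0).cpu().numpy().astype(np.int64)

    # Interleave coordinate bits into morton codes (x taken as the highest bit
    # of each triple; the axis order is an assumption here).
    mort = np.zeros(len(pts), dtype=np.int64)
    for i in range(level):
        mort |= ((pts[:, 0] >> i) & 1) << (3 * i + 2)
        mort |= ((pts[:, 1] >> i) & 1) << (3 * i + 1)
        mort |= ((pts[:, 2] >> i) & 1) << (3 * i)
    mort = np.sort(mort)

    # Walk from the leaves up to the root, emitting one occupancy byte per node.
    levels = []
    for _ in range(level):
        parents, inverse = np.unique(mort >> 3, return_inverse=True)
        occupancy = np.zeros((len(parents), 8), dtype=np.uint8)
        occupancy[inverse, mort & 7] = 1  # flag each occupied child slot
        # packbits collapses the 8 flags of each node into its occupancy byte
        levels.append(np.packbits(occupancy, axis=1, bitorder='little').ravel())
        mort = parents

    # Concatenate top-down: root byte first, finest level last.
    octree = np.concatenate(levels[::-1])
    return torch.from_numpy(octree).to(quantized_points.device)

This only ever allocates a few arrays of size O(num_points) per level, so even millions of points fit in a few tens of MB, instead of the fixed multi-GB buffers above.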

The implementation of make_dense should be pretty close to optimal (unless there is a closed-form formula for the number of elements of the octree, but the improvement would be marginal).
The implementation of from_quantized_points is definitely not optimal; a custom CUDA kernel would do the job much better, but I don't have time to write one at the moment, and I probably don't have enough CUDA experience to optimize it properly.
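For what it's worth, the fully dense case does admit a closed form: a dense octree has 8**i nodes at level i, each stored as one all-ones occupancy byte, so the total is (8**level - 1) // 7 bytes. Something like this sketch would do (make_dense_octree is a hypothetical name, not wisp's actual API):

import torch

def make_dense_octree(level, device='cuda'):
    # Every node at levels 0..level-1 exists and has all 8 children occupied,
    # so every byte of the octree is 0xFF.
    num_bytes = (8 ** level - 1) // 7  # == sum(8**i for i in range(level))
    return torch.full((num_bytes,), 255, dtype=torch.uint8, device=device)

As said above, though, the improvement over building it iteratively is marginal.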

I'm not sure whether the implementation was done that way to achieve maximum speed at the expense of memory, but there should be an option to use a low-memory algorithm when there is not enough VRAM. I don't know if you are in touch with the kaolin team and can report the problem to them as well, but a flag to turn on low-memory octree computation would be a nice-to-have.

Hope this helps somebody.

Also, I'm not really used to GitHub yet. While we were working on the project we realized there were several small things we would change to make the library more flexible or more efficient. Should I open a separate issue for each of them, or can I just dump them in a single issue? Or should I use pull requests?

Thanks again for your time
samuele-bortolato commented 1 year ago

Actually, there was no need to compute the unique:

pts = np.array(torch.unique(quantized_points, dim=0), dtype=np.uint8)[:, None]

can be simplified to

pts = np.array(quantized_points, dtype=np.uint8)[:, None]
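One caveat worth flagging with this shortcut: uint8 only represents coordinates in [0, 255], so the cast assumes level <= 8. For deeper octrees a wider dtype would be needed before packing bits, e.g. something like:

# Hypothetical guard: coordinates at level > 8 overflow a uint8.
assert level <= 8, "uint8 coordinates only cover octree levels up to 8"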
orperel commented 1 year ago

Hi @samuele-bortolato, thanks for the useful feedback! Your issue is relevant to kaolin, but kaolin's maintainers overlap with kaolin-wisp's, so I just forwarded your message.

Feel free to post any other suggestions or issues on kaolin's github: https://github.com/NVIDIAGameWorks/kaolin