dendenxu / fast-gaussian-rasterization

A geometry-shader-based, globally CUDA-sorted, high-performance 3D Gaussian Splatting rasterizer. It can achieve a 5-10x rendering speedup over the vanilla diff-gaussian-rasterization.

Fast Gaussian Rasterization

https://github.com/dendenxu/fast-gaussian-splatting/assets/43734697/f50afd6f-bbd5-4e18-aca6-a7356a5d3f75

No backward pass is supported yet. We are still looking for a way to add one; depth peeling (as in 4K4D) is too slow. Discussion is welcome.

Installation

Install the latest release from PyPI:

pip install fast_gauss

Or the latest commit from GitHub:

pip install git+https://github.com/dendenxu/fast-gaussian-rasterization

No CUDA compilation is required to build fast_gauss, since it is purely shader-based for now.

Usage

Replace the original import of diff_gaussian_rasterization with fast_gauss.

For example, replace this:

from diff_gaussian_rasterization import GaussianRasterizationSettings, GaussianRasterizer

with this:

from fast_gauss import GaussianRasterizationSettings, GaussianRasterizer

And you're good to go.
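If the same codebase also has to run on machines where fast_gauss is not installed, the swap can be made at import time. The sketch below is one way to do that. `load_rasterizer` is a hypothetical helper that belongs to neither package, and it assumes only that both modules export the two class names shown above.

```python
import importlib

# Module names to try, in order of preference.
CANDIDATES = ("fast_gauss", "diff_gaussian_rasterization")


def load_rasterizer(candidates=CANDIDATES):
    """Return (module_name, GaussianRasterizationSettings, GaussianRasterizer).

    Hypothetical helper: tries each module in turn and uses the first one
    that can be imported.
    """
    for name in candidates:
        try:
            module = importlib.import_module(name)
        except ImportError:
            continue  # not installed, try the next candidate
        return name, module.GaussianRasterizationSettings, module.GaussianRasterizer
    raise ImportError(f"none of {candidates} could be imported")
```

A caller would then write `backend, GaussianRasterizationSettings, GaussianRasterizer = load_rasterizer()` and could log `backend` to see which rasterizer is active.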

Tips

Note: for the ultimate 5-10x performance increase, you'll need to let fast_gauss's shader directly write to your desired framebuffer.

Currently, fast_gauss tries to detect automatically whether you are managing your own OpenGL context (i.e. running a GUI) by checking for the OpenGL module during the import of fast_gauss. If it is detected, all rendering calls return None values and the shader writes directly to the framebuffer that is bound at the time of the draw call. Thus, in an OpenGL-based GUI environment, the outputs of the rasterizer are None and require no further processing.
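Code that has to run both offline and inside a GUI can branch on those None outputs. A minimal sketch follows, assuming the rasterizer returns a pair whose entries are all None in direct-write mode. The exact return shape is an assumption, and `split_outputs` is a hypothetical helper.

```python
def split_outputs(outputs):
    """Return (image, alpha), or None when nothing was returned.

    None means the result was (presumably) written straight to the bound
    framebuffer, so there is nothing left to post-process.
    """
    if outputs is None:
        return None
    if all(item is None for item in outputs):
        return None
    image, alpha = outputs[0], outputs[1]
    return image, alpha
```

A render loop could call `result = split_outputs(rasterizer(...))` and skip any tensor-to-screen copy whenever `result` is None.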

Note: the speedup is most visible when the pixel-to-point ratio is high.

That is, the speedup is most visible with large Gaussians and very high-resolution rendering. The CUDA-based software implementation is more sensitive to resolution, but for some extremely dense point clouds (> 1 million points) it might be faster. This is because the typical rasterization-based pipeline on modern graphics hardware is not well optimized for small triangles.
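As a rough way to judge which regime a scene falls in, the ratio can be estimated as rendered pixels divided by the number of Gaussians. This definition is an assumption made for illustration (the note above does not define the ratio formally), and `pixel_to_point_ratio` is a hypothetical helper.

```python
def pixel_to_point_ratio(width, height, num_points):
    """Rendered pixels per Gaussian (assumed definition, for illustration)."""
    if num_points <= 0:
        raise ValueError("num_points must be positive")
    return (width * height) / num_points


# A 4K frame with 1 million Gaussians: roughly 8.3 pixels per point.
ratio_4k = pixel_to_point_ratio(3840, 2160, 1_000_000)

# A 1080p frame with 5 million Gaussians: roughly 0.41 pixels per point.
ratio_1080p = pixel_to_point_ratio(1920, 1080, 5_000_000)
```

The first case leans toward the regime where the shader-based path should help most. The second is the dense, low-resolution case in which the CUDA implementation might win.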

Note: for best performance, cache the persistent results (for example, the 6 elements of the covariance matrix).

This is more of a general tip and not directly related to fast_gauss. However, the impact is more observable here since we haven't implemented a fast 3D covariance computation (from scales and rotations) in the shader yet. Only a PyTorch implementation is available for now.
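For reference, the quantity worth caching is the upper triangle of Sigma = R S S^T R^T, where R is the rotation given by the unit quaternion and S is the diagonal scale matrix. The sketch below computes it for a single Gaussian in plain Python so that it runs without dependencies. It assumes the (w, x, y, z) quaternion order and the (xx, xy, xz, yy, yz, zz) element order of the original 3DGS code, and `covariance6` is a hypothetical name. In practice you would run a batched PyTorch equivalent once and reuse the result in every frame.

```python
import math


def quaternion_to_rotation(q):
    """3x3 rotation matrix from a quaternion in (w, x, y, z) order."""
    w, x, y, z = q
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n  # normalize first
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def covariance6(scale, quaternion):
    """Six unique elements of Sigma = R S S^T R^T for one Gaussian."""
    R = quaternion_to_rotation(quaternion)
    # M = R @ diag(scale): scale each column of R.
    M = [[R[i][j] * scale[j] for j in range(3)] for i in range(3)]
    # Sigma = M @ M^T, which is symmetric by construction.
    sigma = [[sum(M[i][k] * M[j][k] for k in range(3)) for j in range(3)]
             for i in range(3)]
    return (sigma[0][0], sigma[0][1], sigma[0][2],
            sigma[1][1], sigma[1][2], sigma[2][2])
```

With an identity rotation the result is simply the squared scales on the diagonal, which makes for an easy sanity check.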

When the point count increases, even the smallest precomputation can help. One example is concatenating the base 0-degree SH parameter with the rest: that small operation might cost 10 ms on a 3060 with 5 million points. Store the concatenated tensor instead of concatenating in every frame.
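One way to apply this is to compute such values once and pass the stored result to the rasterizer in every frame. The sketch below shows the pattern with plain Python lists standing in for tensors. `StaticCache` and the variable names are hypothetical, and in a real pipeline the callback would be something like a `torch.cat` over the degree-0 and higher-degree SH tensors.

```python
class StaticCache:
    """Compute a value on first request and reuse it until invalidated."""

    def __init__(self):
        self._values = {}

    def get(self, key, compute):
        if key not in self._values:
            self._values[key] = compute()  # runs only on a cache miss
        return self._values[key]

    def invalidate(self, key):
        """Drop a cached value, e.g. after the model parameters change."""
        self._values.pop(key, None)


# Stand-ins for the per-point SH tensors (hypothetical names and data).
sh_dc = [[0.1], [0.2]]
sh_rest = [[0.3, 0.4], [0.5, 0.6]]
calls = {"count": 0}


def concat_sh():
    calls["count"] += 1
    return [dc + rest for dc, rest in zip(sh_dc, sh_rest)]


cache = StaticCache()
for _frame in range(3):
    shs = cache.get("shs", concat_sh)  # concatenated once, reused afterwards
```

After three frames `concat_sh` has run only once. Calling `cache.invalidate("shs")` forces a recomputation the next time the value is requested.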

Note: it's recommended to pass in a CPU tensor in the GaussianRasterizationSettings to avoid explicit synchronizations for even better performance.

Note: the second output of the GaussianRasterizer is no longer radii (we do not need them for a backward pass) but the alpha values of the rendered image.

The alpha channel content currently appears to be buggy and still needs to be debugged.

TODOs

Implementation

Guidelines

Why does a global sort work?

The OpenGL specification is somewhat vague on this, but the 4th paragraph of section 2.1 (chapter 2) of the specification (https://registry.khronos.org/OpenGL/specs/gl/glspec44.core.pdf) states:

Commands are always processed in the order in which they are received, although there may be an indeterminate delay before the effects of a command are realized. This means, for example, that one primitive must be drawn completely before any subsequent one can affect the framebuffer.

Thus, if the data in the vertex buffer (or the order specified by an index buffer) is sorted back-to-front and alpha blending is enabled, you can count on OpenGL to update the framebuffer in the correct back-to-front order.
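To make the argument concrete, the sketch below emulates in plain Python what happens at a single pixel: the primitives are sorted once, far to near, and then blended strictly in submission order. It assumes standard "over" blending (the equivalent of `glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA)`), which may differ from the blend state fast_gauss actually uses, and all function names are hypothetical.

```python
def back_to_front_order(depths):
    """Indices sorted from the farthest primitive to the nearest."""
    return sorted(range(len(depths)), key=lambda i: depths[i], reverse=True)


def blend_over(dst, src_color, src_alpha):
    """One blend step: src * alpha + dst * (1 - alpha), per channel."""
    return tuple(src_alpha * s + (1.0 - src_alpha) * d
                 for s, d in zip(src_color, dst))


def composite(depths, colors, alphas, background=(0.0, 0.0, 0.0)):
    """Blend all primitives covering one pixel in back-to-front order."""
    pixel = background
    for i in back_to_front_order(depths):
        # Each primitive is fully applied before the next one is touched,
        # mirroring the ordering guarantee quoted above.
        pixel = blend_over(pixel, colors[i], alphas[i])
    return pixel


# Three half-transparent primitives, given in arbitrary order.
depths = [1.0, 3.0, 2.0]  # view-space depth, larger is farther away
colors = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
alphas = [0.5, 0.5, 0.5]
pixel = composite(depths, colors, alphas)
```

Because the sort is global (one ordering for the whole frame rather than one per tile), a single sorted buffer and a single draw call are enough, provided the hardware applies primitives in submission order.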

Environment

This project requires an NVIDIA GPU that supports interop between CUDA and OpenGL. Thus, WSL and macOS are not supported. Tested on Linux and Windows.

For offline rendering (the drop-in replacement of the original CUDA rasterizer), we also need a valid EGL environment. It can sometimes be hard to set up for virtualized machines. Potential fix.

Credits

Inspired by those insanely fast WebGL-based 3DGS viewers:

Using the algorithm and improvements from:

CUDA-GL interop & EGL environment inspired by:

Citation

@misc{fast_gauss,  
    title = {Fast Gaussian Rasterization},
    howpublished = {GitHub},  
    year = {2024},
    url = {https://github.com/dendenxu/fast-gaussian-rasterization}
}