brandonpelfrey / Fast-BVH

A Simple, Optimized Bounding Volume Hierarchy for Ray/Object Intersection Testing
MIT License

Added Morton Curve #21

Closed ghost closed 4 years ago

ghost commented 4 years ago

This PR introduces a Morton curve class. It can be considered a step toward addressing item 4 from #14, the LBVH build algorithm.

A Morton curve is a space-filling curve that's useful in the construction of LBVHs. It produces a series of "Morton codes" that encode the X, Y, and Z values of each primitive's center point. The array returned by MortonCurve can be used to sort the primitives (or indices to the primitives) so that primitives that are near each other in space also end up near each other in the sorted order. Here's an illustration.

This example begins at the origin and ends in the upper-left quadrant.
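Concretely, the common 30-bit formulation of this encoding (a sketch of the usual bit-interleaving technique, not necessarily what this PR's MortonCurve class does internally) looks like this:

```cpp
#include <cstdint>

// Sketch of 30-bit 3D Morton encoding (10 bits per axis), the standard
// bit-interleaving formulation; the PR's MortonCurve class may differ.

// Spread the lower 10 bits of v so each bit is followed by two zero bits.
inline std::uint32_t expand_bits(std::uint32_t v) {
  v = (v * 0x00010001u) & 0xFF0000FFu;
  v = (v * 0x00000101u) & 0x0F00F00Fu;
  v = (v * 0x00000011u) & 0xC30C30C3u;
  v = (v * 0x00000005u) & 0x49249249u;
  return v;
}

// Interleave the quantized x, y, z coordinates (each in [0, 1023]).
inline std::uint32_t morton3(std::uint32_t x, std::uint32_t y, std::uint32_t z) {
  return (expand_bits(x) << 2) | (expand_bits(y) << 1) | expand_bits(z);
}
```

Sorting primitives by these codes groups spatially nearby center points together, which is exactly what an LBVH build wants.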

All three floating-point types are supported; each provides a level of accuracy that corresponds to its floating-point precision. The accuracy determines how closely the Morton code can pinpoint the center of a primitive. Here's the breakdown:

| Floating Point Type | Morton Code Accuracy |
| ------------------- | -------------------- |
| `float` | 0.90234375 |
| `double` | 0.9999523162841796875 |
| `long double` | 0.9999999999772626324556767940521240234375 |

What's currently missing from this PR:

Benchmarking

There needs to be a benchmark for this, using all three types. This will help determine whether or not the loops are being vectorized.

There may be subtle performance differences due to caching effects.

The benchmark will ideally use a randomly generated set of primitives, preferably on the order of a million.

I'm holding off on this until I figure out how the algorithm can be done in parallel.

Parallelism

If you look at the algorithm, it looks extremely easy to parallelize. It performs two passes over the primitives, in two different loops. The only challenge is that I'd like to support CUDA as well as standard CPU targets. This may end up as separate functions for OpenMP and CUDA, or it might end up being something different. We'll see!
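As a rough sketch of the CPU side, the first pass (finding the bounds of the center points) parallelizes naturally as an OpenMP min/max reduction. This is hypothetical illustration code, not the PR's actual implementation; `Vec3` and `scene_bounds` are stand-in names:

```cpp
#include <algorithm>
#include <cfloat>
#include <vector>

// Hypothetical sketch: parallelizing the bounds-finding pass with an
// OpenMP min/max reduction. Vec3 and scene_bounds are stand-in names,
// not the library's actual API.
struct Vec3 { float x, y, z; };

void scene_bounds(const std::vector<Vec3>& centers, Vec3& out_min, Vec3& out_max) {
  float min_x = FLT_MAX, min_y = FLT_MAX, min_z = FLT_MAX;
  float max_x = -FLT_MAX, max_y = -FLT_MAX, max_z = -FLT_MAX;
  // Each thread reduces its own chunk; OpenMP merges the partial results.
  // Requires OpenMP 3.1+ for min/max reductions; the pragma is simply
  // ignored when compiled without -fopenmp.
  #pragma omp parallel for reduction(min : min_x, min_y, min_z) \
                           reduction(max : max_x, max_y, max_z)
  for (long i = 0; i < static_cast<long>(centers.size()); ++i) {
    min_x = std::min(min_x, centers[i].x);
    min_y = std::min(min_y, centers[i].y);
    min_z = std::min(min_z, centers[i].z);
    max_x = std::max(max_x, centers[i].x);
    max_y = std::max(max_y, centers[i].y);
    max_z = std::max(max_z, centers[i].z);
  }
  out_min = {min_x, min_y, min_z};
  out_max = {max_x, max_y, max_z};
}
```

The second pass (encoding the codes themselves) is embarrassingly parallel, since each code depends only on one primitive and the shared bounds.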

Fast Mode

The curve algorithm currently makes two passes. This is required because the minimum and maximum boundaries of the center points have to be calculated first. Sometimes, though, callers might already have the scene min and max available. Callers should be able to pass these values to the curve function so that only one pass has to occur. The resolution of the Morton codes would probably go down slightly, but the performance would be much better.
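A minimal sketch of what that single-pass "fast mode" might look like, under the assumption of a caller-supplied bounds signature (`Vec3`, `quantize`, and `morton_codes` are illustrative names, not the PR's actual API):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical sketch of the proposed "fast mode": the caller passes the
// scene min/max, so the bounds-finding scan is skipped and only one pass
// runs. These names are illustrative, not the library's actual API.
struct Vec3 { float x, y, z; };

// Standard bit-interleaving helper (10 bits per axis).
inline std::uint32_t expand_bits(std::uint32_t v) {
  v = (v * 0x00010001u) & 0xFF0000FFu;
  v = (v * 0x00000101u) & 0x0F00F00Fu;
  v = (v * 0x00000011u) & 0xC30C30C3u;
  v = (v * 0x00000005u) & 0x49249249u;
  return v;
}

// Map a coordinate into the integer range [0, 1023] using the given bounds.
inline std::uint32_t quantize(float v, float lo, float hi) {
  float t = (hi > lo) ? (v - lo) / (hi - lo) : 0.0f;
  float scaled = std::min(std::max(t * 1024.0f, 0.0f), 1023.0f);
  return static_cast<std::uint32_t>(scaled);
}

// Single pass over the centers: no bounds computation needed.
std::vector<std::uint32_t> morton_codes(const std::vector<Vec3>& centers,
                                        const Vec3& lo, const Vec3& hi) {
  std::vector<std::uint32_t> codes(centers.size());
  for (std::size_t i = 0; i < centers.size(); ++i) {
    codes[i] = (expand_bits(quantize(centers[i].x, lo.x, hi.x)) << 2) |
               (expand_bits(quantize(centers[i].y, lo.y, hi.y)) << 1) |
               expand_bits(quantize(centers[i].z, lo.z, hi.z));
  }
  return codes;
}
```

Because caller-supplied scene bounds can be looser than the exact bounds of the center points, fewer of the quantization buckets get used, which is where the slight resolution loss comes from.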

ghost commented 4 years ago

Upon evaluating this a bit more, I've found that it's not easily distributed to more than one block on a GPU. While it scales on CPU architectures, it doesn't on GPU architectures. Please hold off on merging until I make a follow-up commit addressing the GPU issues.

The problem is that the first pass requires iterating across the whole list of primitives. This doesn't scale for calls that only process a subset of the primitives. What will change is that the "fast mode" I described earlier will become the way it's implemented. High-quality Morton codes can still be achieved, as they were in the first commit, by doing the first pass in a separate "pre-processing" stage.

Consequently, the next commit will also have the benchmark I've written for CPU and CUDA targets. This was something I had to write in order to verify that auto-vectorization was taking place and that the CUDA code was effective.

As it currently stands, here are the CPU benchmarks for single threaded calls to generate a Morton curve for a million randomly generated primitives (GCC):

| Float Type | Seconds to Completion |
| ---------- | --------------------- |
| float-32 | 0.0126 |
| float-64 | 0.0338 |
| float-80 | 0.1717 |

Clearly, there's something wrong with the float-128 code. It should be completing in approximately 2 × 0.0338 = 0.0676 seconds. This may be something I also address in the next commit. (This turned out to be a misunderstanding on my part: there's no 128-bit float on x86 hardware. The long double is an 80-bit type and has no SIMD instruction set.)

Furthermore, it turns out that GCC was not vectorizing the loops for the Morton curve at all. GCC can't vectorize them because it doesn't support vectorizing floating-point-to-integer conversions, as indicated by -fopt-info-missed-vec. Fortunately, Clang does. Here are the updated CPU benchmarks for the next commit, for both Clang and GCC:

| Compiler | Float Type | Seconds to Completion |
| -------- | ---------- | --------------------- |
| GCC | float-32 | 0.0074 |
| GCC | float-64 | 0.0239 |
| GCC | float-80 | 0.1591 |
| Clang | float-32 | 0.0021 |
| Clang | float-64 | 0.0036 |
| Clang | float-80 | 0.0165 |

The GCC numbers are faster than in the first benchmark because I also rewrote the curve function to be more cache friendly.

I'll push the new commit once I'm done fixing the CUDA benchmark.

ghost commented 4 years ago

In the last couple of commits, I've made the following changes:

I believe that sums it up. There might be more optimization opportunities with the Morton curve that we can work on later, but for now I'll be focusing on the next PR for the LBVH algorithm.

ghost commented 4 years ago

Closing this PR. In hindsight, this doesn't really bring anything valuable to the library (yet). The LBVH build algorithm is coming along, and I'll probably have a PR for it at some point this weekend. That would be a better point at which to merge, since it would bring usable features to the library and would include everything in this PR as well.