Closed ghost closed 4 years ago
Upon evaluating this a bit more, I've found that it's not easily distributed to more than one block on a GPU. While it is scalable on CPU architectures, it's not on GPU architectures. Please hold off on merging until I make a follow-up commit addressing the GPU issues.
The problem is that the first pass requires iterating across the whole list of primitives. This does not scale for calls that only process a certain subset of primitives. What will change is that the "fast mode" I described earlier will become the way it's implemented. High quality Morton codes can still be achieved, as they were in the first commit, by running the first pass in a separate "pre-processing" stage.
Consequently, the next commit will also have the benchmark I've written for CPU and CUDA targets. This was something I had to write in order to make sure that auto-vectorization was taking place and that the CUDA code was effective.
As it currently stands, here are the CPU benchmarks for single threaded calls to generate a Morton curve for a million randomly generated primitives (GCC):
| Float Type | Seconds to Completion |
| --- | --- |
| float-32 | 0.0126 |
| float-64 | 0.0338 |
| float-80 | 0.1717 |
Clearly, there's something wrong with the `long double` code. It should be performing at approximately `2 * 0.0338`, which would be `0.0676`. This may be something I also address in the next commit.

This was a misunderstanding on my part. There's no 128-bit float on x86 hardware; `long double` is an 80-bit type and has no SIMD instruction set.
Furthermore, it turns out that GCC was not vectorizing the loops for the Morton curve. GCC can't vectorize them because it doesn't support vectorized floating point to integer conversions, as indicated by `-fopt-info-missed-vec`. Fortunately, Clang does. Here are the more recent CPU benchmarks for the next commit, for both Clang and GCC.
| Compiler | Float Type | Seconds to Completion |
| --- | --- | --- |
| GCC | float-32 | 0.0074 |
| GCC | float-64 | 0.0239 |
| GCC | float-80 | 0.1591 |
| Clang | float-32 | 0.0021 |
| Clang | float-64 | 0.0036 |
| Clang | float-80 | 0.0165 |
The GCC benchmark is faster than the first benchmark because I also rewrote the curve function to be more cache friendly.
I'll push the new commit once I'm done fixing the CUDA benchmark.
In the last couple of commits, I've made the following changes:
- `disasm/`

I believe that sums it up. There might be more optimization opportunities with the Morton curve that we can work on later on, but for right now I'll be focusing on the next PR for the LBVH algorithm.
Closing this PR. In hindsight, this doesn't really bring anything valuable to the library (yet). The LBVH build algorithm is coming along and I'll probably have a PR for it at some point this weekend. That would probably be a better point at which to merge, since it would bring usable features to the library and would include everything in this PR as well.
This PR introduces a Morton curve class. It can be considered a step forward in addressing item 4 from #14, the LBVH build algorithm.
A Morton curve is a space-filling curve that's useful in the construction of LBVHs. It creates a series of "Morton codes" that encode the X, Y, and Z values of each primitive's center point. The array returned by `MortonCurve` can be used to sort the primitives (or indices to the primitives) so that they appear close together from start to finish. Here's an illustration:

*This example begins at the origin and ends in the upper-left quadrant.*
All three floating point types are supported, in that each provides a level of accuracy corresponding to its floating point precision. The accuracy determines how closely the Morton code can pinpoint the center of a primitive. Here's the breakdown of their accuracy:
What's currently missing from this PR is:
Benchmarking
There needs to be a benchmark for this, using all types. This will help determine whether or not the loops are being vectorized. It should be that:

- `float` is twice as fast as `double`, and four times as fast as `long double`
- `double` is twice as fast as `long double`
There may be subtle performance differences due to caching effects.
The benchmark will ideally be a random set of generated primitives, preferably on the order of a million.
I'm holding off on this until I figure out how the algorithm can be done in parallel.
Parallelism
If you look at the algorithm, it looks extremely easy to write in parallel. It performs two passes over the primitives, in two different loops. The only challenge here is that I'd like to support CUDA as well as standard CPU targets. This may end up being separate functions for OpenMP and CUDA, or it might end up being something different. We'll see!
Fast Mode
There are currently two passes in the curve algorithm. This is required because the minimum and maximum boundaries of the center points have to be calculated. Sometimes, though, callers might already have the scene min and max available. These values should be able to be passed to the curve function so that the scene min and max can be used directly and only one pass has to occur. The resolution of the Morton codes would probably drop slightly, but the performance would be much better.