My default response is typically against coding for performance improvements, since we want readers to implement their own code, and so we strive for a simple, easy-to-understand reference implementation. If I really want to do this, then I'm inclined to backlog items like this for future releases, as we're desperately trying to get v4 out the door.
But darn it, you present such a good case for this, and in the end I completely agree with you that your proposal is much simpler to understand, with large performance benefits as a free side effect. We gotta do this. Thank you very much for putting so much care and effort into your proposal.
How would you like to proceed? Would you like to craft this and create a PR? No worries at all if you'd rather we just do it.
By the way, if you wanted to tackle this PR, another possible simplification/optimization is to assume that the hittable object lifetime is greater than the BVH construction, and use raw pointers instead of `shared_ptr`.
While we're digging in this area of the code, see also #1328, something we might do at the same time.
Hi Steve!
Thanks a lot for your reply and your consideration! I am glad if the proposal finds interest! :smile:
> If I really want to do this, then I'm inclined to backlog items like this for future releases, as we're desperately trying to get v4 out the door.
Backlogging is totally fine with me, there is really no rush. If you guys are busy with getting the current release out, don't worry about it. There is always time to revisit it at a later point. I just filed the issue so it is at least written down somewhere "officially" where it can be tracked.
> By the way, if you wanted to tackle this PR, another possible simplification/optimization is to assume that the hittable object lifetime is greater than the BVH construction, and use raw pointers instead of `shared_ptr`.
I tried to have a look at your suggestions today and ran a few tests using the final scene of Book Nr.2 as a "test bed". It fits quite well because there is this group of white spheres generated randomly where you can easily change the number of generated hittables:
So to get a rough estimate of how the different bvh splitting methods behave, I varied the number of hittables and wrote down how long it takes to build the bvh.
As a disclaimer, I am not a hardcore optimization guy, as I don't have a lot of C++ experience. In fact, I am not a big fan of code that is as optimized as possible but at the same time super hard to understand, at least in hobby projects. So I am not 100% sure if I am the right guy for this. :smile: The tests here are not done in a very scientific way. :smile: But maybe we can get a few rough numbers, even though I totally agree that optimization should not be a top priority for this tutorial series at all. But anything we can get "for free" without sacrificing readability, I think we should totally address.
I ran 4 different implementations:
Findings in a nutshell (and my "layman" interpretations :smile:):

- The `std::nth_element` algorithm as the splitting method from issue #1328 in fact brings bvh construction time down, but it needs to be acknowledged here that this algorithm requires the C++17 compiler flag, whereas the current repo is set to use C++11.
- The `std::nth_element` method here also decreases bvh build time even further, but again would need C++17.

I created two pull requests (one for Nr. 3 and one for Nr. 4):
Please note: #1392 would also require a change in the compilation flags to include C++17 (I did not touch these).
If you decide to go with Nr. 2, it would be literally just one changed line of code so I don't think you would need a pull request for that, right?
> By the way, if you wanted to tackle this PR, another possible simplification/optimization is to assume that the hittable object lifetime is greater than the BVH construction, and use raw pointers instead of `shared_ptr`.
You have a really good point with the raw pointer approach! But to be honest, I haven't had the time to fully test this anywhere yet, as it would possibly require some bigger changes in more places (bvh class members, function arguments in comparators, etc.). What I find interesting: according to this post on Stack Overflow, shared_ptrs don't seem to have a time overhead in dereferencing (i.e. in getting the reference to the owned object), so we might not gain much during the actual render, but possibly a bit during the bvh build? On the other hand, raw pointers might come a bit at the cost of readability / "ease of use" in the code, or a higher entry barrier for C++ intermediates? And be a bigger pain to debug? So I am a bit torn about really going down that route. I think it would require some testing to see if it's worth it.
I can see if I have some time in the future to look into it further, but I think in the meantime we already have two really good alternatives with Nr. 3 and Nr. 4 for improved speed and ease of use without sacrificing readability, should we run into any issues with the current bvh implementation.
Of course, while also simplifying the overall bvh build process, which should be the primary goal of this operation. :smile:
Or what do you think?
@relief7, That's phenomenal! Thank you for such a detailed explanation and PRs.
I've played a bit with your implementations, and even went a bit further: in C++17 we also have parallel versions of the algorithms in question, so I've tried to enable them as well.
```diff
- std::nth_element(objects.begin(), objects.begin() + mid, objects.end(), comparator);
+ std::nth_element(std::execution::par, objects.begin(), objects.begin() + mid, objects.end(), comparator);
```
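For anyone who wants to reproduce this: as far as I understand, the parallel overloads live in the `<execution>` header, and with GCC the C++17 parallel algorithms are built on top of TBB, so the program additionally needs to be linked with `-ltbb`. A minimal self-contained example of the idea:

```cpp
#include <algorithm>   // std::nth_element
#include <execution>   // std::execution::par (C++17)
#include <vector>

int main() {
    std::vector<int> v{5, 3, 1, 4, 2};
    // Parallel partial sort: only the median element ends up in its final
    // position, with smaller values before it and larger values after it.
    std::nth_element(std::execution::par, v.begin(), v.begin() + v.size() / 2, v.end());
    return 0;
}
```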
The ray-tracing framework is still more or less usable for the 500'000 spheres in the box, using both of your implementations, Nr. 3 and Nr. 4.
The implementation Nr. 3 is approximately twice as slow. The rough BVH construction times (in seconds) for 500'000 spheres on my machine are:
std::sort "divide and conquer" = 5s
std::sort "divide and conquer" parallel = 5s
std::nth_element "divide and conquer" = 3s
std::nth_element "divide and conquer" parallel = 3s
std::ranges::nth_element non-parallel "divide and conquer" from ranges = 4s (from C++20)
The parallel versions of the algorithms don't seem to add any improvement (maybe just because I don't know how to use them properly).
Wow, this is pretty cool! I didn't know that these algorithms had parallel execution support. Thanks for bringing that up! Looks like it is also a C++ 17 feature. Not sure what platform you are on, but after some digging, I found this for my platform (Win64, MinGW, g++):
https://gcc.gnu.org/onlinedocs/libstdc++/manual/parallel_mode_using.html
It says you need to turn on these compiler flags:
- `-D_GLIBCXX_PARALLEL` to enable parallel execution
- `-fopenmp` to enable multiprocessing via OpenMP

Are you using these already in your tests? Could be different if you are using MSVC or if you are on Mac etc.
When testing this with 100,000 objects, I get these results:
- With the full sort (`sort()`) and parallel execution enabled, time is further reduced from 1.90 sec to 1.08 sec.
- With the partial sort (`nth_element()`), time is of course faster than the full sort (~0.75 sec) but, like you, I get similar times between parallel and non-parallel execution, and the parallel version has no real speed gain.

So I guess the gist is that it depends a lot on the algorithm whether you gain anything from using the parallel versions. Maybe the overhead for multi-threading in this case (administrating the threads, packaging tasks, copying smart pointers) does not buy enough time to make it worth it. Possibly other algorithms than `nth_element` are a better fit for parallelization? If any optimization experts are here and know better, please correct me if I am wrong.
Btw, other compilation flags worth looking into are optimization flags like `-O3`, if you are not already using them anyway. These should help a lot with overall speed.
Nevertheless, I assume all these further optimizations are going way beyond the tutorial (and also this github issue), so I would leave it at that. Things like compiler optimizations are very specific and always depend on the reader's platform / compiler / c++ version etc. so it's hard to find a common denominator.
With a bvh build time of a few seconds and such a simple implementation, I would already be more than happy. I would then rather look at how to parallelize the render itself or make it faster, as the tracing takes the bigger chunk of render time. I am more interested in understanding all the core concepts first (still need to do book nr.3!) than in trying to squeeze every possible millisecond out of it.
@relief7 Markus, thanks for your prompt response.
I'm using Linux with GCC and Windows with MSVC. Yep, in general, parallel execution works pretty well. I'm using it for the main ray tracing loop in this way, and the gain is as huge as with your divide-and-conquer approach for the BVH.
```cpp
std::for_each(std::execution::par, image.begin(), image.end(), [&](vec2 &pixel) {
    ...
});
```
But I write the pixel colours to memory and save the image using STB Image; with a PPM file this approach won't work, due to the inability to write to the exact file location.
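In case it is useful for anyone, a rough sketch of what I mean by writing to memory first (the buffer layout and the `image_width`/`image_height` names are just assumptions for illustration):

```cpp
// Sketch: each parallel task writes into its own slot of a preallocated RGB
// buffer (index of pixel (x, y) is 3 * (y * image_width + x)), so no two
// tasks touch the same bytes. The finished buffer is saved once at the end.
#define STB_IMAGE_WRITE_IMPLEMENTATION
#include "stb_image_write.h"

#include <cstdint>
#include <vector>

void save_image(int image_width, int image_height,
                const std::vector<std::uint8_t>& framebuffer) {
    // framebuffer holds 3 bytes (R, G, B) per pixel, row by row.
    stbi_write_png("image.png", image_width, image_height, 3,
                   framebuffer.data(), image_width * 3);
}
```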
Anyway, that's not my point. Parallel execution lies way beyond the scope of the books. I just wanted to express my gratitude to you and say that neither of your contributions was lost in the backlog, and that some people are digging into closed PRs, reaching for pearls.
Cheers!
Hi! Based on Arman Uguray's / @armansito's suggestions in pull request #1391, I made some further tests.
To quote his reply, he suggested to do the following changes:
> As an alternative, if we really wanted to eliminate all the redundant copies and vector allocations, we could allow the BVH construction to sort the object list in-place without creating any intermediate vectors. The code would look very similar to the original listing but instead remove the `const` qualifier from the `src_objects` declaration. If preserving the original order of the input objects is important, the input could be copied once and the in-place sort could operate over that initial copy (but I'm not sure if that's really necessary).
So the changes are:
- Remove the `const` qualifier for the objects array, therefore the `bvh_node` constructor can and will change the array:
  - old: `bvh_node(const std::vector<shared_ptr<hittable>>& objects, size_t start, size_t end)`
  - new: `bvh_node(std::vector<shared_ptr<hittable>>& objects, size_t start, size_t end)`
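To make the shape of the change concrete, here is a sketch of how the in-place variant could look (my reconstruction, not the exact PR code; the comparator selection and the bounding-box handling follow the book's conventions and may differ in your version):

```cpp
// The constructor sorts the caller's vector directly within [start, end),
// so no intermediate vectors are created during the recursion.
bvh_node(std::vector<shared_ptr<hittable>>& objects, size_t start, size_t end) {
    // Pick a random axis and the matching comparator, as in the book.
    int axis = random_int(0, 2);
    auto comparator = (axis == 0) ? box_x_compare
                    : (axis == 1) ? box_y_compare
                                  : box_z_compare;

    size_t object_span = end - start;

    if (object_span == 1) {
        left = right = objects[start];
    } else if (object_span == 2) {
        left  = objects[start];
        right = objects[start + 1];
    } else {
        // Sort only the sub-range this node is responsible for, in place.
        std::sort(objects.begin() + start, objects.begin() + end, comparator);

        auto mid = start + object_span / 2;
        left  = make_shared<bvh_node>(objects, start, mid);
        right = make_shared<bvh_node>(objects, mid, end);
    }

    bbox = aabb(left->bounding_box(), right->bounding_box());
}
```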
Some rudimentary speed tests with this "in-place sorting":
Bvh construction for 10000 white spheres took: 0.115 seconds.
Bvh construction for 20000 white spheres took: 0.27726 seconds.
3 runs for 100,000 objects
Bvh construction for 100000 white spheres took: 1.84266 seconds.
Bvh construction for 100000 white spheres took: 1.78362 seconds.
Bvh construction for 100000 white spheres took: 1.7139 seconds.
500,000 objects
Bvh construction for 500000 white spheres took: 12.6192 seconds.
1,000,000 objects
Bvh construction for 1000000 white spheres took: 26.7226 seconds.
So compared to the current code in the book, we again gain a lot of speed, roughly ~65 times faster for 20,000 objects.
In terms of performance, it also seems to be faster than my previously proposed "divide & conquer" method.
Overall, I feel like @armansito's suggestion is probably the way to go because:
I also tried the changes suggested by you, Steve @hollasch:

- Use `for (auto& obj : objects)` instead of `for (size_t object_index=0; object_index < size; object_index++)`: I tried this, but speed tanked by a factor of 20. Not sure what is causing this, but since it made everything so much slower, I left it out for now. I even tried `for (const auto& obj ...` but same problem. Feel free to revisit on your side, I could not get it to work with decent speed.
- Move the `size` definition down --> done!

I will go ahead and create a new pull request with this method, hopefully everybody is happy!
`bvh_node` should definitely not alter the input list of objects, as that would invalidate any existing references to the hittable list.

Note that my earlier suggestion to use raw pointers created a copy of the input's `shared_ptr` values, with the understanding that the `bvh_node`'s lifetime should be shorter than or equal to the lifetime of the input hittable list. This addresses the problem of invalidating the caller's hittable list references, at the cost of a single array copy.
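If it helps to picture it, here is a rough sketch of that idea (purely illustrative, not code from the repo; `hittable` is the book's base type, and `build_tree` is a hypothetical helper):

```cpp
#include <memory>
#include <vector>

using std::shared_ptr;

class hittable;  // the book's abstract base type (assumed available)

// Sketch: copy the caller's shared_ptr values once so the BVH keeps the
// objects alive, then build the tree over plain hittable* pointers so the
// recursion involves no reference-count traffic.
class bvh_tree {
  public:
    explicit bvh_tree(const std::vector<shared_ptr<hittable>>& src_objects)
      : objects(src_objects)               // the single array copy
    {
        std::vector<hittable*> raw;
        raw.reserve(objects.size());
        for (const auto& obj : objects)
            raw.push_back(obj.get());      // non-owning pointers into 'objects'
        // build_tree(raw) would recurse over hittable* from here on
        // (hypothetical helper, shown only to indicate the structure).
    }

  private:
    std::vector<shared_ptr<hittable>> objects;  // owns the object lifetimes
};
```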
Done. See also #1416.
Hi!
First of all, I wanted to say a huge thank you to Peter Shirley for creating this fantastic tutorial series. It is amazing how intuitive and simple things are in his explanations, and it is so much fun to follow. It is also fascinating what amazing results you get with so little code. I have been reading the series with great enthusiasm and have already learned so much from it!
Also thank you to everyone here on github for all the maintenance work and valuable improvements to the code and the documentation. I can imagine it is hard work to keep everything updated and in sync so it is really greatly appreciated! :+1:
The reason why I am writing ... (I actually just signed up at github to try to comment here :smile:)
I have followed part 1 and 2 of the series, then spent some time adding a few additional features on my own (such as triangle mesh hittables, a different material, etc.). Then my goal was to render the Stanford dragon test model. It is a triangle mesh with roughly 900k triangles, and that is when I suddenly hit a pretty severe bottleneck in the code: the bvh construction took so long that it was impossible to even start rendering any pixels.
The problem seems to be that in every recursion step of the bvh construction, we make a copy of the entire vector of hittable pointers, even though we only need to process a handful of those objects in the actual recursion step. So in my example, the pointers for all 900k triangles were copied in every split, which led to the bvh construction taking forever (I cancelled it after a couple of hours).
After taking a closer look at the bvh construction code, I believe there is a way of optimizing the current code while at the same time making it easier to understand for beginners. This is why I am writing here, because I think there is an actual benefit in reviewing the current bvh code and making some changes. It's nothing crazy, only changing a few lines, and you gain not only a huge speedup for potentially large numbers of hittables, but the code also becomes a bit more intuitive (in my humble opinion). In the case of my dragon model, the bvh construction was reduced from many hours to ~10 seconds, yay! :smile:
The proposed change also eliminates the tracking of the start and end indices and divides the input array before each recursion level. This means that only the objects that are really needed are passed to the next level, in a true "divide and conquer" fashion. This avoids copying data that is not really needed; the array gets smaller and smaller on each level, and the speedup is huge.
As an example, let's say you have 24 objects in your scene. In the current code, all 24 objects get passed down in every step down to the leaf construction level:
In the proposed new code, we would process 24, then split the array, then process 2x12, split, then process 4x6, split, process 8x3 etc.
This is also closer to what the bvh actually is - a space partitioning function - not an indexing-into-an-array function. (It would be different if we were indexing into the array for speed purposes, but we are not doing this here anyway, as we are copying the vector over and over.)
Of course this is all not really noticeable with small object counts but for large numbers, this change is huge.
Here is the proposed bvh_node function along with some comments:
As you can see, the function call is simplified and does not need the `size_t start` and `size_t end` arguments anymore. We just hand over the array of objects and let it do its thing.

We are still copying the input array, but since it is getting halved in each recursion level, it is not too big of a deal anymore.

We can eliminate `object_span` as well, because we are dealing with the whole array. So it just becomes `size`.

We don't need `start` and `end` indices anymore to read from the array...

...and splitting the array for the next recursion into `objectsLeft` and `objectsRight` is super easy, thanks to the `.begin()` and `.end()` iterators.

Now this is the critical part: we don't call the `bvh_node()` function again with the whole `objects` array, as it is done currently...

...but instead, each recursion call is only with half of the original objects (`objectsLeft` and `objectsRight`).

In the next recursion level we will then operate on the full `objectsLeft` and `objectsRight`, split each of them in half, ... and so on.

In my humble opinion, this is super easy to understand, in fact even easier than the original function and keeping track of indices.
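Putting the steps above together, here is a sketch of the whole proposed constructor (my consolidated reconstruction rather than the exact PR listing; the comparator selection, `aabb`, and the bounding-box calls follow the book's conventions and may differ slightly in your version):

```cpp
// Divide-and-conquer construction: each recursion level receives only the
// objects it actually needs, so the copied vectors shrink by half each level.
bvh_node(std::vector<shared_ptr<hittable>> objects) {  // taken by value: each call works on its own copy
    // Pick a random axis and the matching comparator, as in the book.
    int axis = random_int(0, 2);
    auto comparator = (axis == 0) ? box_x_compare
                    : (axis == 1) ? box_y_compare
                                  : box_z_compare;

    size_t size = objects.size();

    if (size == 1) {
        left = right = objects[0];
    } else if (size == 2) {
        left  = objects[0];
        right = objects[1];
    } else {
        std::sort(objects.begin(), objects.end(), comparator);

        // Split the sorted array into two halves using the iterators...
        auto mid = objects.begin() + size / 2;
        std::vector<shared_ptr<hittable>> objectsLeft(objects.begin(), mid);
        std::vector<shared_ptr<hittable>> objectsRight(mid, objects.end());

        // ...and recurse with only the half each child actually needs.
        left  = make_shared<bvh_node>(objectsLeft);
        right = make_shared<bvh_node>(objectsRight);
    }

    bbox = aabb(left->bounding_box(), right->bounding_box());
}
```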
So you could argue that the tutorial code is not meant for rendering thousands of objects anyways and there is no real speed gain for the scenes here. But I still believe that this change is worth looking at because:
If there is any interest from your side in reviewing this, please let me know! From my side, it's ready to go!
Thank you for your consideration and happy rendering everyone!
P.S.: And here is my dragon, thanks to you guys!