RayTracing / raytracing.github.io

Main Web Site (Online Books)
https://raytracing.github.io/

Faster and simpler BVH construction #1388

Closed. relief7 closed this issue 8 months ago

relief7 commented 9 months ago

Hi!

First of all, I wanted to say a huge thank you to Peter Shirley for creating this fantastic tutorial series. It is amazing how intuitive and simple things are in his explanations, and it is so much fun to follow. It is also fascinating what amazing results you get with so little code. I have been reading the series with great enthusiasm and have already learned so much from it!

Also thank you to everyone here on github for all the maintenance work and valuable improvements to the code and the documentation. I can imagine it is hard work to keep everything updated and in sync so it is really greatly appreciated! :+1:

The reason why I am writing ... (I actually just signed up at GitHub to be able to comment here :smile:)

I have followed parts 1 and 2 of the series, then spent some time adding a few additional features on my own (such as triangle mesh hittables, a different material etc.). My next goal was to render the Stanford dragon test model. It is a triangle mesh with roughly 900k triangles, and that is when I suddenly hit a pretty severe bottleneck in the code: the BVH construction took so long that it was impossible to even start rendering any pixels.

The problem seems to be that in every recursion step of the BVH construction, we make a copy of the entire vector of hittable pointers, even though we only need to process a handful of those objects in that step. So in my example, the pointers for all 900k triangles are copied at every split, which led to the BVH construction taking forever (I cancelled it after a couple of hours).
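
For reference, the relevant part of the current constructor looks roughly like this (paraphrased from the book's listing, so details may differ slightly); the full-vector copy on the first line is the culprit:

bvh_node(const std::vector<shared_ptr<hittable>>& src_objects, size_t start, size_t end) {
    auto objects = src_objects; // copies ALL scene objects, at EVERY recursion level

    // ... choose axis, sort, find mid ...

    left = make_shared<bvh_node>(objects, start, mid);   // and the next level copies them all again
    right = make_shared<bvh_node>(objects, mid, end);
}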

After taking a closer look at the BVH construction code, I believe there is a way to optimize it while at the same time making it easier for beginners to understand. This is why I am writing here: I think there is a real benefit in reviewing the current BVH code and making some changes. It's nothing crazy, only a few lines change, and you gain not only a huge speedup for potentially large numbers of hittables but also code that is a bit more intuitive (in my humble opinion). In the case of my dragon model, the BVH construction went from many hours to ~10 seconds, yay! :smile:

The proposed change also eliminates the tracking of start and end indices and instead divides the input array before each recursion level, so only the objects that are actually needed are passed down to the next level, in true "divide and conquer" fashion. This avoids copying data that is never used: the array gets smaller and smaller at each level, and the speedup is huge.

As an example, let's say you have 24 objects in your scene. In the current code, all 24 objects get passed down at every step, all the way down to the leaf construction level:

[Diagram bvh_01: current method, all 24 objects are passed down at every level]

In the proposed new code, we would process 24 objects, then split the array, then process 2x12, split, then process 4x6, split, then 8x3, and so on.

[Diagram bvh_02: proposed method, the array is split in half before each recursion]

This is also closer to what the BVH actually is: a space-partitioning structure, not a function that indexes into an array. (It would be different if we were indexing into the array for speed, but we are not doing that here anyway, since we copy the vector over and over.)

Of course none of this is really noticeable with small object counts, but for large numbers the difference is huge.

Here is the proposed bvh_node constructor along with some comments:


// it is no longer necessary to pass indices as arguments
// because in each recursion step, we are processing the entire given src_objects vector.
bvh_node(const std::vector<shared_ptr<hittable>>& src_objects) {
    auto objects = src_objects; // Create a modifiable array of the source scene objects

    int axis = random_int(0,2);
    auto comparator = (axis == 0) ? box_x_compare : (axis == 1) ? box_y_compare : box_z_compare;

    // how many objects are in the given array?
    size_t size = objects.size();

    if (size == 1) {
        left = right = objects[0];
    } else if (size == 2) {
        if (comparator(objects[0], objects[1])) {
            left = objects[0];
            right = objects[1];
        } else {
            left = objects[1];
            right = objects[0];
        }
    } else {
        std::sort(objects.begin(), objects.end(), comparator);

        // where do we need to split?
        auto mid = size/2;

        // true "divide and conquer" method: split objects vector into two new vectors
        // left:    0 --> mid - 1
        // right: mid --> end

        auto objectsLeft = std::vector<shared_ptr<hittable>>(objects.begin(), objects.begin()+mid);
        auto objectsRight = std::vector<shared_ptr<hittable>>(objects.begin()+mid, objects.end());

        // only pass the necessary objects into next recursion step,
        // with each step, the objects vector is getting smaller and therefore faster to process
        left = make_shared<bvh_node>(objectsLeft);
        right = make_shared<bvh_node>(objectsRight);
    }

    bbox = aabb(left->bounding_box(), right->bounding_box());
}

As you can see, the constructor is simplified and no longer needs the size_t start and size_t end arguments. We just hand over the array of objects and let it do its thing:

bvh_node(const std::vector<shared_ptr<hittable>>& src_objects)

We are still copying the input array but since it is getting halved in each recursion level, it is not too big of a deal anymore:

auto objects = src_objects; // Create a modifiable array of the source scene objects

We can eliminate object_span as well because we are dealing with the whole array. So it just becomes size:

size_t size = objects.size();

We don't need start and end indices anymore to read from the array...

left = objects[1];
right = objects[0];

...and splitting the array for the next recursion into objectsLeft and objectsRight is super easy, thanks to the .begin() and .end() iterators:

auto objectsLeft = std::vector<shared_ptr<hittable>>(objects.begin(), objects.begin()+mid);
auto objectsRight = std::vector<shared_ptr<hittable>>(objects.begin()+mid, objects.end());

Now comes the critical part: we don't call the bvh_node() constructor again with the whole objects array, as the current code does...

left = make_shared<bvh_node>(objects, start, mid);
right = make_shared<bvh_node>(objects, mid, end);

... but instead, each recursion call is only with half of the original objects (objectsLeft and objectsRight):

left = make_shared<bvh_node>(objectsLeft);
right = make_shared<bvh_node>(objectsRight);

In the next recursion level we will then operate on the full objectsLeft and objectsRight, split each of them in half, ... and so on.

In my humble opinion, this is super easy to understand, in fact even easier than the original function with its index bookkeeping.

So you could argue that the tutorial code is not meant for rendering thousands of objects anyway and there is no real speed gain for the scenes here. But I still believe that this change is worth looking at because:

  1. The code becomes overall simpler
  2. People who actually do decide to take it further won't hit this roadblock and can focus on more "fun" things like new shapes / materials / textures / lights etc. :smile:

If there is any interest from your side in reviewing this, please let me know! From my side, it's ready to go!

Thank you for your consideration and happy rendering everyone!

P.S.: And here is my dragon, thanks to you guys! 😄

[Image: ACES Dragon20k render]

hollasch commented 8 months ago

My default response is typically against coding for performance improvements, since we want readers to implement their own code, and so we strive for a simple, easy-to-understand reference implementation. If I really want to do this, then I'm inclined to backlog items like this for future releases, as we're desperately trying to get v4 out the door.

But darn it, you present such a good case for this, and in the end I completely agree with you that your proposal is much simpler to understand, with large performance benefits as a free side effect. We gotta do this. Thank you very much for putting so much care and effort into your proposal.

How would you like to proceed? Would you like to craft this and create a PR? No worries at all if you'd rather we just do it.

hollasch commented 8 months ago

By the way, if you wanted to tackle this PR, another possible simplification/optimization is to assume that the hittable object lifetime is greater than the BVH construction, and use raw pointers instead of shared_ptr.
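
Roughly the idea, as a sketch only (this is not code from the books, and the names are just illustrative): keep the caller's shared_ptrs alive for the whole build and let the construction work on plain pointers.

// Sketch: copy out the raw pointers once; they stay valid as long as the
// original hittable list outlives the BVH construction.
std::vector<const hittable*> raw_objects;
raw_objects.reserve(src_objects.size());
for (const auto& object : src_objects)
    raw_objects.push_back(object.get());   // no reference-count traffic from here on

// sorting and the box comparators would then take const hittable* instead of shared_ptr<hittable>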

hollasch commented 8 months ago

While we're digging in this area of the code, see also #1328, something we might do at the same time.

relief7 commented 8 months ago

Hi Steve!

Thanks a lot for your reply and your consideration! I am glad if the proposal finds interest! :smile:

If I really want to do this, then I'm inclined to backlog items like this for future releases, as we're desperately trying to get v4 out the door.

Backlogging is totally fine with me; there is really no rush. If you guys are busy with getting the current release out, don't worry about it. There is always time to revisit it at a later point. I just filed the issue so it is at least written down somewhere "officially" where it can be tracked.

By the way, if you wanted to tackle this PR, another possible simplification/optimization is to assume that the hittable object lifetime is greater than the BVH construction, and use raw pointers instead of shared_ptr.

I tried to have a look at your suggestions today and ran a few tests using the final scene of Book Nr. 2 as a "test bed". It fits quite well because it contains a group of randomly generated white spheres where you can easily change the number of generated hittables:

[Image: benchmark test scene with the randomly generated white spheres]

So to get a rough estimate of how the different BVH splitting methods behave, I varied the hittable count and wrote down how long it takes to build the BVH.

As a disclaimer, I am not a hardcore optimization guy, as I don't have a lot of C++ experience. In fact, I am not a big fan of code that is as optimized as possible but at the same time super hard to understand, at least in hobby projects, so I am not 100% sure I am the right guy for this. :smile: The tests here are not done in a very scientific way. :smile: But maybe we can get a few rough numbers, even though I totally agree that optimization should not be a top priority for this tutorial series at all. Still, anything we can get "for free" without sacrificing readability I think we should totally address.

I ran 4 different implementations:

  1. current implementation with std::sort() (=dev branch)
  2. current implementation with std::nth_element() (from issue #1328)
  3. divide & conquer with std::sort() (=my previous post)
  4. divide & conquer with std::nth_element() (see the sketch right after this list)
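
For Nr. 2 and Nr. 4, the only real difference is the splitting call: instead of fully sorting, std::nth_element places the element that would land at position mid in a full sort there and partitions the remaining elements around it, which is all the split actually needs. A rough sketch (not the exact PR code), shown here for the divide-and-conquer variant:

// inside the "else" branch of the proposed constructor:
auto mid = size/2;
std::nth_element(objects.begin(), objects.begin() + mid, objects.end(), comparator);

auto objectsLeft = std::vector<shared_ptr<hittable>>(objects.begin(), objects.begin()+mid);
auto objectsRight = std::vector<shared_ptr<hittable>>(objects.begin()+mid, objects.end());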

[Image: BVH construction time benchmarks for the four implementations]

Findings in a nutshell (and my "layman" interpretations :smile:):

  1. In the current implementation, there doesn't seem to be a linear relationship between object count and the time needed to build the BVH. As you can see in the screenshot, when doubling the object count from 10,000 to 20,000 objects, the construction time roughly quadruples from ~4 sec to ~18 sec. No matter how basic our "intro-level" BVH should be, I personally still think we should do something about this, as it limits what people can do with it. People who want to play around with more objects in their own test scenes could find themselves in a situation where the BVH starts "choking" above roughly 30,000 hittables, which is not an uncommon count for even a simple triangle mesh.
  2. Using the std::nth_element algorithm as the splitting method from issue #1328 does bring the BVH construction time down, but it needs to be acknowledged that this requires the C++17 compiler flag, whereas the current repo is set to use C++11.
  3. The proposed divide-and-conquer method from my last post brings great build-speed improvements (20,000 objects are built ~65x faster than with the current implementation). I even went up to 100,000 objects and it still took only ~2 seconds (which is quite bearable IMHO). You can also see an almost linear relationship between BVH build time and object count, which is very favorable: larger scenes would not explode and would behave more predictably.
  4. Adding the std::nth_element method here decreases the BVH build time even further, but again would need C++17.

I created two pull requests (one for Nr. 3 and one for Nr. 4):

#1391

#1392

Please note: #1392 would also require change in compilation flags to include C++ 17 (I did not touch these).

If you decide to go with Nr. 2, it would be literally just one changed line of code so I don't think you would need a pull request for that, right?

By the way, if you wanted to tackle this PR, another possible simplification/optimization is to assume that the hittable object lifetime is greater than the BVH construction, and use raw pointers instead of shared_ptr.

You have a really good point with the raw pointer approach! But to be honest, I haven't had the time to fully test this yet, as it would possibly require some bigger changes in more places (bvh class members, function arguments in comparators etc.). What I find interesting: according to this post on Stack Overflow:

https://stackoverflow.com/questions/22295665/how-much-is-the-overhead-of-smart-pointers-compared-to-normal-pointers-in-c

shared_ptrs don't seem to have any time overhead when dereferencing (i.e. when getting a reference to the owned object), so we might not gain much during the actual render, but possibly a bit during the BVH build? On the other hand, they might come at a bit of a cost in readability / "ease of use" in the code, or a higher entry barrier for C++ intermediates? And be a bigger pain to debug? So I am a bit torn about really going down that route. I think it would require some testing to see if it's worth it.

I can see if I have some time in the future to look into it further, but I think in the meantime we already have two really good alternatives with Nr. 3 and Nr. 4 for improved speed and ease of use without sacrificing readability, should we run into any issues with the current BVH implementation. 😄

Of course, all while simplifying the overall BVH build process and keeping it simple, which should be the primary goal of this operation. 😄

Or what do you think?

d-k-ivanov commented 8 months ago

@relief7, That's phenomenal! Thank you for such a detailed explanation and PRs.

I've played a bit with your implementations, and even went slightly further. In C++17 we also have parallel versions of the algorithms in question, so I tried enabling those as well.

- std::nth_element(objects.begin(), objects.begin() + mid, objects.end(), comparator);
+ std::nth_element(std::execution::par, objects.begin(), objects.begin() + mid, objects.end(), comparator);
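
(One assumption worth noting for anyone who wants to try this: the execution-policy overloads are declared in a separate header, and depending on the toolchain they may need an extra library to actually run in parallel.)

#include <execution>   // std::execution::par lives here (C++17)
// On GCC/libstdc++ the parallel algorithms are typically backed by Intel TBB,
// so linking against it (e.g. -ltbb) may also be required.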

The ray-tracing framework is still more or less usable with 500,000 spheres in the box, using both of your implementations, Nr. 3 and Nr. 4.

[Screenshot: render of the 500,000-sphere scene]

Implementation Nr. 3 is approximately twice as slow. The rough BVH construction times (in seconds) for 500,000 spheres on my machine are:

std::sort "divide and conquer" = 5s
std::sort "divide and conquer" parallel = 5s
std::nth_element "divide and conquer" = 3s
std::nth_element "divide and conquer" parallel = 3s
std::ranges::nth_element "divide and conquer", non-parallel (C++20) = 4s

The parallel algorithms don't seem to add any improvement over the non-parallel ones. (Maybe just because I don't know how to use them properly 😅)

relief7 commented 8 months ago

Wow, this is pretty cool! I didn't know that these algorithms had parallel execution support. Thanks for bringing that up! Looks like it is also a C++ 17 feature. Not sure what platform you are on, but after some digging, I found this for my platform (Win64, MinGW, g++):

https://gcc.gnu.org/onlinedocs/libstdc++/manual/parallel_mode_using.html

It says you need to turn on these compiler flags:

Are you using these already in your tests? Could be different if you are using MSVC or if you are on Mac etc.

When testing this with 100,000 objects, I get these results:

So I guess the gist is that it depends a lot on the algorithm whether you gain anything from the parallel versions. Maybe the overhead of multi-threading in this case (administering the threads, packaging tasks, copying smart pointers) does not buy enough time to make it worth it. Possibly algorithms other than nth_element are a better fit for parallelization? If any optimization experts are here and know better, please correct me if I am wrong. 😄

Btw, other compilation flags worth looking into are optimization flags like -O3, if you are not using these already anyway. They should help a lot with overall speed.

Nevertheless, I assume all these further optimizations go way beyond the tutorial (and also this GitHub issue), so I would leave it at that. Things like compiler optimizations are very specific and always depend on the reader's platform / compiler / C++ version etc., so it's hard to find a common denominator.

With a BVH build time of a few seconds and such a simple implementation, I would already be more than happy, and I would then rather look at how to parallelize or speed up the render itself, as the tracing takes the bigger chunk of the render time. I am more interested in understanding all the core concepts first (still need to do book nr. 3!) than in trying to squeeze every possible millisecond out of it. 😄

d-k-ivanov commented 8 months ago

@relief7 Markus, thanks for your prompt response.

I'm using Linux with GCC and Windows with MSVC. Yep, in general, parallel execution works pretty well. I'm using it for the main ray tracing loop in this way, and the gain is as huge as with your divide-and-conquer approach for the BVH.

std::for_each(std::execution::par, image.begin(), image.end(), [&](vec2 &pixel) {
    ...
});

But I write the pixel colours to memory and save them using STB Image; with a PPM file this approach won't work, because you can't write to an exact location in the file.
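
One way to keep the parallel loop and still end up with a PPM, as a rough sketch (render_pixel is a hypothetical per-pixel function here, and write_color is the books' helper, whose exact signature varies between versions): accumulate all pixel colours into an in-memory framebuffer in parallel, then write the file sequentially afterwards.

std::vector<color> framebuffer(image_width * image_height);

std::for_each(std::execution::par, framebuffer.begin(), framebuffer.end(),
    [&](color& pixel) {
        auto index = &pixel - framebuffer.data();   // recover the pixel index
        int x = int(index % image_width);
        int y = int(index / image_width);
        pixel = render_pixel(x, y);                 // hypothetical: trace this one pixel
    });

// sequential PPM output once the parallel loop is done
std::cout << "P3\n" << image_width << ' ' << image_height << "\n255\n";
for (const auto& pixel : framebuffer)
    write_color(std::cout, pixel);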

Anyway, that's not my point. Parallel execution lies way beyond the scope of the books. I just wanted to express my gratitude to you and say that none of your work was lost in the backlog, and that some people are digging into closed PRs looking for pearls. 😉

Cheers!

relief7 commented 8 months ago

Hi! Based on Arman Uguray's (@armansito's) suggestions in pull request #1391, I made some further tests.

To quote his reply, he suggested making the following changes:

As an alternative, if we really wanted to eliminate all the redundant copies and vector allocations, we could allow the BVH construction to sort the object list in-place without creating any intermediate vectors. The code would look very similar to the original listing but instead remove the const qualifier from the src_objects declaration. If preserving the original order of the input objects is important, the input could be copied once and the in-place sort could operate over that initial copy (but I'm not sure if that's really necessary).

So the changes are:
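
Roughly, the constructor then ends up looking something like this (just a sketch of the in-place idea described in the quote above, not necessarily the exact code from the pull request):

bvh_node(std::vector<shared_ptr<hittable>>& objects, size_t start, size_t end) {
    int axis = random_int(0,2);
    auto comparator = (axis == 0) ? box_x_compare
                    : (axis == 1) ? box_y_compare
                                  : box_z_compare;

    size_t object_span = end - start;

    if (object_span == 1) {
        left = right = objects[start];
    } else if (object_span == 2) {
        left = objects[start];
        right = objects[start+1];
    } else {
        // sort only the [start, end) sub-range of the caller's vector, in place,
        // without creating any intermediate vectors
        std::sort(objects.begin() + start, objects.begin() + end, comparator);

        auto mid = start + object_span/2;
        left = make_shared<bvh_node>(objects, start, mid);
        right = make_shared<bvh_node>(objects, mid, end);
    }

    bbox = aabb(left->bounding_box(), right->bounding_box());
}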

Some rudimentary speed tests with this "in-place sorting":

Bvh construction for 10000 white spheres took: 0.115 seconds.
Bvh construction for 20000 white spheres took: 0.27726 seconds.

3 runs for 100,000 objects
Bvh construction for 100000 white spheres took: 1.84266 seconds.
Bvh construction for 100000 white spheres took: 1.78362 seconds.
Bvh construction for 100000 white spheres took: 1.7139 seconds.

500,000 objects
Bvh construction for 500000 white spheres took: 12.6192 seconds.

1,000,000 objects
Bvh construction for 1000000 white spheres took: 26.7226 seconds.

So compared to the current code in the book, we again gain a lot of speed: roughly ~65 times faster for 20,000 objects.

In terms of performance, it also seems to be faster than my previously proposed "divide & conquer" method.

Overall, I feel like @armansito's suggestion is probably the way to go because:

  1. best performance
  2. we can stick with the current indexing method in the book (which is how it was always meant to be?)
  3. if people are in the middle of their own implementation, it only requires small changes to their existing source code (compared to the bigger code / text changes with "divide and conquer")

I also tried the changes suggested by you, Steve @hollasch:

  1. removing the second condition for "object_span == 2" --> done!
  2. using for (auto& obj : objects) instead of for (size_t object_index=0; object_index < size; object_index++): I tried this, but the speed tanked by a factor of 20. I'm not sure what is causing this, but since it made everything so much slower, I left it out for now. I even tried for (const auto& obj ... but had the same problem. Feel free to revisit on your side; I could not get it to work with decent speed.
  3. Move size definition down --> done!

I will go ahead and create a new pull request with this method, hopefully everybody is happy!

hollasch commented 8 months ago

bvh_node should definitely not alter the input list of objects; that would invalidate any existing references to the hittable list.

hollasch commented 8 months ago

Note that my earlier suggestion to use raw pointers created a copy of the input's shared_ptr values, with the understanding that the bvh_node's lifetime should be shorter than or equal to the lifetime of the input hittable list. This addresses the problem of invalidating the caller's hittable list references, at the cost of a single array copy.

hollasch commented 8 months ago

Done. See also #1416.