RayTracing / raytracing.github.io

Main Web Site (Online Books)
https://raytracing.github.io/
Creative Commons Zero v1.0 Universal

Multithreaded rendering (for example, using OpenMP) #148

Open ronaldfw opened 5 years ago

ronaldfw commented 5 years ago

Rendering time in a single-threaded C++ implementation becomes an issue with the larger examples and when rendering with many samples per pixel, particularly before the BVH implementation in book 2.

There is a fairly simple change to the code that adds multithreading over the rows of the output image that speeds up rendering times a lot. This allows for faster experimentation with different scenes and different code.

The change consists of two parts: (1) Render into a buffer instead of printing pixel values to cout. (2) Parallelize over y of the output image (obviously this is only possible after making the change in (1)). Using OpenMP, (2) can be done in a single line.
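A minimal sketch of those two changes, with a placeholder `render_pixel` standing in for the book's camera-and-ray_color work (the function names here are illustrative, not the book's actual code):

```cpp
#include <vector>

struct Color { double r, g, b; };

// Placeholder for the book's per-pixel work (build a ray, call ray_color);
// deterministic here so the sketch stays self-contained.
Color render_pixel(int x, int y, int width, int height) {
    return Color{double(x) / (width - 1), double(y) / (height - 1), 0.25};
}

// (1) Render into a buffer instead of streaming pixels to std::cout.
// (2) Parallelize over rows; the pragma is ignored if OpenMP is disabled.
std::vector<Color> render(int width, int height) {
    std::vector<Color> framebuffer(width * height);
    #pragma omp parallel for schedule(dynamic)
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x)
            framebuffer[y * width + x] = render_pixel(x, y, width, height);
    }
    return framebuffer;
}
```

Compile with -fopenmp (g++/clang) or /openmp (MSVC) to enable the parallel path; without OpenMP the same code simply runs serially.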

On my 6-core/12-hyperthread CPU, the total speedup of both changes is somewhere around 14x. Ignoring file output, I get about 5.6x speedup over the single-threaded version.

On the flip side, the multithreaded version takes away some of the purity of the original code: writing into a buffer is more complicated than just printing every pixel value to cout, and OpenMP support needs to be turned on, though I believe most C++ compilers support it (tested only g++ and Visual Studio). An alternative could be to use std::thread.

If you consider this a good addition, I'd be happy to:

petershirley commented 5 years ago

This is in the same vein as something Steve and I were discussing. The book has the simplest BVH build, but I have done a SAH-based build and it produces very good speedups without that much pain. Inserting it, though, would start to bloat the book. It's not clear what the project's philosophy should be on these. An appendix chapter perhaps, as suggested above?

Multithreading in particular is an extreme case because of the bang for the buck. But it also brings in libraries, which is painful.

All in all this seems like a good concrete example to try to use to expose a more explicit philosophy. Opinions?

hollasch commented 5 years ago

I would suggest that with the org, projects such as these get their own GitHub repos (books + source). If someone has already done, say, Ray Tracing the Next Week, they shouldn't have to go back to some state to tackle a new series of chapters in the middle. Instead, add another volume to the library, working to maintain the same underlying philosophy of the series: featherweight publishing model, strongly encouraging readers to write their own code.

I agree that pulling in external libraries is painful, particularly for cross-platform development. That in turn brings in build systems and all the fun with that. Still, I'd say “let a thousand flowers bloom”.

One suggestion would be to provide two sets of source code: one the starting body of code (which might just be a reference to the final code for one of the books, but you'll have to watch for changes), and two, the final completed code. Like the other three books, the new book should walk the reader through modification of the existing codebase to the new multi-threaded result.

trevordblack commented 5 years ago

+1 to adding another volume to the library.

Would each technique get its own 5-10 page volume, or do we collate another half-dozen? There is a natural learning progression from multithreading to sampling to filtering.

However, the OP is explicitly stating the value of placing multithreading among the techniques in The Next Week. An additional volume implies implementing this at the end.

ronaldfw commented 5 years ago

Note that neither OpenMP nor std::thread require additional libraries. For the former, the compiler needs to support it (many do: https://www.openmp.org/resources/openmp-compilers-tools/ including the Visual Studio C++ compiler, although it's not listed there), for the latter you need C++11.

The example one has at the end of volume 1 (random scene with close to 500 objects; nx=1200, ny=800, ns=10; parameters taken from the src code in this repository) takes ~180s on my machine to render. I'm only seeing a 5x speedup for this particular instance using the change I described above, but this is still quite significant when you need to or want to experiment. (Depending on scene, platform, etc. I've seen a 5x up to 14x speedup, though I'm beginning to doubt the 14x number.)

If I had had the idea to use OpenMP earlier, I would have added it as soon as the first loop over the rows of the image appears, which is chapter 3 in the first book. The advantage you get in code iteration speed is just too big to pass up.

It's also worth noting that the changes are very local: write into a buffer instead of the std::cout line; add a new method to write the buffer to a file; add a single OpenMP line above the loop over the rows of the output image.
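The "write the buffer to a file" method could look like the following sketch; `buffer_to_ppm` is a hypothetical name, and the output mirrors the book's plain-text PPM (P3) format. Returning a string keeps the example testable; the actual file write is then a single std::ofstream line.

```cpp
#include <sstream>
#include <string>
#include <vector>

struct Color { double r, g, b; };

// Serialize the framebuffer in the book's plain-text PPM (P3) format,
// scaling components to [0,255] the way the per-pixel std::cout code did.
std::string buffer_to_ppm(const std::vector<Color>& fb, int width, int height) {
    std::ostringstream out;
    out << "P3\n" << width << ' ' << height << "\n255\n";
    for (const Color& c : fb) {
        out << int(255.999 * c.r) << ' '
            << int(255.999 * c.g) << ' '
            << int(255.999 * c.b) << '\n';
    }
    return out.str();
}
```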

hollasch commented 5 years ago

It's also worth noting that the changes are very local

That makes this much more interesting. As this is an introductory text, I believe that understanding how a single-threaded approach would work is a prerequisite to parallelizing the code. However, your point about iteration is very well taken.

Can we do both approaches? Given your suggestions above, it seems like we could introduce a #if PARALLEL construct to use either serial or parallel approaches. Indeed, seeing the two side-by-side would be instructive for many. If I understand what you're suggesting with respect to data structures and functions, it seems like this would be doable, and succinct.
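The #if PARALLEL toggle might be sketched like this, with the serial and std::thread paths side by side (fill_row is a toy stand-in for per-row rendering, not the book's code):

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Toy stand-in for rendering one row of the image into the buffer.
void fill_row(std::vector<int>& buf, int width, int y) {
    for (int x = 0; x < width; ++x) buf[y * width + x] = x + y;
}

std::vector<int> render(int width, int height) {
    std::vector<int> buf(width * height);
#if PARALLEL
    // Parallel path (compile with -DPARALLEL=1): each thread takes
    // every n-th row, so no two threads write the same pixel.
    unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n; ++t)
        workers.emplace_back([&buf, width, height, n, t] {
            for (int y = int(t); y < height; y += int(n))
                fill_row(buf, width, y);
        });
    for (auto& w : workers) w.join();
#else
    // Serial path: identical work, one row at a time.
    for (int y = 0; y < height; ++y)
        fill_row(buf, width, y);
#endif
    return buf;
}
```

Both paths produce identical buffers, which is what makes the side-by-side comparison instructive.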

If this is accurate, could you cobble up an example of how you think this would look? I've sent an invite so you can just work in a feature branch without the overhead of creating your own fork. Just code for now — updating the book content can come later.

ronaldfw commented 5 years ago

There are now two examples of how to add multithreading in this feature branch: https://github.com/RayTracing/InOneWeekend/tree/ronaldfw/multithreading

  1. Using OpenMP (main_openmp.cc)

This is the elegant solution, as it requires only one line to add the multithreading. Unfortunately, setting up OpenMP on Mac OS X seems to be more involved, and I didn't even get it to run. I was only able to test this on Windows (which requires more changes, as drand48() is not available there). Not being able to use this easily on Mac rules this option out, I think.

  2. Using std::thread (main.cc)

Requires compiling with "-std=c++11". This works well, but the code changes are a little more involved than just adding a single line. On my iMac, I get a speedup of 3.33 (4 physical cores).

hollasch commented 5 years ago

I'd prefer to stick with the standard. C++11 seems quite reasonable.

ronaldfw commented 5 years ago

My current thinking on this is that we should add an appendix to the book that adds multithreading (if at all). I agree with Steve's point that this is an introductory text and multithreading, while quite simple to implement here, complicates things in a way that makes focusing on the core of what folks are aiming to learn here (ray tracing) more difficult.

If we have this in an appendix, folks will still have the option to go there. We could have a paragraph in chapter 3 along the lines of

"If you are interested in adding multithreading to your code, you can now take a detour through the appendix "Multithreaded Rendering". In general we recommend doing so as in particular it helps a lot with iteration speed when experimenting with code. But since this book is about learning ray tracing and not programming, we don't want to include this in the main part of the book. [...]"

The appendix should then discuss how to render into a buffer and how to add multithreading. For the latter I'd recommend we stick to C++11's std::thread. While OpenMP is a lot more elegant, getting it to work on macOS is a pain. I think the more involved code changes using std::thread are okay if this is done in the appendix where we can have a more elaborate discussion.

I've started drafting the book parts for the appendix but it's not yet ready to be shared.

ronaldfw commented 4 years ago

I'm creating a new branch on the new repository to keep working on this and bring over all the changes I had made earlier on the InOneWeekend repository.

Also, as has been commented elsewhere, this requires a thread-safe approach to random number generation. My first version will not get this right in all aspects but will use thread_local on the random number generators. The remaining issue is that they will all be seeded with the same state.
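A sketch of that thread_local approach; mixing a hash of the thread id into the seed is one hypothetical way around the identical-seed problem mentioned above (the eventual book solution may differ):

```cpp
#include <functional>
#include <random>
#include <thread>

// One generator per thread (thread_local), seeded from a hash of the
// thread id so that threads don't all produce the same sequence.
inline double random_double() {
    static thread_local std::mt19937 gen(
        static_cast<std::mt19937::result_type>(
            std::hash<std::thread::id>{}(std::this_thread::get_id())));
    static thread_local std::uniform_real_distribution<double> dist(0.0, 1.0);
    return dist(gen);
}
```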

hollasch commented 4 years ago

I'm pulling this out of the v3 milestone, as it's not essential to that release. It would be great to have it, but it shouldn't hold things up, and you should feel free to get this done in your own time.

trevordblack commented 4 years ago

I've written a chapter to walk the user through this. It needs more edits, but the bones are there

D-K-E commented 4 years ago

@trevordblack Might I suggest the use of std::async from C++11? I implemented this with futures and async, and it was really a breeze. It takes about ~20 more lines in my very basic implementation. The multithreaded version of the Perlin noise example with two spheres takes 26609664 microseconds on 4 CPUs; the same example on a single thread takes 74364349 microseconds. That's about a 2.8x speedup. My implementation is very basic, so it could probably be optimized further.

trevordblack commented 4 years ago

Hmmm.

I've a chapter written that's about 1/2 the length of book 2 or book 3 (it's a chonker). As written, the user goes through a series of naive implementations before arriving at the "final" solution. When writing, I made the decision to avoid dealing with semaphores or futures, simply because they aren't necessary, and every language has a slightly distinct way of doing threading/semaphores/locks.

The assumption with the chapter is that everyone is starting from zero, and that many people would be writing their code in a different language.

At present I explain what a race condition is, where it comes from, and some simple means of dealing with them. All of these solutions boil down to making sure all threads read from constant data and write to independent memory locations. I don't think I'll have fulfilled the spec if I don't keep all that in (explaining multithreading w/o explaining race conditions would be a mistake).

Spending an extra 500 (?) words to explain semaphores/futures/locks wasn't a thing I wanted to do.

But I'll loop back around and consider it. Using a feature married pretty heavily to the std is not something I'm thrilled to do.

Not that it's a bad idea. I just wonder if it's outside scope...

D-K-E commented 4 years ago

The assumption with the chapter is that everyone is starting from zero, and that many people would be writing their code in a different language.

I have been doing live coding with the content here, and this came up quickly from watchers; that is, some of my followers expressed their desire to implement the content in different languages, most notably Rust. They asked me whether the code depends on any external libraries, etc.

I would say staying with your initial idea seems better suited and better aligned with the general philosophy of the work, and I would not be surprised if you consider the following out of scope.

Maybe after the 'final' solution, or near it, an alternative listing could be provided with async, noting that it's just a high-level wrapper around threads/futures, etc. The only positive I can think of is the ability to return a value from them, which I find slightly easier than explicitly ensuring different memory write locations. The overall structure of my final implementation was:

struct ImageTile;
struct ReturnTile;

std::vector<std::future<ReturnTile>> fs(THREAD_NB);
for (int i = 0; i < THREAD_NB; i++) {
    ImageTile imt = makeTile(...);
    fs[i] = std::async(traceTile, imt); // the inner loop with pixel samples
}
std::vector<std::vector<color>> img(imwidth, std::vector<color>(imheight, color(0, 0, 0)));
for (auto& f : fs) {
    ReturnTile ret = f.get();
    joinTileToImv(ret, img);
}
writeImgToPpm(img);

But again, it's true that it would be out of scope, even slightly harmful, for introducing someone starting from zero to multithreaded rendering.

D-K-E commented 4 years ago

Another quick point came up while I was debugging the multithreaded version of the final scene of The Rest. I was debugging with gdb, and debugging the multithreaded version proved to be more difficult. Then I realized that it does not take much to make the async version single-threaded again. Basically, by changing:

std::vector<std::future<ReturnTile>> fs(THREAD_NB);
fs[i] = std::async(traceTile, imt); // the inner loop with pixel samples
ReturnTile ret = f.get();
joinTileToImv(ret, img);

to

std::vector<ReturnTile> fs(THREAD_NB);
fs[i] = traceTile(imt);
joinTileToImv(fs[i], img);

In the above code, I was able to drop back to single threaded execution. Maybe it is not directly relevant to the main point, ie increasing performance, but I thought it might be a point to consider.

mratsim commented 4 years ago

One issue with multithreading is the random_double() function. If it is not referring to a per-thread RNG you will have contention on the random number generator at best and corrupted random state at worst.

And the artifacts you get are quite visible if you try to avoid that in an incorrect way. For example, in smallpt, when I tried to parallelize both the rows and the columns, I used a very simple strategy of just having each thread seed its RNG from the iteration variable, but this led to:

Expected: [image]

With 2 parallel loops and no RNG precautions: [image]

Source: single loop parallel, correct rendering (in Nim, a compiled language with a syntax similar to Python)

    parallelFor y in 0 ..< h:               # Loop over image rows
      captures: {buf, samples}
      stderr.write &"\rRendering ({samples*4} samples per pixel) {100.0*y.float64/float(h-1):5.2f}%"
      var xi = [cushort 0, 0, cushort y*y*y]
      for x in 0 ..< w:                     # Loop over columns
        let i = (h-y-1)*w+x

Source: 2 parallel loops, RNG artifact

    parallelFor y in 0 ..< h:               # Loop over image rows
      captures: {buf, samples}
      stderr.write &"\rRendering ({samples*4} samples per pixel) {100.0*y.float64/float(h-1):5.2f}%"
      parallelFor x in 0 ..< w:             # Loop over columns
        captures: {y, buf, samples}
        var xi = [cushort 0, 0, cushort y*y*y]
        let i = (h-y-1)*w+x

xi is the RNG state.

Now, the proposed solution parallelizes only the outer loop, but given that the RNG state is a global, we will at least have to use a __thread (or thread_local) annotation in random_double().

D-K-E commented 4 years ago

Yes, I completely forgot, the source code here uses rand for generating numbers in random_double :facepalm:.

I have been using the following generator for a while now, and I have not experienced any problems with it:

#include <random>

inline double random_double() {
  // Returns a random double in [0,1); one generator per thread.
  static thread_local std::mt19937 gen;
  static thread_local std::uniform_real_distribution<double> distr(0.0, 1.0);
  return distr(gen);
}

It should be thread safe, though note that each thread's generator will start from the same default seed.

trevordblack commented 4 years ago

I was/am aware of the problem of sharing random numbers across different threads, but it didn't come up as a problem when I was writing. I'll loop back around

hollasch commented 4 years ago

If you look at the generated images in my latest PR, there are definite RNG artifacts, I think. I might want to swap out the random_double() function and see if it changes the outcome.

kyamant commented 3 years ago

I was able to achieve a significant speedup (the need for which became painfully apparent in the final scene of the second book) by using #pragma omp parallel for before the multi-sample loop, along with the following in rtweekend.h:

static std::mt19937 engine;
static std::uniform_real_distribution<double> uniformDist(0.0, 1.0);

inline double random_double() {
    // Returns a random real in [0,1).
    return uniformDist(engine);
}

kyamant commented 3 years ago

In fact, one can avoid pseudo-RNG altogether, and hence its consequences, by simply using std::random_device instead of std::mt19937, since most CPUs nowadays have noise-sampling-based random number generation capability:

std::random_device randomDevice;
if (randomDevice.entropy() > 0) {
    // CPU is capable of true random number generation
} else {
    // Fall back to pseudo-random number generation
}

trevordblack commented 3 years ago

I mean, my preference has always been to force the user to write their own PRNG.

Truly random samples are of marginal value in RT. You want to be able to shape your samples as blue noise or LDS
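For readers unfamiliar with the acronym, an LDS (low-discrepancy sequence) replaces random samples with deterministically well-spread ones. A classic building block is the base-2 van der Corput radical inverse; this snippet is illustrative, not part of the books:

```cpp
#include <cstdint>

// Base-2 radical inverse: reverse the 32 bits of n and read them as a
// binary fraction in [0,1). Successive values of n fill the unit interval
// far more evenly than independent uniform random samples.
inline double radical_inverse_base2(uint32_t n) {
    n = (n << 16) | (n >> 16);
    n = ((n & 0x00ff00ffu) << 8) | ((n & 0xff00ff00u) >> 8);
    n = ((n & 0x0f0f0f0fu) << 4) | ((n & 0xf0f0f0f0u) >> 4);
    n = ((n & 0x33333333u) << 2) | ((n & 0xccccccccu) >> 2);
    n = ((n & 0x55555555u) << 1) | ((n & 0xaaaaaaaau) >> 1);
    return n * (1.0 / 4294967296.0); // n / 2^32
}
```

The sequence starts 0, 0.5, 0.25, 0.75, 0.125, ..., repeatedly subdividing the interval; that even coverage is the "shaping" that pure random numbers lack.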

trevordblack commented 3 years ago

Unless the PRNG we're using is TRULY awful, it really shouldn't have an effect on visual quality. This isn't a cryptographic cipher.

kyamant commented 3 years ago

I was trying to make it easy for people to use OMP, without getting into pseudo RNG related issues when multiple threads are involved.

trevordblack commented 3 years ago

This is basically thread safe for our purposes:

uint32_t xorshift128(void)
{
    static uint32_t x = 123456789;
    static uint32_t y = 362436069;
    static uint32_t z = 521288629;
    static uint32_t w = 88675123;
    uint32_t s, t = w;
    t ^= t << 11;
    t ^= t >> 8;
    w = z; z = y; y = s = x;
    t ^= s;
    t ^= s >> 19;
    x = t;
    return t;
}

float randf() {
    float r = xorshift128() / 4294967296.0f;
    return (r != 1.0f) ? r : randf();
}

You're right that we could call in all manner of really cool and functional libraries, even truly random ones! I just think it's removing a learning opportunity

kyamant commented 3 years ago

The idea is to prevent different threads from repeating the same sequence; in the posted code there isn't any thread-specific logic, nor any protection, such as a mutex, for the internal state. Any two threads can execute the same sequence.

trevordblack commented 3 years ago

Wait, back up.

  1. Why is two threads executing the same sequence a problem?
  2. All threads share the posted code. They're all modifying the static variables on top of each other. This is fine.
fnw commented 2 years ago

It looks like it's been a while since the discussion went cold here, but I'd like to chime in and also to know what the current consensus is.

The final scene of Book 2 was basically impossibly slow for me, so I decided to give parallelism a shot.

With OpenMP, you can achieve this with two lines of OpenMP annotations and not a single change to the code itself. But @ronaldfw has indicated that setting up OpenMP on Mac is troublesome, so I decided to try it with only standard library features (C++20), and (mostly) managed to do it without major changes to the code, but there is a catch.

First, the changes. I parallelized the sample accumulation loop, since it was the easiest target and made for the least changes. The changes that I made also don't require you to think about indexes much, so I find them easier to understand.

We are gonna need a vector to store the samples before accumulating them, and another for listing the indexes from [0, samplesPerPixel):

std::vector<Color> samples(samplesPerPixel);
std::vector<int> idxs(samplesPerPixel);
std::iota(idxs.begin(), idxs.end(), 0);

These can be created only once before the main loop. Also note that the second vector could possibly be replaced with std::views::iota, but current threading implementations don't seem to like this idea.

As for the sample accumulation loop, it becomes:

std::for_each(std::execution::par_unseq, idxs.begin(), idxs.end(),
    [i, j, imageHeight, imageWidth, cam, &background, &world, maxDepth, &samples](auto&& s) {
        auto u = (i + randomDouble()) / (imageWidth - 1);
        auto v = (j + randomDouble()) / (imageHeight - 1);
        Ray r = cam.getRay(u, v);
        samples[s] = rayColor(r, background, world, maxDepth);
    });

Here we are using std::for_each with a parallel execution policy, which requires including <execution>. We needed that second vector to give std::for_each a range to iterate over. The captures are basically there to conform to the definition of rayColor.

Finally, writing the color to the output becomes:

writeColor(of, std::accumulate(samples.begin(), samples.end(), Color(0,0,0)), samplesPerPixel);

I don't know whether the random number generator is thread-safe, but the output seems correct to me, at least when rendering at 400x400 with 1000 samples per pixel (for speed).

Now comes the catch: at least on GCC 11.1.0 under Linux, actually getting the loop to use more than one thread requires linking against Intel's Threading Building Blocks, using -ltbb, which has to be installed from source. This is the kind of rabbit hole you would want to protect the readers from. Nevertheless, I'm leaving this comment here because threading implementations might improve in the future, and this would then become a superior alternative to OpenMP, at least in terms of requiring only the standard library. Also, if adventurous readers find this issue, it might be an option for them.

In conclusion, as it stands right now, I find OpenMP to be the superior alternative (except for the Mac ecosystem), since AFAIK it ships by default with GCC (and maybe MSVC? I'm not sure).

trevordblack commented 2 years ago

If we ever do multithreading, it'll be with C++11 std::thread, not OpenMP. The reasons for this are many.

fnw commented 2 years ago

@trevordblack The solution I posted is a good alternative if depending on OpenMP is problematic, but it does use C++ 20. However, the TBB dependency is probably just a quirk of GCC; I don't know how clang and MSVC implement this.

Anyway, if sticking to C++11 is a requirement, here's a solution that builds on ronaldfw's branch, but makes smaller changes and has less indexing to deal with:

// (inside the existing loop over rows, with j as the row index)
for (int i = 0; i < imageWidth; ++i)
{
    for(int t=0 ; t < nThreads; ++t)
    {
        threads[t] = std::thread(std::bind([&](int beg){
            for (int s = beg; s < samplesPerPixel; s += nThreads)
            {
                auto u = (i + randomDouble()) / (imageWidth - 1);
                auto v = (j + randomDouble()) / (imageHeight - 1);
                Ray r = cam.getRay(u, v);
                samples[s] = rayColor(r, background, world, maxDepth);
            }
        }, t));
    }

    for(int t = 0; t < nThreads; ++t)
    {
        threads[t].join();
    }

    writeColor(of, std::accumulate(samples.begin(), samples.end(), Color(0,0,0)), samplesPerPixel);
}

It still requires a vector of samples, but we can drop the vector of indexes from my last comment. Also, we need to create a vector of threads:

const int nThreads = std::thread::hardware_concurrency();
std::vector<std::thread> threads(nThreads);

Under GCC and Linux I had to use -lpthread to link against pthread, but again, this should be GCC-specific.

trevordblack commented 2 years ago

Couple of changes would be needed:

  1. We would want to create a distinct function:

    void thread_function_of_some_sort(...) {
            for (int s = beg; s < samplesPerPixel; s += nThreads)
            {
                auto u = (i + randomDouble()) / (imageWidth - 1);
                auto v = (j + randomDouble()) / (imageHeight - 1);
                Ray r = cam.getRay(u, v);
                samples[s] = rayColor(r, background, world, maxDepth);
            }
    }
    ...
      threads[t] = std::thread(..., thread_function_of_some_sort);

    Because, while the syntax above is clean, it's not very intuitive. So I would motivate the distinct function and create it.

  2. Near as I can tell, you're spinning up a unique thread per pixel. We would be splitting up the image by row or by collections of rows.

  3. Scour std::accumulate from the earth, although I'm sure you wrote that for brevity.

There's no real good way to remove the vector of samples. Once you have multiple threads writing to data, you need a way to guarantee there isn't aliasing or race conditions. You can do this by allotting blocks of memory that are unique per thread (the vector of samples) or a file that is unique per thread.
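That "unique block of memory per thread" pattern, reduced to a toy example (not code from the books): each thread writes only its own slot of a partial-results vector, threads share only read-only input, and results are combined after join, so no locks are needed:

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Each thread writes only to partial[t]; the input vector is read-only.
// Correct without locks because no two threads touch the same memory.
long parallel_sum(const std::vector<int>& data, unsigned n_threads) {
    std::vector<long> partial(n_threads, 0);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n_threads; ++t)
        workers.emplace_back([&data, &partial, n_threads, t] {
            for (std::size_t i = t; i < data.size(); i += n_threads)
                partial[t] += data[i];
        });
    for (auto& w : workers) w.join();
    long total = 0;
    for (long p : partial) total += p;  // combine after all writes are done
    return total;
}
```

The vector of samples in the snippets above plays the same role: sample indices are partitioned across threads, so writes never overlap.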

The bottleneck for the change here isn't that we can't do it, or don't know how. It just hasn't been a priority, and figuring out a good place to put it within the context of the three books has been annoying.

On the whole though, I'm glad that you got it working and are pushing for this again.

fnw commented 2 years ago

About your points:

  1. Okay, I agree on the function. The lambda captures just made it quicker to write, since that function is going to have a lot of parameters.

  2. Yes, that was on purpose. The idea was just to make the smallest possible change that would allow taking advantage of parallelism. I guess it really depends on what the book would be teaching in this section. Making the code parallel over rows involves a bit more indexing, and if the changes are going to be bigger, it would be appropriate to discuss threading more carefully. I'll admit it's slower, though: I've noticed that CPU usage is lower (80-something percent instead of nearly 100% with OpenMP or the parallel std::for_each) and that the execution time seems higher (I haven't really timed it). But I agree: unless we are dealing with a lot of samples, the overhead is going to be quite undesirable.

  3. I mostly live in a Python world these days, so I don't really know what's wrong with std::accumulate (I was thinking in terms of Python's sum), but removing it wouldn't really be a problem.

There's no real good way to remove the vector of samples.

Yes, I noticed that. It's either the vector or synchronization, which kinda defeats the purpose.

The bottleneck for the change here isn't that we can't do it, or don't know how.

I also noticed that. I chimed in to try and give an option with the minimum possible changes, so that they are easier to explain to the reader. I was thinking the implementation that I proposed could be a small (optional) subsection before the final scene of Book 2, because that's where you really feel the need for parallelism.

Anyway, I think this is now a problem of deciding how to write the book, so we should just let the contributors that had already been involved in this issue weigh in. But feel free to tag me when you need an extra pair of hands on this :)

trevordblack commented 2 years ago

"Anyway, I think this is now a problem of deciding how to write the book, so we should just let the contributors that had already been involved in this issue weigh in."

That would be me

wlbksy commented 6 months ago

Just eager to know whether this is still alive, since the final scene of Book 2 is way too slow.

hollasch commented 6 months ago

If this happens, it will be its own book or other work. We will not be doing this in the regular series.

hollasch commented 1 month ago

@trevordblack — do you want to take what you already have so far and at least plop it down in a new repo/book?