appleseedhq / appleseed

A modern open source rendering engine for animation and visual effects
https://appleseedhq.net/
MIT License
2.2k stars 329 forks source link

Shorten critical section when developing accumulation buffer to frame #2835

Closed joris-nijs closed 4 years ago

joris-nijs commented 4 years ago

When using interactive rendering the rendering is restarted often when updating the camera transformation. A lot of time is spend developing the AccumulatorTile to the Frame at higher active frame levels. This is a serial bottleneck when rendering a frame. By copying the AccumulatorTile in the critical section and developing outside, more time can be spend in the renders threads, resulting in a less noisy overal image when interacting. The cost is double the memory for the LocalSampleAccumulationBuffer.

dictoon commented 4 years ago

Hi, thanks for your contribution, much appreciated! I'll test your fix shortly.

You're absolutely right that there's a lot of contention in this area. In fact, not only that but the code that merges samples into levels is terribly inefficient as it's highly cache-incoherent (it's also using atomic operations and I'm sure we can design a better scheme).

I believe there are low-hanging fruits in that area. This is something I've been wanting to do for a long time but I really just don't have the time. Interested in having a look?

joris-nijs commented 4 years ago

I still fixed a small bug I found in the changes I did earlier (as well as a part of the previous change that apparently did not make the commit somehow). If I find the time I might look a bit deeper in optimising that part. Any pointers are welcome.

dictoon commented 4 years ago

If I find the time I might look a bit deeper in optimising that part. Any pointers are welcome.

We should reconsider whether the multi-scale approach currently used to provide low resolution feedback during camera interactions is the right one. It works reasonably well but it's not perfect and it's slow.

In terms of pure code optimization, LocalSampleAccumulationBuffer::store_samples() is most likely the function that needs most attention. One should start by studying how cache-incoherent it is, then try to optimize. Can the samples be somehow sorted by tile before being merged? Can we do better than use atomic floating point additions?

dictoon commented 4 years ago

Hi, I've reviewed your changes and I've been testing them quite extensively. Everything appears to work fine and I do seem to get a tiny speedup on my machine.

A few questions:

dictoon commented 4 years ago

Here are my performance measurements on a Cornell Box scene at resolution 1536x1536, color transform set to None in appleseed.studio.

(SPP = samples per pixel)

Master

This PR