Closed joris-nijs closed 4 years ago
Hi, thanks for your contribution, much appreciated! I'll test your fix shortly.
You're absolutely right that there's a lot of contention in this area. In fact, not only that but the code that merges samples into levels is terribly inefficient as it's highly cache-incoherent (it's also using atomic operations and I'm sure we can design a better scheme).
I believe there are low-hanging fruits in that area. This is something I've been wanting to do for a long time but I really just don't have the time. Interested in having a look?
I still fixed a small bug I found in the changes I did earlier (as well as a part of the previous change that apparently did not make the commit somehow). If I find the time I might look a bit deeper in optimising that part. Any pointers are welcome.
If I find the time I might look a bit deeper in optimising that part. Any pointers are welcome.
We should reconsider whether the multi-scale approach currently used to provide low resolution feedback during camera interactions is the right one. It works reasonably well but it's not perfect and it's slow.
In terms of pure code optimization, LocalSampleAccumulationBuffer::store_samples()
is most likely the function that needs most attention. One should start by studying how cache-incoherent it is, then try to optimize. Can the samples be somehow sorted by tile before being merged? Can we do better than use atomic floating point additions?
Hi, I've reviewed your changes and I've been testing them quite extensively. Everything appears to work fine and I do seem to get a tiny speedup on my machine.
A few questions:
Here are my performance measurements on a Cornell Box scene at resolution 1536x1536, color transform set to None in appleseed.studio.
(SPP = samples per pixel)
When using interactive rendering the rendering is restarted often when updating the camera transformation. A lot of time is spend developing the AccumulatorTile to the Frame at higher active frame levels. This is a serial bottleneck when rendering a frame. By copying the AccumulatorTile in the critical section and developing outside, more time can be spend in the renders threads, resulting in a less noisy overal image when interacting. The cost is double the memory for the LocalSampleAccumulationBuffer.