Embree Optimizations - Githubissues

Theverat commented 6 years ago

The Embree API documentation makes some recommendations about performance:

MXCSR control and status register: I could not find any code relating to this in LuxCore. Did I overlook it, is it done via compiler flags, or is it not implemented? If the latter, was there a reason not to do it?
Thread Creation and Affinity Settings: I have not looked for this yet in the LuxCore source code.
Avoid store-to-load forwarding issues with single rays: I have implemented a fix for this in a branch, see commit 74307728e7c8f1adb1315a74b94461cab12df243. So far I have done two tests with it, results below:

Scene: Sharlybg's tutorial (with the honey, cheese and vase etc.) pyluxcore in Blender

Old: After 2 minutes: Samples: 118, S/Sec: 324 k
New: After 2 minutes: Samples: 126, S/Sec: 347 k (about 7% faster)

Scene: cornell box luxcoreconsole

Old: [Elapsed time: 179/180sec][Samples 638/0][Convergence 0.000000%][Avg. samples/sec 0.93M on 0.0K tris]
New: [Elapsed time: 179/180sec][Samples 666/0][Convergence 0.000000%][Avg. samples/sec 0.98M on 0.0K tris] (about 5% faster)

System used: Linux Intel(R) Core(TM) i7-3635QM CPU @ 2.40GHz
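
For readers unfamiliar with the store-to-load forwarding recommendation, here is a minimal sketch of the idea, assuming an x86 build with SSE. `PackedRay`, `SetRayScalar` and `SetRayVector` are hypothetical names for illustration only; this is neither the Embree ray type nor the code in the commit mentioned above.

```cpp
#include <xmmintrin.h>  // SSE: _mm_set_ps, _mm_store_ps

// Hypothetical stand-in for a single-ray struct (NOT the Embree type):
// two 16-byte groups, (origin xyz + tnear) and (direction xyz + time).
struct alignas(16) PackedRay {
    float orgTnear[4];  // org.x, org.y, org.z, tnear
    float dirTime[4];   // dir.x, dir.y, dir.z, time
};

// Scalar setup: eight separate 4-byte stores. If the ray is later read
// with 16-byte SSE loads, those loads typically cannot be forwarded from
// the narrower stores and have to wait for them to retire.
inline void SetRayScalar(PackedRay &r, const float o[3], const float d[3],
                         float tnear, float time) {
    r.orgTnear[0] = o[0]; r.orgTnear[1] = o[1]; r.orgTnear[2] = o[2];
    r.orgTnear[3] = tnear;
    r.dirTime[0] = d[0]; r.dirTime[1] = d[1]; r.dirTime[2] = d[2];
    r.dirTime[3] = time;
}

// Vector setup: one 16-byte store per group, matching the width of the
// later loads. _mm_set_ps takes its arguments from highest lane to lowest.
inline void SetRayVector(PackedRay &r, const float o[3], const float d[3],
                         float tnear, float time) {
    _mm_store_ps(r.orgTnear, _mm_set_ps(tnear, o[2], o[1], o[0]));
    _mm_store_ps(r.dirTime, _mm_set_ps(time, d[2], d[1], d[0]));
}
```

Both functions write identical values; only the store pattern differs. An optimizing compiler may merge or split stores on its own, so any real effect has to be measured.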

So there was a small performance gain in both tests. Questions from my side:

Are there already wrapper functions for these SSE instructions in LuxCore that I should use?
Do I have to add #ifdefs to check for SSE support, or can I take it for granted?
Can you reproduce the performance gain on your system(s)?
Is there anything speaking against this stuff?

Another possible optimization I'm not sure about yet: Embree offers an rtcOccluded1 function that traces a ray without returning hit point information; it only checks whether the ray hits any geometry. Wouldn't this be perfect for shadow ray tests of environment lights? I have not yet found any information on whether rtcOccluded1 is actually faster than rtcIntersect1, so I'll have to test that.
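
The conceptual difference between the two query types can be sketched with a toy scene, assuming a brute-force loop over spheres instead of a BVH. All names here (`Sphere`, `Ray`, `HitSphere`, `Intersect`, `Occluded`) are hypothetical and unrelated to the Embree API; the sketch only shows why an occlusion query may stop early, and says nothing about how large the gain is in Embree.

```cpp
#include <cmath>
#include <vector>

struct Sphere { float cx, cy, cz, radius; };
// Direction is assumed to be normalized.
struct Ray { float ox, oy, oz, dx, dy, dz, tnear, tfar; };

// Returns true and writes the nearest hit distance within [tnear, tfar].
bool HitSphere(const Ray &r, const Sphere &s, float &t) {
    const float lx = r.ox - s.cx, ly = r.oy - s.cy, lz = r.oz - s.cz;
    const float b = lx * r.dx + ly * r.dy + lz * r.dz;
    const float c = lx * lx + ly * ly + lz * lz - s.radius * s.radius;
    const float disc = b * b - c;
    if (disc < 0.0f) return false;
    const float root = std::sqrt(disc);
    t = -b - root;                   // entry point
    if (t < r.tnear) t = -b + root;  // ray starts inside: use exit point
    return t >= r.tnear && t <= r.tfar;
}

// Closest-hit query: every primitive must be tested to find the minimum.
// Returns the index of the closest sphere, or -1 if nothing is hit.
int Intersect(const Ray &r, const std::vector<Sphere> &scene,
              float &tHit, int &tests) {
    int best = -1;
    tHit = r.tfar;
    tests = 0;
    for (int i = 0; i < static_cast<int>(scene.size()); ++i) {
        float t;
        ++tests;
        if (HitSphere(r, scene[i], t) && t < tHit) { tHit = t; best = i; }
    }
    return best;
}

// Any-hit (occlusion) query: may return as soon as one hit is found.
bool Occluded(const Ray &r, const std::vector<Sphere> &scene, int &tests) {
    tests = 0;
    for (const Sphere &s : scene) {
        float t;
        ++tests;
        if (HitSphere(r, s, t)) return true;
    }
    return false;
}
```

With two spheres on the ray, `Intersect` has to test both to find the nearer one, while `Occluded` can return after the first hit regardless of ordering.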

Theverat commented 6 years ago

Another test with the branch could not show any performance gain:

Scene: cornell box luxcoreconsole

Old: [Elapsed time: 599/600sec][Samples 1815/0][Convergence 0.000000%][Avg. samples/sec 0.79M on 0.0K tris]
New: [Elapsed time: 599/600sec][Samples 1793/0][Convergence 0.000000%][Avg. samples/sec 0.78M on 0.0K tris]

This time I first rendered for 10 minutes just to eliminate any CPU turbo effect, then ran the "optimized" luxcoreconsole, and started the old luxcoreconsole right after that run finished. It looks like my first, optimistic test was skewed by the CPU turbo.

By the way:

I suspect that the denormalized float fix will also not help much (we would have to find out whether denormalized floats occur in LuxCore at all; I suspect they are rare).
Thread affinity also sounds to me like it won't change a lot.

I will try the rtcOccluded1-function next.

Dade916 commented 6 years ago

MXCSR control and status register

This can potentially lead to some non-IEEE-compliant behavior, and some of the code relies on compliance to work (for instance, the ray/bbox intersection code). For the same reason, we don't use "fast-math" compiler options (in the past they have led to very annoying and subtle bugs that were hard to track down). So, basically, the (small) performance boost doesn't look worth the risk.
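
To make the compliance concern concrete: the Embree recommendation amounts to enabling the flush-to-zero (FTZ) and denormals-are-zero (DAZ) bits in MXCSR. The sketch below assumes an x86 build where float math uses SSE; `ScopedFlushDenormals` and `HalveSmallestNormal` are hypothetical helpers, not LuxCore code. It shows one small way in which results then differ from IEEE 754 arithmetic.

```cpp
#include <limits>
#include <pmmintrin.h>  // _MM_SET_DENORMALS_ZERO_MODE
#include <xmmintrin.h>  // _mm_getcsr, _mm_setcsr, _MM_SET_FLUSH_ZERO_MODE

// Hypothetical RAII guard: enables FTZ and DAZ for the calling thread and
// restores the previous MXCSR value when it goes out of scope.
class ScopedFlushDenormals {
public:
    ScopedFlushDenormals() : saved_(_mm_getcsr()) {
        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
        _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
    }
    ~ScopedFlushDenormals() { _mm_setcsr(saved_); }

private:
    unsigned int saved_;
};

// Halves the smallest normal float. Under IEEE 754 the result is a
// nonzero denormal; with FTZ enabled it is flushed to zero instead.
// volatile keeps the compiler from folding the multiply at compile time.
float HalveSmallestNormal(bool flushDenormals) {
    volatile float tiny = std::numeric_limits<float>::min();
    if (flushDenormals) {
        ScopedFlushDenormals guard;
        volatile float r = tiny * 0.5f;
        return r;
    }
    volatile float r = tiny * 0.5f;
    return r;
}
```

MXCSR is per-thread state, so in a renderer each worker thread would need to set it; whether any LuxCore code path actually depends on denormals is not established here.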

Thread Creation and Affinity Settings I have not looked for this yet in the LuxCore source code.

This is mostly useful with very high thread counts, mainly for scene BVH building and for veeeery complex scenes.

Embree offers an rtcOccluded1 function

We always need to know the intersection point in order to support stuff like material transparency, volume scattering, etc. As in old Lux, we never use "true shadow rays" (i.e. occlusion test only).

Theverat commented 6 years ago

I see, thanks for the insight. Looks like I should not waste any more time on this topic :)

Theverat commented 6 years ago

I have one more question: The Vector, Normal, Color classes all operate a lot on 3 elements at once (e.g. addition). Do you think it would be worth a try to use SIMD instructions here to operate on 4 elements at a time (with 1 "fill element")? Is this even practical? Would the 4th "fill element" blow up the space needed for meshes, images etc. in RAM?

For classes with only 3 elements it's probably not a good idea, but maybe we can use it for Matrix operations?
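
The trade-off in question can be sketched as follows, assuming an x86 build with SSE. `Vec3` and `Vec4` are hypothetical types for illustration, not the LuxCore Vector, Normal or Color classes.

```cpp
#include <xmmintrin.h>  // SSE: _mm_load_ps, _mm_add_ps, _mm_store_ps

// Plain 3-float layout: 12 bytes per element.
struct Vec3 { float x, y, z; };

// Padded, 16-byte aligned layout: 16 bytes per element, v[3] is the
// unused "fill element". That is one third more memory per element.
struct alignas(16) Vec4 { float v[4]; };

static_assert(sizeof(Vec3) == 12, "three packed floats");
static_assert(sizeof(Vec4) == 16, "three floats plus one fill float");

// Scalar addition: three independent float adds.
inline Vec3 Add(const Vec3 &a, const Vec3 &b) {
    return {a.x + b.x, a.y + b.y, a.z + b.z};
}

// SIMD addition: a single 4-wide add; the fourth lane is wasted work.
inline Vec4 Add(const Vec4 &a, const Vec4 &b) {
    Vec4 r;
    _mm_store_ps(r.v, _mm_add_ps(_mm_load_ps(a.v), _mm_load_ps(b.v)));
    return r;
}
```

For a mesh with N vertices the padded layout needs 16N bytes instead of 12N, which is the RAM cost the question refers to. Whether the 4-wide add pays off depends on the surrounding code and has to be measured.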

Dade916 commented 6 years ago

We had hand-written SSE code for the vector class in old LuxRender, but it was removed at some point because it wasn't making any difference and it was one more thing to maintain. If you dig in the old BitBucket repository, you should be able to recover the sources and do a test. But the outcome may be the same (i.e. no improvement).

Padding the fields with a 4th element is not practical, but newer SSE versions have instructions to read 32-bit aligned vectors (with some penalty). This is the reason why Embree stores only 3 vector elements but requires padding at the end of the vertex list, so it can read 4 x float when accessing the last vertex.
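
The padding requirement can be illustrated with a small sketch, assuming an x86 build with SSE. `MakePaddedVertexBuffer` and `LoadVertex` are hypothetical helpers, not Embree or LuxCore functions.

```cpp
#include <vector>
#include <xmmintrin.h>  // SSE: __m128, _mm_loadu_ps

// Copies a packed xyz vertex list and appends one extra float, so that a
// 4-float load starting at the last vertex stays inside the allocation.
std::vector<float> MakePaddedVertexBuffer(const std::vector<float> &xyz) {
    std::vector<float> buf(xyz);
    buf.push_back(0.0f);  // padding only, never used as vertex data
    return buf;
}

// Loads vertex i with one unaligned 16-byte load. Lanes 0..2 hold x, y, z;
// lane 3 holds the x of the next vertex (or the padding float for the
// last vertex) and must be ignored by the caller.
inline __m128 LoadVertex(const std::vector<float> &buf, int i) {
    return _mm_loadu_ps(buf.data() + 3 * i);
}
```

The vertices stay at 12 bytes each, so the memory cost is a single extra float per buffer rather than one per vertex.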

LuxCoreRender / LuxCore

Embree Optimizations #127