Theverat closed 6 years ago
Another test with the branch could not show any performance gain:
Scene: cornell box luxcoreconsole
Old: [Elapsed time: 599/600sec][Samples 1815/0][Convergence 0.000000%][Avg. samples/sec 0.79M on 0.0K tris]
New: [Elapsed time: 599/600sec][Samples 1793/0][Convergence 0.000000%][Avg. samples/sec 0.78M on 0.0K tris]
This time I first rendered for 10 minutes as a warm-up, just to eliminate any CPU turbo effect. Then I ran the "optimized" luxcoreconsole, and started the old luxcoreconsole right after that run finished. It looks like my first, optimistic test was screwed up by the CPU turbo.
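For reference, the relative difference between the two 10-minute runs can be worked out from the sample counts quoted above. This is only a minimal sketch; the helper name `relative_change_percent` is made up for illustration and is not part of LuxCore.

```cpp
#include <cstdio>

// Hypothetical helper: relative change of `candidate` versus `baseline`, in percent.
double relative_change_percent(double baseline, double candidate) {
    return (candidate - baseline) / baseline * 100.0;
}

// Prints the difference between the two 10-minute cornell box runs quoted above.
void print_ten_minute_comparison() {
    const double old_samples = 1815.0;  // "Old" run
    const double new_samples = 1793.0;  // "New" run
    std::printf("change: %.2f%%\n",
                relative_change_percent(old_samples, new_samples));
}
```

With these numbers the "New" build comes out roughly 1.2% behind, so the two builds are effectively on par here.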
By the way: I will try the rtcOccluded1 function next.
MXCSR control and status register
This can potentially lead to some non-IEEE-compliant behavior, and some of the code relies on compliance to work (for instance, the ray/bbox intersection code). For the same reason, we don't use "fast-math" compiler options (in the past they have led to very annoying and subtle bugs that were hard to track down). So, basically, the (small) performance boost doesn't look worth the risk.
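To make the compliance concern concrete, here is a minimal sketch of what the flush-to-zero bit in MXCSR does to a single multiplication. It assumes an x86 build where float math goes through SSE; the names `set_flush_to_zero`, `half_of_smallest_normal` and `SKETCH_HAS_SSE` are made up for this example and do not come from LuxCore or Embree. The denormals-are-zero bit has a similar macro in `pmmintrin.h`, which is left out here.

```cpp
#include <limits>

// Only touch MXCSR when the compiler says SSE is available.
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define SKETCH_HAS_SSE 1
#else
#define SKETCH_HAS_SSE 0
#endif

// Turns flush-to-zero on or off for the calling thread (MXCSR is per thread).
// Returns false when the build has no SSE support; nothing is changed then.
bool set_flush_to_zero(bool enabled) {
#if SKETCH_HAS_SSE
    _MM_SET_FLUSH_ZERO_MODE(enabled ? _MM_FLUSH_ZERO_ON : _MM_FLUSH_ZERO_OFF);
    return true;
#else
    (void)enabled;
    return false;
#endif
}

// Halves the smallest normal float. The exact result is denormal, so
// IEEE 754 arithmetic returns a tiny non-zero value, while flush-to-zero
// mode returns 0. `volatile` keeps the compiler from folding the
// multiplication at compile time.
float half_of_smallest_normal() {
    volatile float smallest = std::numeric_limits<float>::min();
    volatile float result = smallest * 0.5f;
    return result;
}
```

Calling `half_of_smallest_normal()` before and after `set_flush_to_zero(true)` should give a non-zero value and then zero on such a build, which is the kind of silent change that code relying on IEEE behavior has to be checked against.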
Thread Creation and Affinity Settings I have not looked for this yet in the LuxCore source code.
This is useful mostly with very high thread count, mostly for scene BVH building and for veeeery complex scenes.
Embree offers an rtcOccluded1 function
We always need to know the intersection point because of the support of stuff like material transparency, volume scattering, etc. It is like old Lux, we never use "true shadow rays" (i.e. occlusion test only).
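A toy sketch of why a yes/no occlusion test is not enough once surfaces can be partially transparent. This is not how LuxCore implements it; `SurfaceHit`, `any_hit` and `transmittance` are hypothetical names, and the per-surface opacity is a deliberately simplified stand-in for material transparency and volume scattering.

```cpp
#include <vector>

// Hypothetical record for one surface crossed by a shadow ray.
struct SurfaceHit {
    float distance;  // distance along the ray
    float opacity;   // 0 = fully transparent, 1 = fully opaque
};

// Occlusion-only answer: is there any surface in front of the light?
bool any_hit(const std::vector<SurfaceHit>& hits, float light_distance) {
    for (const SurfaceHit& h : hits)
        if (h.distance < light_distance) return true;
    return false;
}

// Answer that needs the hit data: the fraction of light that gets through
// all surfaces located in front of the light.
float transmittance(const std::vector<SurfaceHit>& hits, float light_distance) {
    float t = 1.0f;
    for (const SurfaceHit& h : hits)
        if (h.distance < light_distance) t *= 1.0f - h.opacity;
    return t;
}
```

For two half-transparent surfaces in front of the light, `any_hit` reports "blocked" while `transmittance` still lets a quarter of the light through, so the renderer needs to look at each intersection rather than stop at the first one.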
I see, thanks for the insight. Looks like I should not waste any more time on this topic :)
I have one more question: The Vector, Normal, Color classes all operate a lot on 3 elements at once (e.g. addition). Do you think it would be worth a try to use SIMD instructions here to operate on 4 elements at a time (with 1 "fill element")? Is this even practical? Would the 4th "fill element" blow up the space needed for meshes, images etc. in RAM?
For classes with only 3 elements it's probably not a good idea, but maybe we can use it for Matrix operations?
We had hand-written SSE code for the vector class in old LuxRender, but it was removed at some point because it wasn't making any difference and it was something additional to maintain. If you dig in the old BitBucket repository, you should be able to recover the sources and do a test. But the outcome may be the same (i.e. no improvement).
Padding the fields with a 4th element is not practical, but newer SSE versions have instructions to read 32-bit-aligned vectors (with some penalty). This is the reason why Embree stores only 3 vector elements but requires padding at the end of the vertex list, so it can read 4 x float when accessing the last vertex.
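The padding idea can be sketched roughly like this. It is a portable illustration only: `PackedVertices`, `Float4`, `packed_bytes` and `padded_bytes` are made-up names, and `std::memcpy` stands in for the unaligned 4 x float SSE load.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

struct Float4 { float x, y, z, w; };

// Hypothetical tightly packed vertex list: 3 floats per vertex plus ONE
// extra float at the very end, so a 4-float read of the last vertex
// stays inside the allocation.
struct PackedVertices {
    std::vector<float> data;

    explicit PackedVertices(std::size_t vertex_count)
        : data(vertex_count * 3 + 1, 0.0f) {}

    void set(std::size_t i, float x, float y, float z) {
        data[i * 3 + 0] = x;
        data[i * 3 + 1] = y;
        data[i * 3 + 2] = z;
    }

    // Reads 4 consecutive floats starting at vertex i. The 4th lane holds
    // the next vertex's x (or the padding float) and is ignored by callers.
    Float4 load4(std::size_t i) const {
        Float4 out;
        std::memcpy(&out, data.data() + i * 3, sizeof(Float4));
        return out;
    }
};

// Memory for n vertices: packed with one trailing float vs. 4 floats each.
std::size_t packed_bytes(std::size_t n) { return (n * 3 + 1) * sizeof(float); }
std::size_t padded_bytes(std::size_t n) { return n * 4 * sizeof(float); }
```

With 4-byte floats, a million vertices take about 12 MB in the packed layout versus 16 MB with a fill element per vertex, which is the RAM overhead asked about above.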
The Embree API documentation makes some recommendations about performance:
Scene: Sharlybg's tutorial (with the honey, cheese and vase etc.) pyluxcore in Blender
Old: After 2 minutes: Samples: 118 S/Sec: 324 k
New: After 2 minutes: Samples: 126 S/Sec: 347 k (about 7% faster)
Scene: cornell box luxcoreconsole
Old: [Elapsed time: 179/180sec][Samples 638/0][Convergence 0.000000%][Avg. samples/sec 0.93M on 0.0K tris]
New: [Elapsed time: 179/180sec][Samples 666/0][Convergence 0.000000%][Avg. samples/sec 0.98M on 0.0K tris] (about 5% faster)
System used: Linux Intel(R) Core(TM) i7-3635QM CPU @ 2.40GHz
So there was a small performance gain in both tests. Questions from my side:
#ifdefs to check for SSE support, or can I take it for granted?

Another possible optimization I'm not sure about yet: Embree offers an rtcOccluded1 function to trace a ray without caring for hitpoint information, just checking if it hits any geometry. Wouldn't this be perfect for shadow ray tests of environment lights? I could not yet find any information if rtcOccluded1 is actually faster than rtcIntersect1 though, I'll have to test that.