POV-Ray / povray

The Persistence of Vision Raytracer: http://www.povray.org/
GNU Affero General Public License v3.0
1.35k stars, 282 forks

Scene with 18% v3.7-stable to v3.8 (master at ecad9eb) performance degrade. #363

Open wfpokorny opened 6 years ago

wfpokorny commented 6 years ago

Summary

The simple scene below shows a much larger than normal 3.7 to 3.8 performance degrade (+18.4%) when no anti-aliasing (AA) is used, and a more reasonable +3.4% degrade with relatively strong AA. Other scenes show the same non-AA to AA slowdown ratio (3-6 to 1), but at much less total slowdown.

Note: I plan to dig further into this myself.

Environment

Ubuntu 16.04.... I'll edit with more if my later investigation warrants.

Steps to Reproduce

/usr/bin/time p370 -j -wt1 -fn -p -d +w1200 +h900 FS14.pov -a0.05 +am2 +r5

/usr/bin/time p380 -j -wt1 -fn -p -d -cc +w1200 +h900 FS14.pov -a0.05 +am2 +r5

Expected Behavior

Much less of a performance degrade.

Actual Behavior

Much slower than expected.

Scene

#declare testmat = material { texture {
      pigment {color <1,0,0> transmit 0.5}
}};

camera { location  <1.0, 2.0, -4.0>
   direction 1.5*z
   right     x*image_width/image_height
   look_at   <1.0, -0.2,  0.0>
   angle 30 }

light_source {<-10, 10, -5> color rgb <1, 1, 1>}

plane {y,-1 pigment{rgb <1,1,1>}}

union {
   sphere {<0.0,0,0>,0.5 material{testmat}}
   sphere {<0.1,0,0>,0.5 material{testmat}}
   sphere {<0.2,0,0>,0.5 material{testmat}}
   sphere {<0.3,0,0>,0.5 material{testmat}}
   sphere {<0.4,0,0>,0.5 material{testmat}}
   sphere {<0.5,0,0>,0.5 material{testmat}}
}
LeForgeron commented 5 years ago

Trying a bit of investigation:

valgrind --tool=callgrind --separate-threads=yes dmpovray -j -wt1 -fn -p -d -c +w1200 +h900 FS14.pov -a0.05 +am2 +r5

valgrind --tool=callgrind --separate-threads=yes dspovray -j -wt1 -fn -p -d -c +w1200 +h900 FS14.pov -a0.05 +am2 +r5

dmpovray is master in debug mode; dspovray is stable in debug mode.

track.txt

I just cannot find the reason so far. It seems to cycle between pov::Trace::TraceRay, ComputeTextureColour, ComputeOneTextureColour, ComputeLightedTexture, ComputeRefraction and back to TraceRay.

wfpokorny commented 5 years ago

Thanks for taking a look. I'd not have thought to use valgrind - and I admit, too, to not being sure I'm following all the track.txt output.

Some thoughts pop to the surface on your post.

First, while I don't think it matters due to the use of -fn, the -c in my flag list should have been -cc - I'll go edit the original post to fix that. Second, I'd had it on my mental list to try my 'povr' version again with my recent updates - it's still also about 20% slower, so nothing I've done lately helps.

Third, in the original testing, heavy AA softens the v3.7 to v3.8 penalty, which I always took as a hint the slowdown might be something between parsing and tracing. I suppose, though, it might be a caching effect, as AA rays tend to have a narrower locality of reference memory-wise.

Lastly, in the track.txt file I saw cooperate() again and was reminded I still don't really understand its (CooperateFunctor) function, other than it being thread related. I'm not asking you to go investigate, but do you or others know off the top of your head what it's doing in some detail?
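My working guess - and it is only a guess, sketched with hypothetical names rather than POV-Ray's actual code - is that it is a cooperative cancellation/yield point: hot tracing loops call it periodically so a render thread can notice a stop or pause request without the scheduler having to interrupt it. Something in the spirit of:

```cpp
#include <atomic>
#include <stdexcept>

// Hypothetical sketch of a cooperative cancellation point. Hot loops call
// Cooperate() periodically; the check is cheap (a relaxed atomic load)
// when no stop is pending, and unwinds the render via an exception when
// a stop has been requested from another thread.
class RenderTask
{
public:
    void RequestStop() { mStopRequested.store(true, std::memory_order_relaxed); }

    // Called from inside tracing loops.
    void Cooperate() const
    {
        if (mStopRequested.load(std::memory_order_relaxed))
            throw std::runtime_error("render cancelled");
    }

private:
    std::atomic<bool> mStopRequested{false};
};
```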

wfpokorny commented 5 years ago

Stopping for the day, but a major slowdown happened between commits e16458c3 and 3f56c5fb. Using statically linked, generic-architecture, -O3-compiled modules, we were running 6% faster than v3.7 stable at commit e16458c3, and are then +16.8% slower than that speed at commit 3f56c5fb. Given the commits in between, I'd bet on 725efe1f, but we'll see.

x 3f56c5fb message: Fixed Unix build errors.
  725efe1f message: Merge branch 'refactor'
  1fac52a9 message: Merge branch 'feature/unit_tests'
  845851e7 message: Source code documentation pertaining to entities spanning more than one file
x e16458c3 message: Fixed Unix build errors
wfpokorny commented 5 years ago

Rushed info as I'll soon be away until next week. I was able to use the existing -O3 generic-arch static modules above with the perf tool, and saw enough that I think the BBoxPriorityQueue introduction is probably the cause of the performance slowdown.

3f56c5fb_opt3.txt

e16458c3_opt3.txt

Certainly more to do, including -Og compiles for more readable perf output. I also need to verify, while looking at the current v37 stable and v38 master, that we see a similar indication BBoxPriorityQueue is the cause.

Oh, I forgot to say in the previous comment: I did image compares across all versions, including v37 and v38 master. The output images match, as do all render statistics from e16458c3 to 3f56c5fb.

wfpokorny commented 5 years ago

@LeForgeron, I've explored more here and my thoughts currently more or less echo your last paragraph.

The commit 725efe1f is certainly involved. My leading suspect there is POV_MALLOC and POV_FREE being replaced by C++ methods - an update that happened in quite a few more files than the bbox priority queue code for that commit. When I compare the current v37 stable to the master, there look to be additional slowdown factors - not entirely unexpected, as there is a slowdown v37 to v38.

The 725efe1f vector-backed BBoxPriorityQueue::Insert and BBoxPriorityQueue::RemoveMin are noticeably slower, but the values I've seen don't come close to accounting for the whole of the slowdown. I had the thought to run the following experiments to avoid that code (-Og, perf-tooling-oriented compiles).

+bm2  
---
10.27user 0.01system 0:10.70elapsed  v37
13.37user 0.03system 0:13.80elapsed  v38
10.27 -> 13.37 ---> 30.19%

Usual slabs method 
---
11.10user 0.02system 0:11.55elapsed v37
15.21user 0.03system 0:15.66elapsed v38
11.10 -> 15.21 ---> 37.03%

Again slabs, but using +mb50 to turn off bounding (1). 
---
 9.16user 0.02system 0:09.60elapsed v37
12.18user 0.03system 0:12.66elapsed v38
 9.16 -> 12.18 ---> 32.97% 

(1) Without my -mb patch branch there is always a bbox find-intersection around the object 
even with bounding off via +mb<count> or -mb, in both v37 stable and master.  
+bm2 (slabs -> bm2) is always faster for this test scene.   
---
11.10 -> 10.27 v37 ---> -7.48%
15.21 -> 13.37 v38 ---> -12.10%

Also hurriedly tried a solid red texture and got a surprise.

slabs. solid red. no AA (Still slows, but less so than transparent)  
---
1.92user 0.03system 0:02.25elapsed v37
2.45user 0.04system 0:02.81elapsed v38
1.92 -> 2.45 ---> 27.60%

slabs. solid red. With heavy AA. (Yep. Good news, but what the heck...)
---
7.31user 0.06system 0:07.66elapsed v37
4.57user 0.05system 0:04.92elapsed v38
7.31 -> 4.57 ---> -37.48% 

I've looked at things quite a few different ways with the perf tool set and a few extensions others have written, but not much with +bm2, no bounding, or the AA solid-color surprise, which I did late. The same AA with transparency is 20% slower instead of 37% slower, mirroring what was noted in the original issue post. Lastly, if the memory allocation changes are the major cause, I've not yet seen it in the perf output - simpler stuff anyway: counting cache-misses shows few misses compared to references, though some difference. The perf tool set options go deep. Perhaps I just haven't come up with a way to see differences in memory allocation / free performance.

I'll not be able to get back to this until later this week. Next up, I want to try the gprof tool, as I remember it being a little better at pointing at slowing in specific functions.

Edit: Aug 14, 2019.

I wake up and blast my old man's memory! Forget the AA speed-up / mitigation discussed previously. I'm seeing my own AA fix from 2+ years back: 3adbf759 & pull request #270. Use AA method 1 to avoid the method-2 speed-up fix.

wfpokorny commented 5 years ago

An update.

The bad news. For users getting pre-compiled versions of POV-Ray where the compile target is generic, the v37 to v38 performance penalty is 30-40% or roughly double the 18+% degrade originally reported.

Further, the degrade here is for a rather typical use case - one not well represented by the benchmark scene, which is dominated by noise and media performance. I now believe the performance issue here needs to be addressed prior to any final 3.8 release.

The good news. I've had success recovering part of the BBoxPriorityQueue performance degrade by initially reserving a larger chunk of std::vector memory. Namely, using again INITIAL_PRIORITY_QUEUE_SIZE in boundingbox.cpp, as in:

BBoxPriorityQueue::BBoxPriorityQueue()
{
    mQueue.reserve(INITIAL_PRIORITY_QUEUE_SIZE); // Added this line
    mQueue.resize(BBQ_FIRST_ELEMENT); // element 0 is reserved
}

where today the master branch initializes to a size of 1. With the code change above I ran through a series of initial allocations as follows measuring performance:

(/usr/bin/time povrTRY -j -fn -d -p +wt2 +a0.0 +am1 +r3 +W1600 +H1000 FS14.pov)

 1   109.05user 0.03system 0:55.11elapsed
 4   108.43user 0.02system 0:54.80elapsed
 8   108.02user 0.03system 0:54.56elapsed
12   107.74user 0.04system 0:54.45elapsed  (min scene needs) 
16   107.72user 0.03system 0:54.41elapsed
20   107.70user 0.04system 0:54.41elapsed   
48   107.80user 0.05system 0:54.51elapsed -1.15%  

Further, I've learned how to better see cache behaviour with perf, and can clearly see, looking at 1 vs 48, a 31% reduction in L1 cache loads. The locality of reference is much improved when the initial memory allocation is larger and contiguous. More often, the memory needed next is in the cache line(s) already in the local CPU buffers.
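The mechanics of the reserve() change are easy to demonstrate in isolation. A minimal sketch (hypothetical helper, not POV-Ray code) that counts how often a growing std::vector reallocates, with and without an up-front reserve:

```cpp
#include <cstddef>
#include <vector>

// Count how many times a std::vector reallocates (moves its storage)
// while growing to n elements. Reserving the final size up front should
// bring this to zero - the effect the INITIAL_PRIORITY_QUEUE_SIZE
// reserve in BBoxPriorityQueue's constructor exploits, and one reason
// locality of reference improves.
static int CountReallocations(std::size_t n, std::size_t reserved)
{
    std::vector<double> v;
    if (reserved > 0)
        v.reserve(reserved);
    int reallocs = 0;
    const double* data = v.data();
    for (std::size_t i = 0; i < n; ++i)
    {
        v.push_back(0.0);
        if (v.data() != data) { ++reallocs; data = v.data(); }
    }
    return reallocs;
}
```

With no reserve, the storage moves several times on the way to 48 elements; with reserve(48), it never moves and stays in one contiguous allocation.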

Anyway, we know the BBoxPriorityQueue is a significant part of the slowdown, and speeding things back up is doable - the change above, if nothing else. Using simplevector over std::vector, or other better implementations, maybe. Faster code here will help, and comparatively it's not too hard to do.

The murky news. Running gprof pointed to the same things as perf, except when I turned off optimization completely. There the slowdown culprit was clearly our updated vector template class. Of course, typically, those operations are inlined by even low-level optimization. Pointedly though, after such inlining, the higher-level functions slowing the most are primarily those into which these inlines happen.

I've selectively looked at assembly-annotated code. It's clear there is some inefficiency due to the [] overloading (the g++ compiler doesn't see - or is unable (alias effects?) to see - that it can do better than three loads from memory into the registers on CPUs like mine). Further, we are now always initializing allocated vectors to 0 where it's very often unnecessary. Where I've looked, the zero setting is also getting done by one opcode for each double in the vector.
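The unnecessary zeroing comes down to which std::vector growth path the code takes. A minimal sketch with hypothetical helper names:

```cpp
#include <cstddef>
#include <vector>

// std::vector<double>(n) value-initializes every element - one store of
// 0.0 per double, matching the per-double zeroing opcodes seen in the
// annotated assembly - even when the caller is about to overwrite them.
std::vector<double> MakeZeroed(std::size_t n)
{
    return std::vector<double>(n);
}

// reserve() only allocates capacity; nothing is written, and elements
// appear later as they are push_back'd or emplace'd.
std::vector<double> MakeReserved(std::size_t n)
{
    std::vector<double> v;
    v.reserve(n);
    return v;
}
```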

In the first '[]' case, I was able to modify the template slightly to get one memory-to-register load instruction by changing the '=' overload's b part from, say, b[X] to b.x(), and so on for y and z. In other words, the g++ compiler was then able to optimize, for whatever reason(s). Unfortunately, this actual change didn't run faster, though I was able to measure a reduction in instructions run! In hindsight, I was doing my typical quiet-machine +wt1 single-thread comparisons, which tend to bind per-CPU resources less.
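For readers following along, the two access styles look like this. The Vec3 type and Add function are hypothetical, not the actual POV-Ray vector template; the point is only that named accessors returning a scalar can be easier for the compiler to keep in registers than an operator[] over an array, where alias analysis may force extra loads:

```cpp
// Hypothetical 3-component vector sketch illustrating the b[X] vs b.x()
// distinction discussed above.
struct Vec3
{
    double vect[3];

    // Index-style access: reads go through the array subscript, which
    // the compiler may not fully see through (alias effects).
    double  operator[](int i) const { return vect[i]; }
    double& operator[](int i)       { return vect[i]; }

    // Named accessors: each returns one scalar member directly.
    double x() const { return vect[0]; }
    double y() const { return vect[1]; }
    double z() const { return vect[2]; }
};

// An '=' style operation written with the named accessors rather than
// b[X], b[Y], b[Z].
inline Vec3 Add(const Vec3& a, const Vec3& b)
{
    return Vec3{{ a.x() + b.x(), a.y() + b.y(), a.z() + b.z() }};
}
```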

Though murky, I now suspect the updated vector template class in addition to the change of memory allocation to C++ methods and the bbox priority queue stuff. However, proving particular whys and coming up with improvements is going to be difficult - especially given I'm still a C++ hack programmer and not a real one.

The gprof work does show, for example, Ray::~Ray() slowing by 850% in the solid red case and 600% in the original partly transparent spheres case. So, evidence exists that memory allocation / de-allocation performance is degraded v37 to v38, but it doesn't seem to be that large a part of the entire slowdown as I'm currently measuring things - 1 or 2% of the 18%, maybe.

Aside: I think we could help render performance some by capturing more information during parsing. It should be that we know how many lights we have, and that we can determine an upper bound for the vector size needed in BBoxPriorityQueue. Might we be able to shrink-wrap some vectors to the necessary size when parsing is done, for better performance? Ideas for another day...

Next up? I might see if I can hack together different bbox priority queue code not using any vectors just to see how much performance recovery might be possible with changes there.
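One shape such a vector-free queue could take is an array-backed binary min-heap. The sketch below uses hypothetical names and a compile-time capacity; a real replacement would have to size the array safely (e.g. from the bounding hierarchy depth) and handle overflow properly:

```cpp
#include <cstddef>

// Hypothetical fixed-capacity, array-backed min-heap keyed on a double
// (e.g. intersection depth) - a sketch of a BBoxPriorityQueue-style
// structure with no std::vector, and so no reallocation at all.
template <typename T, std::size_t Capacity>
class FixedMinHeap
{
public:
    bool Insert(double key, const T& value)
    {
        if (mSize >= Capacity)
            return false; // caller must handle overflow
        std::size_t i = mSize++;
        // Sift up: move larger parents down until the slot for key is found.
        while (i > 0)
        {
            std::size_t parent = (i - 1) / 2;
            if (mKeys[parent] <= key)
                break;
            mKeys[i] = mKeys[parent];
            mValues[i] = mValues[parent];
            i = parent;
        }
        mKeys[i] = key;
        mValues[i] = value;
        return true;
    }

    bool RemoveMin(double& key, T& value)
    {
        if (mSize == 0)
            return false;
        key = mKeys[0];
        value = mValues[0];
        // Move the last element to the root and sift it down.
        --mSize;
        double k = mKeys[mSize];
        T v = mValues[mSize];
        std::size_t i = 0;
        for (;;)
        {
            std::size_t child = 2 * i + 1;
            if (child >= mSize)
                break;
            if (child + 1 < mSize && mKeys[child + 1] < mKeys[child])
                ++child; // pick the smaller child
            if (mKeys[child] >= k)
                break;
            mKeys[i] = mKeys[child];
            mValues[i] = mValues[child];
            i = child;
        }
        if (mSize > 0) { mKeys[i] = k; mValues[i] = v; }
        return true;
    }

    std::size_t Size() const { return mSize; }

private:
    double mKeys[Capacity];
    T mValues[Capacity];
    std::size_t mSize = 0;
};
```

Whether a flat-array heap actually beats the reserved std::vector version would need measuring; the interesting comparison is cache behaviour, since the storage here is embedded in the object itself.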

wfpokorny commented 5 years ago

In the previous post I noted playing with changes to the vector template class which allowed the compiler to better optimize on non-generic x86 architectures (for SIMD) and that I saw no performance improvement despite seeing a reduction in instructions.

This does depend on hardware thread count - and for thread counts <= core counts it conflicts, performance-wise, with the BBoxPriorityQueue std::vector reserve change in a way which results in no performance gain at 1 thread and something more than 1% faster at 4 threads on my dual-core i3. At one thread, without the std::vector reserve change, there is a 0.25% improvement.

The updates to vector.h are part of why we have long seen more slowdown at higher thread counts vs lower ones with v38 compared to v37. The FS14.pov scene in fact runs no faster with thread counts > the core count in v38 - except a very tiny bit faster with my updates to the new vector template class.

The minimal changes I made to the vector template class are probably worth doing, as is avoiding the [] overloads wherever possible throughout the code (1). Most users will just run POV-Ray, which means threads > cores. For users compiling to a target supporting SIMD of some sort, things will run faster. Further, compiling more of the code for specific targets is something we should be looking to do, selectively, for more than just the noise functions in GNU-tooling-based compiles. We are otherwise not taking best advantage of newer processors.

Today, users just installing POV-Ray are getting a non-SIMD generic compile - except for the noise optimizations done by AMD and Intel - and the sorts of changes I've made to the vector template class won't help those users. Won't hurt either, I guess...

I've looked a little at other vector template classes, and high-performance ones are difficult to do well. As with other aspects of the slowdown, I now know the updated vector.h is involved, but I don't know how large a part the vector.h updates play.

(1) - FYI, I found someone else had long ago put a comment in the code saying as much, so I'm not the first to see the b[X] to .x() etc. bit.