MatteoLacki / IsoSpec

Libraries for fine isotopic structure calculator.

Feature/fast #18

Closed. hroest closed this 5 years ago.

hroest commented 5 years ago

An exploration into some micro-optimizations for the IsoThresholdGenerator which I did over the weekend and did not have time to upload until now. I also added a bugfix. The basic idea is to replace some of the computation in the tight loop with pointer operations. In addition, I added a pure counter that skips the multiplication and extra addition for the masses and exp probabilities. On the attached benchmark code I get:

$ g++ -std=c++11 -Wall -I../../IsoSpec++ -Wextra -pedantic -lpthread   -pthread  -O3 benchmark.cpp    -o benchmark
$ /usr/bin/time ./benchmark  
2009674675
Using IsoThresholdGeneratorCntr, it took 2.18512s 
2009674675
Using IsoThresholdGeneratorFast, it took 5.1261s 
2009674675
Using legacy, it took 6.54023s 
13.85user 0.00system 0:13.85elapsed 99%CPU (0avgtext+0avgdata 3888maxresident)k
0inputs+0outputs (0major+183minor)pagefaults 0swaps
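For context, here is a minimal sketch (not the actual IsoSpec code; the names below are made up) of the kind of rewrite meant by "replace some of the computation in the tight loop with pointer operations":

```cpp
#include <cstddef>

// Illustrative only: the "before" version recomputes index arithmetic on every
// pass through the tight loop; the "after" version walks precomputed pointers.
double sum_indexed(const double* masses, const int* counts, std::size_t n)
{
    double total = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        total += masses[counts[i]];          // index load + add each iteration
    return total;
}

double sum_via_pointers(const double* const* mass_ptrs, std::size_t n)
{
    double total = 0.0;
    const double* const* p = mass_ptrs;
    for (const double* const* end = p + n; p != end; ++p)
        total += **p;                        // dereference + pointer bump only
    return total;
}
```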

So on this benchmark there is a nice speed gain of 27%, going from 6.54 s to 5.13 s for iterating over all configurations (2009674675, i.e. about 2.0e9 configurations). This translates to reducing the time per configuration from 3.25 ns to 2.54 ns. In the "count only" code, removing the extra addition and multiplication brings this down further to 2.18 s (or 1.08 ns per configuration); for reference, one cycle on my CPU is about 0.3 ns. We can shave off another 0.1 to 0.15 s by using while (LIKELY(generator.advanceToNextConfiguration())) size++; but at that point the measurement variance becomes larger than the effect size.
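For completeness, a sketch of the count-only loop with the branch hint; the macro below is the standard __builtin_expect idiom, and it is an assumption that the PR spells LIKELY exactly this way:

```cpp
// Standard __builtin_expect wrapper; whether IsoSpec defines LIKELY exactly
// like this is an assumption.
#if defined(__GNUC__) || defined(__clang__)
#define LIKELY(x) __builtin_expect(!!(x), 1)
#else
#define LIKELY(x) (x)
#endif

// Count-only loop: the hint tells the compiler that advancing almost always
// succeeds, so the hot path stays fall-through and the exit branch is cold.
template <typename Generator>
unsigned long long count_configurations(Generator& generator)
{
    unsigned long long size = 0;
    while (LIKELY(generator.advanceToNextConfiguration()))
        size++;
    return size;
}
```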

Interestingly, for clang I don't see speed gains; however, clang is about 2x slower here anyway, so maybe that is not as relevant. It seems that it does not fully optimize the code and the function call.

$ clang++ -std=c++11 -Wall -I../../IsoSpec++ -Wextra -pedantic -lpthread   -pthread  -O3 benchmark.cpp    -o benchmark
$ /usr/bin/time ./benchmark  
2009674675
Using IsoThresholdGeneratorCntr, it took 4.91654s 
2009674675
Using IsoThresholdGeneratorFast, it took 9.50922s 
2009674675
Using legacy, it took 9.48117s 
23.90user 0.00system 0:23.91elapsed 99%CPU (0avgtext+0avgdata 3796maxresident)k
0inputs+0outputs (0major+182minor)pagefaults 0swaps

Looking at the assembly, clang for whatever reason decides to emit a full function call for each generator.advanceToNextConfiguration(), which produces a lot of overhead. However, even when replacing the function call with a single tight loop, I can only get clang down to 3.0 seconds for the Cntr loop, which is still substantially slower than the 2.1 seconds of gcc, but it explains where most of the slowdown comes from (failure to inline a function). gcc does not profit from the single tight loop because it already inlined the function in the first place.
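One portable way to force the inlining discussed here is sketched below; it is an assumption that the PR uses this exact mechanism rather than, say, restructuring the function, but the attribute spellings are the standard ones for gcc/clang and MSVC:

```cpp
// Compiler-specific "always inline" spellings.
#if defined(_MSC_VER)
#define FORCE_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
#define FORCE_INLINE inline __attribute__((always_inline))
#else
#define FORCE_INLINE inline
#endif

// Toy generator (not the real IsoThresholdGenerator) showing where the
// attribute would go: on the per-iteration advance call that clang was
// otherwise emitting as an out-of-line function call.
struct ToyGenerator
{
    long long remaining;

    FORCE_INLINE bool advanceToNextConfiguration()
    {
        return --remaining >= 0;
    }
};
```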

PS: I promise these are the last changes from my side :-)

hroest commented 5 years ago

Some additional benchmarks from MSVC (this was done on a different system than the one above):

>benchmark.exe
2009674675
Using IsoThresholdGeneratorCntr, it took 5.256s
2009674675
Using IsoThresholdGeneratorFast, it took 8.96s
2009674675
Using legacy, it took 11.402s

So on MSVC we get an almost identical 27% speedup.

hroest commented 5 years ago

PS: I don't really know how best to propose this here. If you also see consistent speed gains on your computers, then I suggest dropping IsoThresholdGeneratorFast and integrating the changes directly into IsoThresholdGenerator, as this makes the most sense. For benchmarking it is of course great to have both.

michalsta commented 5 years ago

Looks great at a glance, I didn't know about the __builtin_expect thing. You too seem to have caught the "let's see if there's any drop of performance we can still squeeze out of it" bug ;)

I'll try to have a more in-depth look as soon as I can, but for now I'll probably have to let it hang for a day or a few though...

hroest commented 5 years ago

With the latest clang fix, gcc and clang now perform equally on my machine:

$ clang++ -std=c++11 -Wall -I../../IsoSpec++ -Wextra -pedantic -lpthread   -pthread  -O3 benchmark.cpp  -march=native  -mtune=native -o benchmark
$ /usr/bin/time ./benchmark 
2009674675
Using IsoThresholdGeneratorCntr, it took 3.05633s
2009674675
Using IsoThresholdGeneratorFast, it took 4.92318s
2009674675
Using legacy, it took 9.34187s
17.32user 0.00system 0:17.32elapsed 100%CPU (0avgtext+0avgdata 3652maxresident)k
0inputs+0outputs (0major+182minor)pagefaults 0swaps
$ g++ -std=c++11 -Wall -I../../IsoSpec++ -Wextra -pedantic -lpthread   -pthread  -O3 benchmark.cpp  -march=native  -mtune=native -o benchmark
$ /usr/bin/time ./benchmark 
2009674675
Using IsoThresholdGeneratorCntr, it took 2.20852s
2009674675
Using IsoThresholdGeneratorFast, it took 5.25915s
2009674675
Using legacy, it took 6.62088s
14.08user 0.00system 0:14.08elapsed 99%CPU (0avgtext+0avgdata 3716maxresident)k
0inputs+0outputs (0major+181minor)pagefaults 0swaps

It seems gcc is faster for the counter, but clang is slightly faster for the full analysis. Note how the speed gain for clang is almost 2x here. Aggressive inlining really helps, combined with our knowledge that in most loop iterations we exit at the first return statement. The compiler may not know this and treats it as a "large" function, while in practice it is only "large" every N iterations; otherwise it is "small" and benefits from inlining.
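A sketch (again with made-up names, not the IsoSpec source) of the control-flow shape described above: the function body looks long, but on almost every call it returns from the first branch after a cheap update, and only every N-th call falls through into the longer carry/reset logic:

```cpp
// Toy advance function: "small" on the common path, "large" only when the
// innermost counter wraps. An inliner that judges the function by its total
// size misses that the executed path is usually just a few instructions.
struct NestedCounter
{
    int inner = 0;
    int outer = 0;
    static const int N = 1000;

    bool advanceToNextConfiguration()
    {
        if (++inner < N)        // taken on N-1 out of N calls: tiny hot path
            return true;

        // Rare slow path: reset the inner counter and advance the outer one.
        inner = 0;
        return ++outer < N;
    }
};
```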

hroest commented 5 years ago

OK, MSVC is now also faster, with a gain of 1.5x compared to legacy when inlining:

>benchmark.exe
2009674675
Using IsoThresholdGeneratorCntr, it took 4.804s
2009674675
Using IsoThresholdGeneratorFast, it took 7.568s
2009674675
Using legacy, it took 11.488s

Note that this optimization is potentially dangerous, as it is tuned to the benchmark and may not work well for other code. However, I think the compiler is simply confused about the function: it looks like a large function that is not worth inlining, and the compiler does not notice the internal structure explained above, so in most cases forcing inlining is correct. Still, we don't know what clients will do with the function.

hroest commented 5 years ago

> I'll try to have a more in-depth look as soon as I can, but for now I'll probably have to let it hang for a day or a few though...

sure, no rush here :-)

hroest commented 5 years ago

Interestingly, clang manages to produce the tight inner loop in 22 instructions, while gcc produces 23. Small as that difference is, it leads to about a 5% performance difference (clang first, then gcc):

$ perf stat -r 1 -e cycles,instructions ./benchmark
2009674675
Using IsoThresholdGeneratorFast, it took 4.92168s

 Performance counter stats for './benchmark':

    17,213,148,154      cycles
    48,779,204,652      instructions              #    2.83  insns per cycle

       4.924153197 seconds time elapsed
$ perf stat -r 1 -e cycles,instructions ./benchmark
2009674675
Using IsoThresholdGeneratorFast, it took 5.20204s

 Performance counter stats for './benchmark':

    18,282,782,248      cycles
    50,800,086,802      instructions              #    2.78  insns per cycle

       5.204852910 seconds time elapsed

As you can see, there are almost exactly 2e9 fewer instructions for clang (48.8e9 vs 50.8e9), which matches one instruction saved per configuration over the ~2.0e9 configurations. On top of that, the clang code is somehow more efficient and achieves more instructions per cycle (2.83 vs 2.78).