hycakir closed 5 years ago
For me:
Before:
fannkuchredux-fast.jl 1 8.62s 22.1%
After:
fannkuchredux-fast.jl 1 7.91s 21.3%
If you are running with 8 or more threads, changing the block size to 16, 24 or 32 should also help. Judging from the results, the machine they use for benchmarking runs 4 threads, which is why I set it to 12.
This should (hopefully) run slightly faster than the latest Julia and Java implementations, though that still needs a test. count_flips (the bottleneck) is now slightly faster. The parallelization now works the same way as in Jeremy Zerfas' C implementation: the block sizes are changed, the atomic operations are removed, and the results are reduced after the threads join.
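To illustrate the "no atomics, reduce after join" idea, here is a minimal sketch. It is not the code from this PR. The names `nth_permutation`, `process_block`, `fannkuch_blocks` and the `nblocks` keyword are hypothetical, and the permutation order used here is a plain factorial-number-system order, so the checksum it produces is not expected to match the official benchmark checksum. Each task works on its own block and returns a local (checksum, maxflips) pair, and the pairs are combined only after every task has finished.

```julia
using Base.Threads: @spawn

# Count prefix reversals ("flips") until the first element is 1.
function count_flips!(p::Vector{Int})
    flips = 0
    k = p[1]
    while k != 1
        reverse!(p, 1, k)   # reverse the first k elements in place
        flips += 1
        k = p[1]
    end
    return flips
end

# Hypothetical helper: build the idx-th permutation (0-based) of 1:n
# using the factorial number system. This is NOT the benchmark's ordering.
function nth_permutation(n::Int, idx::Int)
    pool = collect(1:n)
    p = Vector{Int}(undef, n)
    for i in 1:n
        d, idx = divrem(idx, factorial(n - i))
        p[i] = pool[d + 1]
        deleteat!(pool, d + 1)
    end
    return p
end

# Process one block of permutation indices with purely local accumulators,
# so no atomic operations or locks are needed.
function process_block(n::Int, first_idx::Int, last_idx::Int)
    checksum = 0
    maxflips = 0
    for idx in first_idx:last_idx
        flips = count_flips!(nth_permutation(n, idx))
        checksum += iseven(idx) ? flips : -flips
        maxflips = max(maxflips, flips)
    end
    return (checksum, maxflips)
end

# Split the index range into blocks, run one task per block, then reduce.
# `nblocks = 12` is an assumed default chosen only for illustration.
function fannkuch_blocks(n::Int; nblocks::Int = 12)
    total = factorial(n)
    blocksize = cld(total, nblocks)
    tasks = [@spawn process_block(n, lo, min(lo + blocksize - 1, total - 1))
             for lo in 0:blocksize:(total - 1)]
    results = fetch.(tasks)                 # wait for all tasks to finish
    return (sum(first, results), maximum(last, results))
end
```

Calling `fannkuch_blocks(7)` returns a (checksum, maxflips) tuple. Because each block only touches its own accumulators, the result does not depend on how many threads Julia was started with or on how the blocks are scheduled.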