This should (hopefully) run slightly faster than the latest Julia and Java implementations. `count_flips` (the bottleneck) is now slightly faster. The parallelization now follows Jeremy Zerfas' C implementation: the block sizes are changed, atomic operations are removed, and the per-thread results are reduced after the threads join.
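For reviewers, here is a minimal sketch of the "no atomics, reduce after join" pattern described above. It is not the code in this PR: `blocked_reduce`, its `nblocks` keyword, and the per-item function `f` are hypothetical names used only for illustration, and the block sizing is a simple even split rather than the scheme used in the C implementation.

```julia
using Base.Threads: @spawn

# Hypothetical helper: split the indices 1:n into blocks, let each task
# accumulate its own local (sum, max), and combine the per-task results
# only after every task has finished. No atomics or locks are needed,
# because no task writes to shared state.
function blocked_reduce(f, n::Int; nblocks::Int = 4 * Threads.nthreads())
    blocksize = cld(n, nblocks)          # ceiling division, so all indices are covered
    tasks = map(1:nblocks) do b
        lo = (b - 1) * blocksize + 1
        hi = min(b * blocksize, n)       # the range lo:hi is empty for surplus blocks
        @spawn begin
            local_sum = 0
            local_max = 0
            for i in lo:hi
                v = f(i)
                local_sum += v
                local_max = max(local_max, v)
            end
            (local_sum, local_max)       # the task's result, read later via fetch
        end
    end
    results = fetch.(tasks)              # wait for all tasks (the "join")
    return (sum(first, results), maximum(last, results))
end

# Usage with a toy per-item function standing in for the real work:
total, biggest = blocked_reduce(i -> i % 7, 100)
```

Because each task returns its own tuple, the final reduction runs on a single thread over only `nblocks` values, so its cost is negligible next to the per-item work.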
Unfortunately, I forgot to remove `@time` from the last version, so I amended the last commit on my branch and force-pushed. I think GitHub pull requests no longer support this. Should I create a new PR?