PlummersSoftwareLLC / Primes

Prime Number Projects in C#/C++/Python
https://plummerssoftwarellc.github.io/PrimeView/
2.44k stars 575 forks source link

Optimized PrimeC/solution 5 - 50% speed increase and multiprocessor supported #866

Closed rogiervandam closed 1 year ago

rogiervandam commented 1 year ago

Optimized the extend algorithm in C - up to 30% speed increase depending on platform. Supports multiprocessor (Gustafson scaling / embarrassingly parallel)

Description

Highly optimized version for C using word and vector unrolling, hints and other tweaks. Multiprocessor also supported.

Contributing requirements

rogiervandam commented 1 year ago

Found an error

rogiervandam commented 1 year ago

Solved an issue with #pragma ivdep causing invalid sieves on some platforms. Removed -malign-mem=cacheline which did not work on ARM Macos in docker

rogiervandam commented 1 year ago

32k is the default. The algorithm tries to find the right setting in the tuning phase, around the 2^ sizes. On my machines there is little difference between different block sizes. But I’ll try the 24k setting. Thanks!

Regards, Rogier

Van: Mooshua @.> Datum: dinsdag, 8 november 2022 om 20:11 Aan: PlummersSoftwareLLC/Primes @.> CC: Rogier van Dam @.>, State change @.> Onderwerp: Re: [PlummersSoftwareLLC/Primes] Optimized PrimeC/solution 5 - 30% speed increase and multiprocessor supported (PR #866)

@Mooshua commented on this pull request.

What made you choose a blocksize of 32? Did you actually benchmark it or is it just a loose heuristic?

The reason I ask is that on PrimeLua 3, a block size of 24kb (“c48k” was found to be fastest even on platforms with large amounts of cache (32kb trailed slightly behind that).

— Reply to this email directly, view it on GitHubhttps://github.com/PlummersSoftwareLLC/Primes/pull/866#pullrequestreview-1172698798, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AAJQ7RPYS2QBE7YQQ26RKZDWHKQVHANCNFSM6AAAAAARYOTRLM. You are receiving this because you modified the open/close state.Message ID: @.***>

Mooshua commented 1 year ago

Interesting. Thanks for humoring my question! :-)

rbergen commented 1 year ago

@rogiervandam Quick note from my end concerning process: I'm going to wait for this PR to be stable for a few days at least before I proceed with reviewing and merging it.

rogiervandam commented 1 year ago

Actually i’m almost done and love to see the performance on your machines. 

rogiervandam commented 1 year ago

No more changes expected in this pull request.

rbergen commented 1 year ago

@rogiervandam When I run this version of the solution, I get the following output:

rogiervandam_extend;89299;5.000053;1;algorithm=other,faithful=yes,bits=1
rogiervandam_extend_epar;247686;5.000149;8;algorithm=other,faithful=yes,bits=1
rogiervandam_extend_epar;195215;5.000084;4;algorithm=other,faithful=yes,bits=1
rogiervandam_extend_epar;124797;5.000067;2;algorithm=other,faithful=yes,bits=1
rogiervandam_extend_epar;76113;5.000026;1;algorithm=other,faithful=yes,bits=1

As shown, 4 of the 5 runs use the same label rogiervandam_extend_epar. This will hinder the proper processing of the solution's results by the automated benchmark tooling. Please ensure that each output line uses a unique label.

rogiervandam commented 1 year ago

@rogiervandam When I run this version of the solution, I get the following output:

rogiervandam_extend;89299;5.000053;1;algorithm=other,faithful=yes,bits=1
rogiervandam_extend_epar;247686;5.000149;8;algorithm=other,faithful=yes,bits=1
rogiervandam_extend_epar;195215;5.000084;4;algorithm=other,faithful=yes,bits=1
rogiervandam_extend_epar;124797;5.000067;2;algorithm=other,faithful=yes,bits=1
rogiervandam_extend_epar;76113;5.000026;1;algorithm=other,faithful=yes,bits=1

As shown, 4 of the 5 runs use the same label rogiervandam_extend_epar. This will hinder the proper processing of the solution's results by the automated benchmark tooling. Please ensure that each output line uses a unique label.

I expected that the other characteristics on the same line - like the number threads - would contribute to the entry in the benchmark tooling. Now that i know it doesn't, I made a new version to address the issue; this version also has multiple complile options. On some machines, using uint32 seems to be faster than u64 hence the multiple configurations.

Now i am very curious how this submission performs on your Gen Intel® Core™ i3-1125G4, as i don't have a Avx512 machine. On my M1 mac - which is unfortunately is not in your benchmark reports - i touched the 96k/5sec milestone. Will try to get it to o 100k but that needs some more time and tweaking..

rbergen commented 1 year ago

The reason we use the label as "the" identifier for an implementation is that implementations can develop over time. One of the things we (the maintainers) are exploring is comparing the performance of solutions/implementations across the time axis. That kind of requires that each implementation is uniquely identifiable by its name, even if its characteristics change.

Thank you for your update addressing this. However, we now run into another issue. The number of runs in the solution has expanded from 5 to 25, which makes that the solution's overall benchmark duration exceeds 4 minutes, We've more or less settled on the maximum benchmark duration for any one solution to be one minute, unless there are specific reasons to deviate from that. The general request to contributors is to "cherry pick" their variations to stay within (or reasonably close to) the one minute timeframe, which is thus what I am now kindly requesting you to do as well.

Results on my i3 will be available the first morning after we merge your PR. Concerning results on an M1; I'd love to add them, but I don't have an M1 machine available to run the benchmarks on, nor have we been offered M1s to use for this purpose. I guess this project is not particularly sponsor-friendly. :)

rogiervandam commented 1 year ago

Thank you for the fast response. I reduced the number of alternatives and reduced the internal benchmark time to stay within one minute.