PlummersSoftwareLLC / Primes

Prime Number Projects in C#/C++/Python
https://plummerssoftwarellc.github.io/PrimeView/

Request for a replication of observations #147

Closed TediusTimmy closed 3 years ago

TediusTimmy commented 3 years ago

My first thought in trying to make a parallel sieve algorithm was to trade memory for time. To that end, I changed the bit field that the program operates on into a byte field: since the processor can address individual bytes, flagging a number as composite no longer requires a read-modify-write; it becomes a plain write, which has no hazards. Which is to say, I changed vector<bool> to vector<char>.
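Roughly, the marking loop goes from the first function below to the second (an illustrative sketch, not my actual code):

```cpp
// Sketch only: contrasting the two layouts for the composite-marking loop.
#include <cstddef>
#include <vector>

void markBits(std::vector<bool>& composite, std::size_t p, std::size_t limit) {
    // Bit-packed: each assignment is a read-modify-write of the word that
    // holds the bit, so concurrent writers to the same word would race.
    for (std::size_t m = p * p; m < limit; m += p)
        composite[m] = true;
}

void markBytes(std::vector<char>& composite, std::size_t p, std::size_t limit) {
    // One byte per flag: each assignment is a plain store with no
    // read-modify-write hazard, but the array is eight times larger.
    for (std::size_t m = p * p; m < limit; m += p)
        composite[m] = 1;
}
```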

What I noticed, though, and what I would like verified, is that the performance hit from this change is so large that no speedup from multi-threading could ever recover it. Does everyone else see that, or is my code just absolute garbage?

Thanks

rhbvkleef commented 3 years ago

For a sieve size of 1 million, you would use roughly 1 MiB at one byte per entry versus roughly 128 KiB at one bit per entry. My CPU has 256 KiB of L2 cache, so with one bit per entry the entire sieve fits in L2 with half of it to spare, whereas with one byte per entry it does not. I suspect that this is the cause. Try running your program under perf to see your cache behaviour.
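The arithmetic, roughly (approximate figures; if the sieve only tracks odd candidates, halve both numbers):

```cpp
// Back-of-the-envelope footprint for a 1,000,000-entry sieve.
#include <cstdio>

int main() {
    const double entries = 1'000'000.0;
    std::printf("bit-packed : %.0f KiB\n", entries / 8.0 / 1024.0);  // ~122 KiB
    std::printf("byte-packed: %.0f KiB\n", entries / 1024.0);        // ~977 KiB
}
```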

TediusTimmy commented 3 years ago

Thanks! That makes sense: the performance hit that I'm seeing is an accumulation of L2 cache misses. Thinking about it from the cache perspective, that may also explain some other weird behavior that I was seeing last night.

So, this may be a perfect example of trying to trade more memory for less time, only to find that more memory IS more time.

TediusTimmy commented 3 years ago

If you want to see this monstrosity that appears to work, it is here: https://github.com/TediusTimmy/Primes/blob/main/PrimeCPP_PAR/PrimeCPP_Threaded.cpp

Kinematics commented 3 years ago

On the other hand, as your sieve size increases, the parallelization gains start to outweigh the cache problems, especially once the total dataset grows large enough that even one bit per value overflows the cache.

In my own tests, at 10,000,000 values the parallel version just started to outperform the linear one (by about 15%), and the advantage grew to about 2.4x at 1,000,000,000 values.

Edit: Oh, and you can keep the memory savings of the bitpacking if you change the way you parallelize things. Break things up by range of values tested rather than by prime factor.
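A rough sketch of that range-based split (illustrative only, not the code from my tests): each thread owns its own bit-packed segment, so flagging composites never races across threads.

```cpp
// Parallelize by range of values: sieve the base primes up to sqrt(limit)
// serially, then let each thread cross them off in its own bit-packed segment.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    const uint64_t limit = 10'000'000;  // sieve 2..limit inclusive

    // Base primes up to sqrt(limit), found serially; this part is cheap.
    const uint64_t root = static_cast<uint64_t>(std::sqrt(static_cast<double>(limit)));
    std::vector<char> isPrime(root + 1, 1);
    std::vector<uint64_t> basePrimes;
    for (uint64_t p = 2; p <= root; ++p) {
        if (!isPrime[p]) continue;
        basePrimes.push_back(p);
        for (uint64_t m = p * p; m <= root; m += p) isPrime[m] = 0;
    }

    // Split the remaining range across threads; each thread writes only to
    // its own segment, so bit-level read-modify-writes never overlap.
    const unsigned numThreads = std::max(1u, std::thread::hardware_concurrency());
    const uint64_t segLen = (limit - 1 + numThreads) / numThreads;
    std::vector<std::vector<uint64_t>> segments(numThreads);
    std::vector<std::thread> pool;

    for (unsigned t = 0; t < numThreads; ++t) {
        pool.emplace_back([&, t] {
            const uint64_t lo = 2 + static_cast<uint64_t>(t) * segLen;
            const uint64_t hi = std::min(limit + 1, lo + segLen);
            if (lo >= hi) return;
            auto& bits = segments[t];
            bits.assign((hi - lo + 63) / 64, 0);  // a set bit marks a composite
            for (uint64_t p : basePrimes) {
                // First multiple of p in [lo, hi) that is at least p*p.
                const uint64_t start = std::max(p * p, ((lo + p - 1) / p) * p);
                for (uint64_t m = start; m < hi; m += p)
                    bits[(m - lo) / 64] |= 1ull << ((m - lo) % 64);
            }
        });
    }
    for (auto& th : pool) th.join();

    // Tally the primes across all segments.
    uint64_t count = 0;
    for (unsigned t = 0; t < numThreads; ++t) {
        const uint64_t lo = 2 + static_cast<uint64_t>(t) * segLen;
        const uint64_t hi = std::min(limit + 1, lo + segLen);
        for (uint64_t n = lo; n < hi; ++n)
            if (!(segments[t][(n - lo) / 64] & (1ull << ((n - lo) % 64)))) ++count;
    }
    std::printf("%llu primes up to %llu\n",
                static_cast<unsigned long long>(count),
                static_cast<unsigned long long>(limit));
    return 0;
}
```

Because each segment stays bit-packed and cache-sized, you keep the memory savings while still getting one independent chunk of work per thread.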