PlummersSoftwareLLC / Primes

Prime Number Projects in C#/C++/Python
https://plummerssoftwarellc.github.io/PrimeView/
2.44k stars 575 forks source link

Improve performance of my 1-bit Clojure solution #891

Closed axvr closed 1 year ago

axvr commented 1 year ago

Description

This change significantly improves performance of my 1-bit Clojure solution. Instead of using the Java BitSet class, I have created my own by representing the bit-array as the bits of longs in an array of longs.

I believe this change makes this the fastest 1-bit Clojure solution, and should beat any Java solutions using BitSet.

Contributing requirements

rbergen commented 1 year ago

@axvr Thanks for your submission. I've checked both versions (current and the one in this PR), and at least on my system the BitSet-based version is significantly faster than the other one.
With the current codebase, this is the output I get when running the solution in Docker:

Passes: 8502, Time: 5.001292454, Avg: 5.8824895E-4, Limit: 1000000, Count: 78498, Valid: True
axvr_clj_8-bit;8502;5.001292454;1;algorithm=base,faithful=yes,bits=8
Passes: 4349, Time: 5.000378914, Avg: 0.0011497767, Limit: 1000000, Count: 78498, Valid: True
axvr_clj_1-bit;4349;5.000378914;1;algorithm=base,faithful=yes,bits=1

With the new version, the output is this:

Downloading: org/clojure/clojure/1.12.0-alpha1/clojure-1.12.0-alpha1.pom from central
Downloading: org/clojure/clojure/1.12.0-alpha1/clojure-1.12.0-alpha1.jar from central
Passes: 8142, Time: 5.001105132, Avg: 6.1423547E-4, Limit: 1000000, Count: 78498, Valid: True
axvr_clj_8-bit;8142;5.001105132;1;algorithm=base,faithful=yes,bits=8
Passes: 2944, Time: 5.001365545, Avg: 0.0016988334, Limit: 1000000, Count: 78498, Valid: True
axvr_clj_1-bit;2944;5.001365545;1;algorithm=base,faithful=yes,bits=1

As you see, the number of passes per 5 seconds goes down from 4349 to 2944 for the 1-bit version. I've run both versions multiple times to make sure these results don't include any flukes. Although the exact numbers vary a little between runs, the new version is consistently slower than the current version.

As you are opening a PR on your own solution, I'm willing to still merge this in principle. However, considering the results I'm seeing, I'd like to double check if that's really what you want me to do.

axvr commented 1 year ago

Thanks for testing it. That is very odd, whenever I run it on my machine (Apple M1), it is defintiely quite a bit faster. I imagine the difference must be from optimisations being made by the processor (assuming you are using an x86 CPU) rather than the JVM since this behaviour is the same when I run it via Docker too.

I've added the old version back now with the new solution as an alternative to it, so even if it does benchmark worse, it won't matter too much.


Here are the results on my machine, and they are very consistent:

Passes: 7829, Time: 5.000277000, Avg: 6.386865E-4, Limit: 1000000, Count: 78498, Valid: True
axvr_clj_8-bit;7829;5.000277000;1;algorithm=base,faithful=yes,bits=8
Passes: 3760, Time: 5.000661000, Avg: 0.001329963, Limit: 1000000, Count: 78498, Valid: True
axvr_clj_1-bit;3760;5.000661000;1;algorithm=base,faithful=yes,bits=1
Passes: 4012, Time: 5.001308000, Avg: 0.0012465873, Limit: 1000000, Count: 78498, Valid: True
axvr_clj_1-bit_custom;4012;5.001308000;1;algorithm=base,faithful=yes,bits=1

(axvr_clj_1-bit_custom is the new solution added in this PR.)

rbergen commented 1 year ago

Ok, this is getting interesting now. With your updated PR, this is what I'm looking at:

Downloading: org/clojure/clojure/1.12.0-alpha1/clojure-1.12.0-alpha1.pom from central
Downloading: org/clojure/clojure/1.12.0-alpha1/clojure-1.12.0-alpha1.jar from central
Passes: 8849, Time: 5.002072265, Avg: 5.652698E-4, Limit: 1000000, Count: 78498, Valid: True
axvr_clj_8-bit;8849;5.002072265;1;algorithm=base,faithful=yes,bits=8
Passes: 1869, Time: 5.000449156, Avg: 0.0026754676, Limit: 1000000, Count: 78498, Valid: True
axvr_clj_1-bit;1869;5.000449156;1;algorithm=base,faithful=yes,bits=1
Passes: 3198, Time: 5.000829353, Avg: 0.0015637365, Limit: 1000000, Count: 78498, Valid: True
axvr_clj_1-bit_custom;3198;5.000829353;1;algorithm=base,faithful=yes,bits=1

It looks like in the context of your branch, your custom solution is indeed significantly faster than the "old" BitSet-based approach, but still slower than the old approach in the current codebase. Could this possibly be rooted in the differences between base Docker images (clojure:openjdk-18-tools-deps-alpine -> clojure:temurin-19-tools-deps-bullseye-slim) and/or Clojure versions (1.10.3 -> 1.12.0-alpha1)?

And yes, I am indeed running things on an x86_64 machine. As do all benchmark runners, with the exception of @marghidanu's Raspberry Pi 4.

axvr commented 1 year ago

Hm... The reason I changed the base image was because (for some reason) my Docker can't fetch the Alpine images... 🤔

image

I have compared it between the Clojure versions and there is no difference.

Testing on these images seems to have no difference:

Which implies that it is not a JVM difference, but must be an Alpine vs. Debian difference.

Unfortunately since I can't test the Alpine images myself I'm unable to check if this is the case. I'd very much appreciate it if you could test these base images for me:

rbergen commented 1 year ago

Results with clojure:openjdk-18-tools-deps-alpine:

Downloading: org/clojure/clojure/1.12.0-alpha1/clojure-1.12.0-alpha1.pom from central
Downloading: org/clojure/spec.alpha/0.3.218/spec.alpha-0.3.218.pom from central
Downloading: org/clojure/pom.contrib/1.1.0/pom.contrib-1.1.0.pom from central
Downloading: org/clojure/core.specs.alpha/0.2.62/core.specs.alpha-0.2.62.pom from central
Downloading: org/clojure/spec.alpha/0.3.218/spec.alpha-0.3.218.jar from central
Downloading: org/clojure/clojure/1.12.0-alpha1/clojure-1.12.0-alpha1.jar from central
Downloading: org/clojure/core.specs.alpha/0.2.62/core.specs.alpha-0.2.62.jar from central
Passes: 7999, Time: 5.000429568, Avg: 6.251318E-4, Limit: 1000000, Count: 78498, Valid: True
axvr_clj_8-bit;7999;5.000429568;1;algorithm=base,faithful=yes,bits=8
Passes: 3936, Time: 5.001792377, Avg: 0.0012707806, Limit: 1000000, Count: 78498, Valid: True
axvr_clj_1-bit;3936;5.001792377;1;algorithm=base,faithful=yes,bits=1
Passes: 3078, Time: 5.001509076, Avg: 0.0016249217, Limit: 1000000, Count: 78498, Valid: True
axvr_clj_1-bit_custom;3078;5.001509076;1;algorithm=base,faithful=yes,bits=1

Results with clojure:temurin-19-tools-deps-alpine:

Downloading: org/clojure/clojure/1.12.0-alpha1/clojure-1.12.0-alpha1.pom from central
Downloading: org/clojure/clojure/1.12.0-alpha1/clojure-1.12.0-alpha1.jar from central
Passes: 7846, Time: 5.000677124, Avg: 6.373537E-4, Limit: 1000000, Count: 78498, Valid: True
axvr_clj_8-bit;7846;5.000677124;1;algorithm=base,faithful=yes,bits=8
Passes: 1720, Time: 5.002369853, Avg: 0.0029083546, Limit: 1000000, Count: 78498, Valid: True
axvr_clj_1-bit;1720;5.002369853;1;algorithm=base,faithful=yes,bits=1
Passes: 2776, Time: 5.001448540, Avg: 0.0018016746, Limit: 1000000, Count: 78498, Valid: True
axvr_clj_1-bit_custom;2776;5.001448540;1;algorithm=base,faithful=yes,bits=1

It looks to me like it's the JDK that makes the biggest difference on one hand, and the CPU architecture does on the other. Interestingly enough, on x86_64 with OpenJDK, the BitSet approach is still faster than the custom one.

There is another thing, which is that the docker run commands seem to download dependencies. Would it be possible to make that part of the docker build stage, so that they're included in the Docker image that's built?

axvr commented 1 year ago

I've pushed a change switching it to Clojure 1.11.1, which should (for most images at least) prevent it pulling down dependencies at runtime.

I have re-ran the results with many different JDK builds and versions on my Mac and have seen some really confusing/unexpected results.

JDK 11 (Temurin + Corretto build)

Passes: 7847, Time: 5.000824000, Avg: 6.372912E-4, Limit: 1000000, Count: 78498, Valid: True
axvr_clj_8-bit;7847;5.000824000;1;algorithm=base,faithful=yes,bits=8
Passes: 3550, Time: 5.002020000, Avg: 0.0014090197, Limit: 1000000, Count: 78498, Valid: True
axvr_clj_1-bit;3550;5.002020000;1;algorithm=base,faithful=yes,bits=1
Passes: 939, Time: 5.000984000, Avg: 0.0053258617, Limit: 1000000, Count: 78498, Valid: True
axvr_clj_1-bit_custom;939;5.000984000;1;algorithm=base,faithful=yes,bits=1

JDK 8 (Corretto and Temurin build)

Passes: 10277, Time: 5.001000000, Avg: 4.866206E-4, Limit: 1000000, Count: 78498, Valid: True
axvr_clj_8-bit;10277;5.001000000;1;algorithm=base,faithful=yes,bits=8
Passes: 3729, Time: 5.001000000, Avg: 0.0013411102, Limit: 1000000, Count: 78498, Valid: True
axvr_clj_1-bit;3729;5.001000000;1;algorithm=base,faithful=yes,bits=1
Passes: 883, Time: 5.004000000, Avg: 0.0056670443, Limit: 1000000, Count: 78498, Valid: True
axvr_clj_1-bit_custom;883;5.004000000;1;algorithm=base,faithful=yes,bits=1

JDK 17+ (All builds)

Passes: 7743, Time: 5.000399000, Avg: 6.4579607E-4, Limit: 1000000, Count: 78498, Valid: True
axvr_clj_8-bit;7743;5.000399000;1;algorithm=base,faithful=yes,bits=8
Passes: 3724, Time: 5.001522000, Avg: 0.001343051, Limit: 1000000, Count: 78498, Valid: True
axvr_clj_1-bit;3724;5.001522000;1;algorithm=base,faithful=yes,bits=1
Passes: 3989, Time: 5.001497000, Avg: 0.0012538222, Limit: 1000000, Count: 78498, Valid: True
axvr_clj_1-bit_custom;3989;5.001497000;1;algorithm=base,faithful=yes,bits=1
axvr commented 1 year ago

(Updated the above message with Docker image for JDK 8.)

rbergen commented 1 year ago

I'm not exactly sure where that leaves us when it comes to this PR?

axvr commented 1 year ago

I'm not sure either. I think I'll just close this as the base image change has clearly made things worse.

That said, it was still interesting seeing how much of a difference the OS, architecture and JVM version have on performance. Really shows how unpredictable performance becomes upon all these layers of abstraction.