bsless / clj-fast

Unpredictably faster Clojure

Array cloning performance diffs #19

Closed joinr closed 3 years ago

joinr commented 3 years ago

Doing some performance stuff for a local search optimization via simulated annealing. My travels included exploring different numeric representations and trade-offs, including using copy-on-write (COW) for numeric arrays in some places vs. the persistent vector/transient vector defaults. It appears aclone is slow for some reason (unknown to me!):

(set! *unchecked-math* true)

(defn cow-update-slow [^longs arr ^long idx ^long v]
  (let [^longs res (aclone arr)]
    (aset res idx v)
    res))

user> (let [xs (long-array 15)] (c/quick-bench (cow-update-slow xs 10 2)))
             Execution time mean : 114.164498 ns
    Execution time std-deviation : 0.747223 ns
   Execution time lower quantile : 113.189737 ns ( 2.5%)
   Execution time upper quantile : 114.879358 ns (97.5%)
                   Overhead used : 11.166062 ns

(defn cow-update [^longs arr ^long idx ^long v]
  (let [^longs res (java.util.Arrays/copyOf arr (alength arr))]
    (aset res idx v)
    res))

user> (let [xs (long-array 15)] (c/quick-bench (cow-update xs 10 2)))
Evaluation count : 12961884 in 6 samples of 2160314 calls.
             Execution time mean : 35.007116 ns
    Execution time std-deviation : 0.240233 ns
   Execution time lower quantile : 34.668212 ns ( 2.5%)
   Execution time upper quantile : 35.291867 ns (97.5%)
                   Overhead used : 11.166062 ns
nil

java.util.Arrays/copyOf beats it handily, and is a hair faster (for small arrays) than a normal persistent vector's hinted assocN for a similar operation. I always thought aclone was more or less optimal... apparently not.

bsless commented 3 years ago

This is weird, I can't recreate the results. Getting the exact same performance for both cases. Which JVM and Clojure versions were you using?

joinr commented 3 years ago

Yeah, I'm seeing similar results. This (was) on Java 8 on Ubuntu's JDK, whatever variant that was. I just ran it on W10 with AdoptOpenJDK 1.8.0_222, and had like a 1 ns difference. Considering it anomalous for now (or JDK-specific...)

bsless commented 3 years ago

I ran my benchmarks on Ubuntu 20.10 with Java 15. If this was about the JDK version, 8 is considered EOL and we can happily forget about it. If you have no objections I'm closing this issue, but do reopen it if you find it crops up again

joinr commented 3 years ago

> 8 is considered EOL and we can happily forget about it.

Not so. There are many many large orgs still running Java 8 for LTS (part of the reason amazon has their own, among others). Some places (including mine) are still there for the foreseeable future. Just FYI.

bsless commented 3 years ago

Well then, I'll try to recreate

joinr commented 3 years ago

I'll re-run on the original machine, an Ubuntu machine on EC2.

bsless commented 3 years ago

Reran with:

;; CIDER 1.1.0snapshot (package: 20210416.1915), nREPL 0.8.3
;; Clojure 1.10.1, Java 1.8.0_282

Ubuntu 20.10. Got the same results again: exactly the same performance in both cases.

joinr commented 3 years ago

I just re-ran it on the same setup, and reproduced:

Execution time mean : 121.353224 ns
Execution time mean : 36.097984 ns

joinr commented 3 years ago

openjdk version "1.8.0_282"
OpenJDK Runtime Environment (build 1.8.0_282-8u282-b08-0ubuntu1~16.04-b08)
OpenJDK 64-Bit Server VM (build 25.282-b08, mixed mode)

bsless commented 3 years ago

Weird, can you print out all the options and flags the JVM ran with?

joinr commented 3 years ago

running from lein repl with default options (no :jvm-opts entry).

joinr commented 3 years ago

lein 2.9.1, not that it matters.

joinr commented 3 years ago

Also running this on an EC2 instance; could be some throttling, but I'd expect that to be amortized over the multiple runs.

bsless commented 3 years ago

The JVM still has some default configurations, such as GC, server, client or tiered etc. There's a way to print out all of them, something like https://alvinalexander.com/java/how-see-jvm-parameters-arguments-from-running-java-application/
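A minimal sketch of doing this from the REPL, using the standard `java.lang.management` API. The function name `jvm-input-args` is made up for illustration. Note that it only lists arguments passed on the command line, not every ergonomic default the JVM picks on its own.

```clojure
(import '(java.lang.management ManagementFactory))

(defn jvm-input-args
  "Returns the arguments the running JVM was started with, as a vector of strings.
  (Hypothetical helper name; the underlying call is RuntimeMXBean.getInputArguments.)"
  []
  (vec (.getInputArguments (ManagementFactory/getRuntimeMXBean))))

;; Print one argument per line, e.g. to spot -XX:TieredStopAtLevel=1
(run! println (jvm-input-args))
```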

joinr commented 3 years ago

[:arg -Dfile.encoding=UTF-8]
[:arg -XX:-OmitStackTraceInFastThrow]
[:arg -XX:+TieredCompilation]
[:arg -XX:TieredStopAtLevel=1]
[:arg -Dclojure.compile.path=/home/tom/repos/spork/target/classes]
[:arg -Dspork.version=0.2.1.4-SNAPSHOT]
[:arg -Dclojure.debug=false]

joinr commented 3 years ago

this particular repo running clojure 1.10.1

bsless commented 3 years ago

Okay, I managed to recreate. One of these is the culprit:

"-Djdk.attach.allowAttachSelf" "-XX:+UnlockDiagnosticVMOptions" "-XX:+DebugNonSafepoints"

bsless commented 3 years ago

Printed the options with and without passing JVM opts flag:

;;; no jvm opts
["-Dfile.encoding=UTF-8"
 "-XX:-OmitStackTraceInFastThrow"
 "-XX:+TieredCompilation"
 "-XX:TieredStopAtLevel=1"
 "-Dclj-fast.version=0.0.10-SNAPSHOT"
 "-Dclojure.debug=false"]

;;; jvm opts
["-Dfile.encoding=UTF-8"
 "-Djdk.attach.allowAttachSelf"
 "-XX:+UnlockDiagnosticVMOptions"
 "-XX:+DebugNonSafepoints"
 "-Dclj-fast.version=0.0.10-SNAPSHOT"
 "-Dclojure.debug=false"]

I suspect -XX:TieredStopAtLevel

joinr commented 3 years ago

with :jvm-opts ["-XX:+UnlockDiagnosticVMOptions" "-XX:+DebugNonSafepoints"]

I get 20 ns for both runs now....both in lein repl and from cider.

bsless commented 3 years ago

Okay, tested with a variety of flags. The culprit is -XX:TieredStopAtLevel. For anything under 4 you'll get the performance difference because C2 doesn't kick in. It's probably used by lein to reduce start-up time for the dev profile. I think lein run uses the dev profile. I tried passing an empty jvm-opts vector and got the same results for both, so I'm attributing this to the dev profile.
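For anyone benchmarking under lein, a sketch of a project.clj fragment that should avoid the flag. It assumes Leiningen's `^:replace` metadata on `:jvm-opts`, which discards the defaults rather than appending to them. The project name, version, and profile name `:bench` are placeholders.

```clojure
;; project.clj (fragment; name, version, and the :bench profile are placeholders)
(defproject example "0.1.0-SNAPSHOT"
  ;; ^:replace drops Leiningen's default JVM options, so -XX:TieredStopAtLevel=1
  ;; is no longer passed and C2 can compile hot code.
  :jvm-opts ^:replace []
  ;; Alternatively, keep the fast-startup defaults for day-to-day work and
  ;; override them only in a dedicated profile, used as `lein with-profile +bench repl`.
  :profiles {:bench {:jvm-opts ^:replace ["-XX:+TieredCompilation"
                                          "-XX:TieredStopAtLevel=4"]}})
```

Printing the JVM input arguments afterwards is a quick way to confirm which variant actually took effect.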

joinr commented 3 years ago

Alright, yet more jvm corner case magic to understand. If nothing else, the issue is documented for posterity.

bsless commented 3 years ago

I wouldn't call it a corner case of the JVM, more of Leiningen IMO. It's an undocumented flag which is injected when you lein run and messes with performance. It's in Lein's source code. I think I'll open an issue about it.
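Since the flag is injected silently, a benchmark namespace could check for it before trusting any numbers. A rough sketch; `tiered-stop-level` and `c2-disabled?` are hypothetical helper names, not part of clj-fast or criterium.

```clojure
(defn tiered-stop-level
  "Parses -XX:TieredStopAtLevel=N out of a seq of JVM argument strings.
  Returns N as a long, or nil when the flag is absent."
  [args]
  (some (fn [arg]
          (when-let [[_ n] (re-matches #"-XX:TieredStopAtLevel=(\d+)" arg)]
            (Long/parseLong n)))
        args))

(defn c2-disabled?
  "True when the given JVM args cap tiered compilation below level 4 (C2)."
  [args]
  (boolean (some-> (tiered-stop-level args) (< 4))))

(c2-disabled? ["-XX:+TieredCompilation" "-XX:TieredStopAtLevel=1"]) ;; => true
(c2-disabled? ["-XX:+UnlockDiagnosticVMOptions" "-XX:+DebugNonSafepoints"]) ;; => false
```

To check the live JVM, pass it `(.getInputArguments (java.lang.management.ManagementFactory/getRuntimeMXBean))` and warn or bail out when it returns true.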

bsless commented 3 years ago

I think I'll end up adding a few notes to the readme; there's nothing more I can do in the meantime.

bsless commented 3 years ago

This is the best I could do in the meantime: https://github.com/bsless/clj-fast/commit/bba3ec09a29b9929d523dbff8d57242fe3fa285f

Edit suggestions?

joinr commented 3 years ago

No, good writeup (and nugget of opt knowledge). Feel free to close.