Big slowdown on Java 16

plexus commented 3 years ago

Comparing Java 11 to Java 16, it seems on the latter MrAnderson is an order of maginitude slower than on the former.

$ sdk use java 11.0.10.hs-adpt
$ /usr/bin/time lein test
73.42user 1.79system 0:28.42elapsed 264%CPU (0avgtext+0avgdata 926716maxresident)k
27088inputs+37504outputs (126major+262412minor)pagefaults 0swaps

$ sdk use java 16.0.1.hs-adpt 
$ /usr/bin/time lein test
700.25user 1363.14system 5:06.12elapsed 674%CPU (0avgtext+0avgdata 393300maxresident)k
0inputs+18000outputs (157major+156929minor)pagefaults 0swaps

So 5 minutes instead of 28 seconds. A cider-nrepl build is taking half an hour on my machine.

Not sure if this is a red herring, but on the latter I'm getting this warning

OpenJDK 64-Bit Server VM warning: Options -Xverify:none and -noverify were deprecated in JDK 13 an
d will likely be removed in a future release.

benedekfazekas commented 3 years ago

thanks @plexus for letting me know, will have a look

plexus commented 3 years ago

Thank you @benedekfazekas! Observing a little with jstack I noticed a lot of clojure-agent-send-off-pool threads, over 200, with up to ~100 RUNNABLE at once. Perhaps worth looking at how futures are used.

benedekfazekas commented 3 years ago

was thinking about that and maybe check if need to upgrade parallel

vemv commented 3 years ago

From what I see parallel is only used for a single p/pmap call. Perhaps it can be simplified to a mere pmap?

It's what I use in friend projects like refactor-nrepl, Orchard etc.

core/pmap is smarter than it might seem - it doesn't process more items than CPUs are available, and reuses threads from a pool so one isn't constantly spinning new threads for inputs consisting of many items.

The only caveat is that it's lazy, so if going this route I'd recommend wrapping the pmap in doall or such.

vemv commented 3 years ago

And as Arne points out there's future usage as well.

That's where things seem to go awry, future (defn replace-ns-symbol) is ultimately called inside a pmap (defn replace-ns-symbol-in-source-files), so one gets a thread fest.

As a superficial impression/suggestion, I'd remove futures making those computations sync, and let the surrounding pmap take care of the parallelism. That way one gets rid of mostly-redundant thread spawning and bounds the thread count by CPU count.

If this sounds sensible I can PR these!

benedekfazekas commented 3 years ago

thanks for the analysis @vemv happy to get a PR on this

vemv commented 3 years ago

We simplified/reduced parallelism in https://github.com/benedekfazekas/mranderson/pull/60 .

No slowness could be repro'd in CI, however we kept seeing it in our laptops. Could be a macOS thing assuming you are also on macs?

bbatsov commented 2 years ago

Any updates here? I got reminded about this while trying to deploy https://github.com/clojure-emacs/cider-nrepl/releases/tag/v0.27.3 I'm using Ubuntu, btw. (via WSL)

bbatsov commented 2 years ago

Btw, might be good to cut a release with the improvement @vemv has already done.

benedekfazekas commented 2 years ago

not since @vemv 's latest comment. maybe would make sense to move creating the release candidate on CI.

release: I have two pending issues I want to solve/close before the next release. maybe this year.

bbatsov commented 2 years ago

Fair enough.

lread commented 2 years ago

FWIW, I experienced the slowness when running refactor-nrepl v3.2.2 (which now references 0.5.4-SNAPSHOT) tests on macOS 10.15.7 with JDK17.

$ java --version
openjdk version "17.0.2" 2022-01-18
OpenJDK Runtime Environment Temurin-17.0.2+8 (build 17.0.2+8)
OpenJDK 64-Bit Server VM Temurin-17.0.2+8 (build 17.0.2+8, mixed mode, sharing)

Switching to JDK11:

$ java --version                                                                                                                                                                    
openjdk 11.0.13 2021-10-19
OpenJDK Runtime Environment Temurin-11.0.13+8 (build 11.0.13+8)
OpenJDK 64-Bit Server VM Temurin-11.0.13+8 (build 11.0.13+8, mixed mode)

Made things nice and speedy again.

lread commented 2 years ago

Ok, I'm curious about this one. I'll use refactor-nrepl v3.5.5 as my guinea pig. It uses the MrAnderson 0.5.4-SNAPSHOT release.

Here's what I'll run:

lein clean
rm -f .inline-deps
/usr/bin/time lein with-profile -user,+1.10 inline-deps

(not sure that the rm -rf is relevant, but refactor-nrepl does it, so I'll do it too).

I'm on macOS 10.15.7, 3.5 GHz Quad-Core Intel Core i7, 32gb RAM. I am using skdman to choose JDKs.

sdk use java	timing results	human friendly elapsed	judgement
`11.0.16.1-tem`	`98.74 real 430.80 user 11.60 sys`	1m39s	good
`17.0.4.1-tem`	`82.93 real 344.64 user 14.26 sys`	1m23s	good
`17.0.2-tem`	`3456.73 real 5179.71 user 5256.95 sys`	57m37s	eek!
`17.0.3-tem`	`4219.20 real 5085.88 user 3904.21 sys`	1h10m19s	eek!

So whatever the cause of the slowdown, it does not exhibit itself for me on the current JDK17 release.

lread commented 2 years ago

One guess on the cause is this JDK bugfix, which is a backport of this JDK bugfix.

benedekfazekas commented 2 years ago

nice one. specially finding the jdk issues. can you check if setting mranderson.internal.no-parallelism sys property makes any difference on the poorly performing jdks?

suppose we can just add a note to the readme to use jdk 11 or rather > 17.0.4.1 and mention the above sys property if it helps.

lread commented 2 years ago

Parallelism Disabled

Good idea! A repeat of tests with parallelism disabled:

lein clean
rm -f .inline-deps
LEIN_JVM_OPTS=-Dmranderson.internal.no-parallelism=true /usr/bin/time lein with-profile -user,+1.10 inline-deps

sdk use java	timing results	human friendly elapsed	judgement
`11.0.16.1-tem`	`144.86 real 164.35 user 9.88 sys`	2m25s	slower, but good
`17.0.4.1-tem`	`122.72 real 133.24 user 11.10 sys`	2m3s	slower, but good
`17.0.2-tem`	`124.16 real 135.05 user 11.84 sys`	2m4s	well, well!
`17.0.3-tem`	`127.53 real 137.23 user 11.88 sys`	2m8s	huh!

So the parallelism is greatly adversely affecting the buggy JDKs.

Parallelism Enabled (sanity rerun)

Just to make sure, I'll rerun one of the prior tests with parallelism enabled:	sdk use java	timing results	human friendly elapsed	judgement
`17.0.3-tem`	`3005.35 real 5028.85 user 8029.54 sys`	50m5s	reproducible eek!

Options

Some options/ideas:

Like you suggested, document in README
Automatically default parallelism to off for known problematic JDK versions.
Turf parallelism entirely, is the performance gain worth the complexity?
Consider something with more control for parallelism. I wonder if we might be able to compensate for the buggy JDKs by throttling back parallelism a bit? Maybe experiment with claypool or something similar?
Look at alternate ways to optimize. I've not looked the code deeply at all yet, maybe there we could be doing things differently?

I think option 1 is a no-brainer. Option 2 should be relatively easy and avoid support questions. Options 4 and 5 intrigue my curiosity.

lread commented 2 years ago

I'm getting a better understanding of the work that MrAnderson is doing.

It is repeating a lot of file slurp and parsing that maybe it does not have to.

Idea 1. I'm thinking it would be interesting to explore analyzing all that work needs to be done first, then applying that work in one pass (or a few if that makes more sense) of the source tree.

Idea 2. It might also be interesting to explore employing clj-kondo data analysis. It could tell us what namespaces are used where. It might not help with the imports much (or at all), but I think they are already handled pretty efficiently.

I'll start exploring idea 1. Idea 2 looks interesting but does bring in more deps.

benedekfazekas commented 2 years ago

I guess the keys fns here to look at are mranderson.move/replace-ns-symbol and mranders.move/update-file. This ns is relatively well documented and yes there are many runs as mranderson needs go through the dependency tree and replace references to deps vars again and again in the deps files it already processed for a previous dependency. maybe there is a way to simplify this (therefore making it more effective) with clj-kondo

lread commented 2 years ago

Even without clj-kondo, I think we could work out the list of renames we need to make first, then apply them all at once to each file. Right now I think renames are applied one at time to each file.

I'm thinking that instead of this pseudo-code: (for [r renames f files] (-> f slurp parse (update-with r) spit))

We would instead (for [f files] (-> f slurp parse (update-with renames) spit))

Which would mean a lot less slurping parsing and spitting (somebody who doesn't know Clojure would now be grossed out!).

I'm guessing this would probably make the pmap unnecessary.

benedekfazekas commented 2 years ago

slurp, spit: 🤣 ah maybe I misunderstood you previously as I thought you meant the new rounds of processing for every dependency. this might work and sounds the right thing to do.

lread commented 2 years ago

Yeah, I don't think I was entirely clear. Pseudocode helps!

I'll give idea 1 a whirl and learn what, if any, downsides there are to this approach.

lread commented 2 years ago

I got distracted by some other work, I am looking forward to starting back on this very soon.

benedekfazekas commented 2 years ago

thanks, all your contributions here are appreciated already, following rewrite-clj i realise you are quite busy elsewhere too

benedekfazekas / mranderson