VirtusLab / scala-cli

Scala CLI is a command-line tool to interact with the Scala language. It lets you compile, run, test, and package your Scala code (and more!)
https://scala-cli.virtuslab.org
Apache License 2.0

Provide some support for startup-time optimization #2534

Open · Krever opened this issue 10 months ago

Krever commented 10 months ago

Problem description
scala-cli is a great enabler for writing Python-like CLI apps. In some scenarios, where code changes are frequent enough, it is more convenient to rely on scala-cli run and distribute the app as source code rather than as a binary. It would be nice to have some dedicated support for this use case.

Describe the solution you'd like
The bare minimum would be to have a documentation page describing best practices for optimizing scala-cli run for startup time.

The things I have used so far:

On top of that, I expect there are some further tweaks you could suggest.

Describe alternatives you've considered

Additional context
I've run a few very unscientific benchmarks using my app. Each run was repeated 20 times. The app does nothing at startup beyond command-line parsing, but it pulls in all of my dependencies (so the measurement isn't really isolated). I used the following command:

scala-cli run --jvm myjvm myapp --power --offline -J myflag

And here are my results

* baseline (temurin:1.17.0.9) - ~2996 ms
* baseline (temurin:1.21.0.1) - ~3147 ms
* baseline (graalvm-java21:21.0.1) - ~3281 ms
* baseline (graalvm-java17:22.3.3) - ~2900 ms

* -Xshare:on (temurin:1.17.0.9) - ~2805 ms
* -XX:TieredStopAtLevel=1 (temurin:1.17.0.9) - ~2205 ms
* -XX:CICompilerCount=2 (temurin:1.17.0.9) - ~2623 ms
* -Xmx128m (temurin:1.17.0.9) - ~2642 ms

* all four flags above (temurin:1.17.0.9) - ~2238 ms
* -XX:TieredStopAtLevel=1 -Xmx128m (temurin:1.17.0.9) - ~2244 ms

* CDS (-XX:+AutoCreateSharedArchive) (temurin:1.21.0.1) - ~2511 ms
* CDS (-XX:+AutoCreateSharedArchive) + -XX:TieredStopAtLevel=1 -Xmx128m (temurin:1.21.0.1) - ~2252 ms
* -XX:TieredStopAtLevel=1 (temurin:1.21.0.1) - ~2232 ms
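
For reference, since the point is to distribute the app as sources, the fastest flag combination above can also travel with the code itself via using directives, so that a plain scala-cli run picks it up without extra arguments. A minimal sketch, assuming the standard //> using jvm and //> using javaOpt directives (quoting and directive aliases may vary slightly between scala-cli versions):

//> using jvm "temurin:1.17.0.9"
//> using javaOpt "-XX:TieredStopAtLevel=1"
//> using javaOpt "-Xmx128m"

// placeholder entry point; the real app would do its argument parsing here
@main def app(): Unit = println("hello")

The command-line equivalent is to pass the same options explicitly, e.g. scala-cli run myapp --jvm temurin:1.17.0.9 -J -XX:TieredStopAtLevel=1 -J -Xmx128m.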
Gedochao commented 10 months ago

I wonder if --offline is a good idea... unless you pre-download Bloop somewhere else. If Bloop isn't available, it'd fall back to running the compiler directly, which definitely won't be faster.

Thanks for the suggestion, this definitely deserves a separate guide or cookbook with a set of best practices. I don't have a ready list of things to suggest, but we definitely should have one. We'll look into it.

Also, your suggested set of options already looks pretty good (except for --offline, for the reasons listed above). If you have the time to spike more on this and write it down into a doc, it would be a great contribution already.
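
For the record, a minimal sketch of making --offline safe by warming the workspace up once while online, so that dependencies and Bloop are already cached; the exact commands here are an assumption and would need adjusting to the project:

# one online pass: resolves dependencies and fetches/starts the Bloop build server
scala-cli compile --power myapp

# later runs can stay offline and skip resolution
scala-cli run --power --offline myapp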

Krever commented 10 months ago

I wonder if --offline is a good idea... unless you pre-download Bloop somewhere else. If Bloop isn't available, it'd fall back to running the compiler directly, which definitely won't be faster.

To make a successful run, I have to do an online call first (to get my dependencies). I assume this is enough to get Bloop in place.

If you have the time to spike more on this and write it down into a doc, it would be a great contribution already.

I will probably not do that for a few different reasons:

  • I'm not an expert on the JVM, and I'd rather not pretend I know what I'm talking about. My investigation was based on "let me spend 3 hours on the weekend and see what's possible through cold googling". I'd rather have someone who knows their way around the JVM contribute here.
  • I'm not an expert on benchmarking. Whatever we put in place should be verified on at least some popular architectures, and the benchmarking methodology should be reviewed. It's probably good to also save the benchmarking infra somewhere so the results can be reproduced (I didn't save mine 😓). I just ran a bunch of commands and measured the time on my machine.
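
(For reference, a reproducible setup for this kind of measurement could be scripted around a benchmarking tool such as hyperfine; the tool choice, warmup count, and flag set below are illustrative assumptions rather than the setup actually used above:)

# 3 warmup runs, then 20 measured runs of roughly the invocation from the benchmarks
hyperfine --warmup 3 --runs 20 \
  'scala-cli run myapp --power --offline --jvm temurin:1.17.0.9 -J -XX:TieredStopAtLevel=1 -J -Xmx128m'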

I will add to the initial post one more avenue I investigated in the meantime: CDS (class data sharing). It gave some results on Java 21, but nothing beyond what tiered compilation achieved on its own.
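
(For context, the CDS numbers above correspond to the JDK 19+ mechanism for automatically creating and reusing a shared archive on subsequent runs; a rough shape of the invocation, where the archive file name is an arbitrary example:)

scala-cli run myapp --power --jvm temurin:1.21.0.1 \
  -J -XX:+AutoCreateSharedArchive -J -XX:SharedArchiveFile=myapp.jsa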

On a more general note: I hoped a bit that there would be something in scala-cli that could help make it faster. But I know exactly nothing about scala-cli internals.

plokhotnyuk commented 10 months ago

Here is how the Scala Native team uses async-profiler to profile code generation, optimization, and linking.

A similar approach could probably be used for scala-cli in power mode to profile the compilation and running of scripts on the JVM.
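
(For illustration only: async-profiler attaches to a JVM as an agent, so the run side could be profiled roughly as below; the library path and agent options are assumptions, and profiling the compilation side would mean attaching the agent to the Bloop/compiler process instead:)

scala-cli run myapp --power --jvm temurin:1.17.0.9 \
  -J -agentpath:/path/to/libasyncProfiler.so=start,event=cpu,file=run-profile.html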

Gedochao commented 10 months ago

I wonder if --offline is a good idea... unless you pre-download Bloop somewhere else. If Bloop isn't available, it'd fall back to running the compiler directly, which definitely won't be faster.

To make a successful run, I have to do an online call first (to get my dependencies). I assume this is enough to get Bloop in place.

Yes, that much is true. Just noting that unless you prepare the workspace ahead of time, --offline might accidentally make things slower rather than faster, e.g. if Bloop isn't available.

If you have the time to spike more on this and write it down into a doc, it would be a great contribution already.

I will probably not do that for a few different reasons:

  • I'm not an expert on the JVM, and I'd rather not pretend I know what I'm talking about. My investigation was based on "let me spend 3 hours on the weekend and see what's possible through cold googling". I'd rather have someone who knows their way around the JVM contribute here.
  • I'm not an expert on benchmarking. Whatever we put in place should be verified on at least some popular architectures, and the benchmarking methodology should be reviewed. It's probably good to also save the benchmarking infra somewhere so the results can be reproduced (I didn't save mine 😓). I just ran a bunch of commands and measured the time on my machine.

You're generally always a Discord message away from having a couple of pairs of eyes look at a draft, and contributions are always welcome. Just sayin', no pressure.

On a more general note: I hoped a bit that there would be something in scala-cli that could help make it faster. But I know exactly nothing about scala-cli internals.

I don't think we have anything built in to speed things up out of the box at this time, but once we establish some best practices, perhaps we could aggregate the setup under a dedicated option.