Closed - JaroslavTulach closed this issue 8 months ago.
The additional startup time seems to be caused by the JVM boot module layer initialization. To prove that, I have designed an experiment at https://gist.github.com/Akirathan/9c30e10ec525eb09c91c3c94e2c36f40 that repeatedly invokes

```
java --module-path <component-directory> -m org.enso.runtime/EngineBootLoader --help
```

(i.e. `enso --help`), which does not do any polyglot stuff; it just loads a bunch of classes and prints the help. There are some comments in the experiment explaining the results.

In other words: if we exclude `graalpy-language-23.1.0.jar` and its dependencies from `--module-path`, the startup time is 852 ms; if we include graalpy, the startup time is 1178 ms.
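For reference, a minimal sketch of what such a measurement loop can look like (the gist's actual script may differ; the command and the `<component-directory>` placeholder are taken from above):

```java
// Hypothetical re-implementation of the gist's timing loop (the real script
// may differ): run the boot command N times and report the average wall-clock
// time per invocation. <component-directory> is a placeholder, as above.
import java.util.concurrent.TimeUnit;

public class StartupTimer {
    public static void main(String[] args) throws Exception {
        String[] cmd = {
            "java", "--module-path", "<component-directory>",
            "-m", "org.enso.runtime/EngineBootLoader", "--help"
        };
        int iterations = 10;
        long totalNanos = 0;
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            Process p = new ProcessBuilder(cmd)
                .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                .redirectError(ProcessBuilder.Redirect.DISCARD)
                .start();
            p.waitFor();
            totalNanos += System.nanoTime() - start;
        }
        System.out.printf("avg startup: %d ms%n",
            TimeUnit.NANOSECONDS.toMillis(totalNanos / iterations));
    }
}
```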
As noted in https://gist.github.com/Akirathan/9c30e10ec525eb09c91c3c94e2c36f40?permalink_comment_id=4808928#gistcomment-4808928, the total size of all the GraalPy Jars is 201 MB. The GraalPy Jars that are additionally included are:

```
bcpkix-jdk18on-1.76.jar, bcprov-jdk18on-1.76.jar, bcutil-jdk18on-1.76.jar,
insight-tool-23.1.0.jar, llvm-api-23.1.0.jar, python-language-23.1.0.jar,
python-resources-23.1.0.jar, truffle-nfi-23.1.0.jar, truffle-nfi-libffi-23.1.0.jar,
xz-1.9.jar
```
How can we inspect what the JVM does with these Jars during boot? Please help @JaroslavTulach.
Pavel Marek reports a new STANDUP for yesterday (2023-12-27):
Progress: - Experimenting and measuring JVM startup
Pavel Marek reports a new STANDUP for today (2023-12-28):
Progress: - Format also Java sources in runtime/bench - #8638
- EqualsBenchmarks - #8636

> How can we inspect what the JVM does with these Jars during boot?

Please use the `--profiling-path` option.
What happens during the startup? I am not sure, but one of the hypotheses is: Python unpacks its resources. If so, then the solution would be somewhere along this copyResources method and the properties it mentions, `polyglot.engine.resourcePath` & co.
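If that hypothesis held, the workaround would be to unpack the resources once ahead of time and point subsequent runs at them. A minimal sketch, assuming the `Engine.copyResources(Path, String...)` API from the method linked above and a hypothetical target directory:

```java
// Sketch only: pre-extract polyglot language resources once so that startup
// does not have to unpack them. The target path and the "python" component
// id are assumptions for illustration.
import java.nio.file.Path;
import org.graalvm.polyglot.Engine;

public class PreExtractResources {
    public static void main(String[] args) throws Exception {
        Path target = Path.of("/opt/enso/polyglot-resources"); // hypothetical location
        Engine.copyResources(target, "python");
        // Later runs would then pass:
        //   -Dpolyglot.engine.resourcePath=/opt/enso/polyglot-resources
    }
}
```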
> How can we inspect what the JVM does with these Jars during boot?
>
> Please use the `--profiling-path` option.
`--profiling-path` is only usable when you actually run some Enso code. If you just run `enso --help` (or `enso --version`), the profiler will not even start, resulting in an empty profiling file. The thing is that the slowdown is in the JVM initialization, without any `org.graalvm.polyglot.Context`.
I have a strong suspicion that the slowdown happens before the `main` method is executed.
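One quick way to check that suspicion is a standalone class (a sketch, not Enso code) that prints the JVM uptime at the first line of `main`; everything before that point is JVM and module-layer initialization:

```java
// Prints how long the JVM had been running when main() was entered.
// RuntimeMXBean.getUptime() is measured from JVM start, so at the first
// line of main() it approximates the pre-main initialization cost.
import java.lang.management.ManagementFactory;

public class PreMainTime {
    public static void main(String[] args) {
        long uptimeMs = ManagementFactory.getRuntimeMXBean().getUptime();
        System.out.println("time from JVM start to main(): " + uptimeMs + " ms");
    }
}
```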
I have decided to do additional experiments with a combination of `-Xlog` flags (https://docs.oracle.com/en/java/javase/21/docs/specs/man/java.html#enable-logging-with-the-jvm-unified-logging-framework). All these experiments run just `enso --help`:

- `-Xlog:class+load=trace` to look into all the loaded classes. There doesn't seem to be a significant difference in how many classes are loaded: 3011 vs 3116. No classes from any graalpy module or its dependencies are loaded, obviously.
- `-Xlog:module+load=trace` to look into the module loading. The number of loaded modules differs, obviously. The duration of loading `python-language`, `python-resources`, or any other graalpy-related module is negligible. The log files, generated with something like

```
java --module-path <comp-dir> -Xlog:module+load=trace:file=small-modules.txt:uptime,timenanos,level,tags -m org.enso.runtime/org.enso.EngineRunnerBootLoader --help
```

are small-modules.txt and big-modules.txt.
In these log files, we can see that the JVM loads the `java.base` module as the very first module and that there is a delay before the second module is loaded. This delay differs between the two runs: in small-modules.txt, the uptime at which the second module is loaded is 0.333 s, while in big-modules.txt it is 0.555 s. So what happens after the `java.base` module is loaded and before the second module is loaded, and how do we find that out? The difference is 220 ms!
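One way to locate such gaps mechanically is to compare the uptime decorations of consecutive `module,load` lines. A small sketch, assuming the uptime decoration is the first bracketed field on each log line, as produced by the command above:

```java
// Reports unusually long gaps between consecutive module,load log lines.
// Assumes lines of the form: [0.333s][info][module,load] java.base location: ...
// (the "uptime" decoration first); the 0.05 s threshold is arbitrary.
import java.nio.file.Files;
import java.nio.file.Path;

public class ModuleLoadGaps {
    public static void main(String[] args) throws Exception {
        double previous = -1;
        for (String line : Files.readAllLines(Path.of(args[0]))) {
            if (!line.contains("module,load")) continue;
            double uptime = Double.parseDouble(
                line.substring(1, line.indexOf("s]")));
            if (previous >= 0 && uptime - previous > 0.05) {
                System.out.printf("gap of %.3f s before: %s%n",
                    uptime - previous, line);
            }
            previous = uptime;
        }
    }
}
```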
> What happens during the startup? I am not sure, but one of the hypotheses is: Python unpacks its resources. If so, then the solution would be somewhere along this copyResources method and the properties it mentions, `polyglot.engine.resourcePath` & co.
As mentioned in my previous comment, I am running these experiments with just `enso --help`, so no `org.graalvm.polyglot.Context` gets initialized and neither the `EnsoLanguage` nor the `PythonLanguage` class gets loaded. The slowdown happens somewhere in the JVM internals during boot, before the `main` method is executed.
> I am running these experiments with just `enso --help`.
The real benchmarks can be executed as:

```
./built-distribution/enso-engine-0.0.0-dev-linux-amd64/enso-0.0.0-dev/bin/enso --run test/Benchmarks/ Startup
```

If you want to run exactly what the "empty_startup" benchmark runs, then do:

```
./built-distribution/enso-engine-0.0.0-dev-linux-amd64/enso-0.0.0-dev/bin/enso --run test/Benchmarks/src/Startup/Empty_World.enso
```
If you add `--profiling-path slow.npss` to the above command, you'll see what the JVM does. I am attaching two `.npss` files, FastAndSlow.zip, for your convenience. They differ by 400 ms. Now the question is why and where they differ.
The slow.npss file seems to spend 100 ms in JAR verification; the `Module` constructor seems to be slower. This is probably caused by having some signed JAR in the system: the JDK tries to verify it, and that slows things down.
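To check which JARs on the module path are actually signed, one can scan them for signature files under `META-INF`. A minimal sketch (the extensions below are the common ones, not an exhaustive list):

```java
// Scans each JAR given on the command line for META-INF signature files
// (.SF/.DSA/.RSA) and reports whether it appears to be signed.
import java.util.jar.JarFile;

public class FindSignedJars {
    public static void main(String[] args) throws Exception {
        for (String path : args) {
            try (JarFile jar = new JarFile(path)) {
                boolean signed = jar.stream().anyMatch(e -> {
                    String n = e.getName().toUpperCase();
                    return n.startsWith("META-INF/")
                        && (n.endsWith(".SF") || n.endsWith(".DSA") || n.endsWith(".RSA"));
                });
                System.out.println(path + (signed ? " is signed" : " is not signed"));
            }
        }
    }
}
```

Run it as, e.g., `java FindSignedJars <component-directory>/*.jar`.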
> The slow.npss file seems to spend 100 ms in JAR verification; the `Module` constructor seems to be slower. This is probably caused by having some signed JAR in the system: the JDK tries to verify it, and that slows things down.
I can confirm this observation. The signed JARs are `bcpkix-jdk18on-1.76.jar`, `bcprov-jdk18on-1.76.jar`, and `bcutil-jdk18on-1.76.jar`. These JARs were added as separate modules in #8351. Moreover, the `org.enso.runtime` module has a declared dependency on them in https://github.com/enso-org/enso/blob/21d164ec3eb37b9eff0e030146dff8963de31ca8/engine/runtime-fat-jar/src/main/java/module-info.java#L11, because we use them for a simple SHA-1 digest computation in `org.enso.interpreter.caches.ModuleCache`. We already have a task to remove this dependency - https://github.com/enso-org/enso/issues/8362.
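For context, the digest computation itself does not need Bouncy Castle at all; a plain-JDK sketch of what the replacement could look like (not the actual change tracked in that issue):

```java
// SHA-1 digest with the JDK only: java.security.MessageDigest ships with
// every JRE, so no org.bouncycastle module is needed for this purpose.
import java.security.MessageDigest;
import java.util.HexFormat;

public final class Sha1Digest {
    public static String sha1Hex(byte[] data) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        return HexFormat.of().formatHex(md.digest(data));
    }
}
```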
I have also tried to manually remove the signatures from these Jar archives, but failed to do so. I have tried to remove the checksum entries from `META-INF/MANIFEST.MF` as well as the `*.DSA` and `*.SF` files from `META-INF`, but enso still failed with a `ModuleInitializationError`.
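For the record, that stripping attempt corresponds roughly to the following sketch: copy the JAR while dropping the `META-INF` signature files and clearing the per-entry digest attributes from the manifest. As noted above, this alone did not make the module load again, so treat it as illustration, not a working fix:

```java
// Copies args[0] to args[1], dropping META-INF signature files and clearing
// the per-entry digest attributes from the manifest.
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Collections;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

public class StripJarSignature {
    public static void main(String[] args) throws IOException {
        try (JarFile in = new JarFile(args[0])) {
            Manifest mf = new Manifest(in.getManifest());
            mf.getEntries().clear(); // per-entry sections hold the *-Digest attributes
            try (JarOutputStream out =
                     new JarOutputStream(new FileOutputStream(args[1]), mf)) {
                for (JarEntry e : Collections.list(in.entries())) {
                    String n = e.getName().toUpperCase();
                    if (n.equals("META-INF/MANIFEST.MF")
                        || (n.startsWith("META-INF/")
                            && (n.endsWith(".SF") || n.endsWith(".DSA") || n.endsWith(".RSA")))) {
                        continue; // manifest is rewritten; signature files are dropped
                    }
                    out.putNextEntry(new JarEntry(e.getName()));
                    in.getInputStream(e).transferTo(out);
                }
            }
        }
    }
}
```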
Previously (before #8351), the bouncycastle modules used to be assembled into the `runtime.jar` fat jar, so there was no signature checking by the JVM.
Pavel Marek reports a new STANDUP for today (2024-01-02):
Progress: - The cause of the slowdown is most likely the introduction of signed Jar archives (org.bouncycastle).
- `org.bouncycastle.provider` - https://github.com/enso-org/enso/issues/8362
It should be finished by 2024-01-05.

`org.bouncycastle` dependencies were removed in #8664, but did not provide any benchmark improvements.
Pavel Marek reports a new STANDUP for today (2024-01-04):
Progress: - Investigating other problems with benchmarks in #8668
- Removed `org.bouncycastle` from the runtime dependencies (#8664). Didn't help with the startup time though.
It should be finished by 2024-01-05.

> If you add `--profiling-path slow.npss` to the above command, you'll see what the JVM does. I am attaching two `.npss` files, FastAndSlow.zip, for your convenience. They differ by 400 ms. Now the question is why and where they differ.
Here is a 3rd ZIP with a t.npss file. It runs in 1800 ms - i.e. right in the middle of the FastAndSlow.zip range. It looks like the #8664 fix helped to give us 50% of the time back. We still need to find the next 50%.
Also reported: another startup initialization issue was revealed at https://github.com/enso-org/enso/issues/8690. It's not related to GraalPy, though.
> `org.bouncycastle` dependencies were removed in #8664, but did not provide any benchmark improvements.
Edit: Removing `org.bouncycastle` helped the startup by 10%. Let's close this issue and continue with #8690.
Pavel Marek reports a new STANDUP for today (2024-01-05):
Progress: - `language-server/test` still reassembles the runtime fat jar.
- Migrating `SQLlite_Spec` to the new test builder API.
It should be finished by 2024-01-05.
The stdlib startup benchmarks indicate a startup time regression for:
- `Startup_empty_startup` - 16%
- `Startup_hello_world_startup` - 12%

The regression happened between Dec 1 and Dec 4, with the only two integrations being:
- #8432
- #8351

Given that #8432 is only a test change, the slowdown is probably related to #8351. Why? Python isn't used during the startup at all - i.e. there is no reason for it to slow the startup down!