enso-org / enso

Enso Analytics is a self-service data prep and analysis platform designed for data teams.
https://ensoanalytics.com
Apache License 2.0
7.37k stars 323 forks source link

Use SnakeYAML for YAML parsing (instead of circe) #9113

Closed hubertp closed 4 months ago

hubertp commented 8 months ago

Follow up on https://github.com/enso-org/enso/issues/8692 and a related PR.

We use a lot of circe which has a high startup cost upfront due to class loading. Behind the scenes circe uses SnakeYAML engine for YAML parsing and we could potentially cut some startup time by using it directly with Java classes (won't work with plain Scala case classes).

JaroslavTulach commented 6 months ago

In addition to startup cost, we should also investigate independence on java.desktop JDK module. Alas SnakeYAML seems to require java.desktop:

enso$ rm -rf jdk; ~/bin/graalvm/bin/jlink --output jdk --add-modules java.net.http,jdk.unsupported,java.sql,jdk.management,jdk.jfr; ./jdk/bin/java --list-modules

# remove Python, JavaScript, regex:

enso$ rm ./built-distribution/enso-engine-0.0.0-dev-linux-amd64/enso-0.0.0-dev/component/python-* ./built-distribution/enso-engine-0.0.0-dev-linux-amd64/enso-0.0.0-dev/component/js-language-24.0.0.jar ./built-distribution/enso-engine-0.0.0-dev-linux-amd64/enso-0.0.0-dev/component/icu4j-24.0.0.jar ./built-distribution/enso-engine-0.0.0-dev-linux-amd64/enso-0.0.0-dev/component/regex-24.0.0.jar ./built-distribution/enso-engine-0.0.0-dev-linux-amd64/enso-0.0.0-dev/component/bc*

enso$ ./built-distribution/enso-engine-0.0.0-dev-linux-amd64/enso-0.0.0-dev/bin/enso --run test/Base_Tests/Exception in thread "main" java.lang.reflect.InvocationTargetException
        at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:118)
        at java.base/java.lang.reflect.Method.invoke(Method.java:580)
        at org.enso.runtime/org.enso.EngineRunnerBootLoader.main(EngineRunnerBootLoader.java:46)
Caused by: java.lang.NoClassDefFoundError: java/beans/IntrospectionException
        at org.yaml.snakeyaml.constructor.BaseConstructor.getPropertyUtils(BaseConstructor.java:645)
        at org.yaml.snakeyaml.constructor.Constructor.<init>(Constructor.java:60)
        at org.yaml.snakeyaml.Yaml.<init>(Yaml.java:82)
        at io.circe.yaml.Parser.$anonfun$parseSingle$1(Parser.scala:39)
        at io.circe.yaml.parser.package$.parse(package.scala:18)
        at org.enso.pkg.Config$.fromYaml(Config.scala:273)
        at scala.util.Using$.apply(Using.scala:113)
        at org.enso.pkg.PackageManager.readConfig$1(Package.scala:340)
        at org.enso.pkg.PackageManager.loadPackage(Package.scala:346)
        at org.enso.runner.Main.run(Main.java:694)

Isn't there some other, more modern library that would work with JavaBeans dependency?

hubertp commented 6 months ago

Isn't there some other, more modern library that would work with JavaBeans dependency?

Looks like YamlBeans is the only alternative.

JaroslavTulach commented 6 months ago

Looks like SnakeYML is ready to run on Android and avoid usage of java.beans package altogether. As such it should run without java.desktop too - it is just a matter of configuring it.

Note: Of course, such a configuration may require changes in upstream project to allow it, but maybe not. The construction stack is:

"main"
    at org.yaml.snakeyaml.introspector.PropertyUtils.<init>(PropertyUtils.java:45)
    at org.yaml.snakeyaml.constructor.BaseConstructor.getPropertyUtils(BaseConstructor.java:645)
    at org.yaml.snakeyaml.constructor.BaseConstructor.addTypeDescription(BaseConstructor.java:664)
    at org.yaml.snakeyaml.constructor.Constructor.<init>(Constructor.java:134)
    at org.yaml.snakeyaml.constructor.Constructor.<init>(Constructor.java:79)
    at org.yaml.snakeyaml.constructor.Constructor.<init>(Constructor.java:60)
    at org.yaml.snakeyaml.Yaml.<init>(Yaml.java:82)
    at io.circe.yaml.Parser.$anonfun$parseSingle$1(Parser.scala:39)
    at io.circe.yaml.Parser$$Lambda/0x00007fdb0c433680.apply
    at cats.syntax.EitherObjectOps$.catchNonFatal$extension(either.scala:391)
    at io.circe.yaml.Parser.parseSingle(Parser.scala:39)
    at io.circe.yaml.Parser.parse(Parser.scala:29)
    at io.circe.yaml.parser.package$.parse(package.scala:18)
    at org.enso.distribution.config.GlobalConfigurationManager$(GlobalConfigurationManager.scala:88)
    at org.enso.distribution.config.GlobalConfigurationManager.getConfig(GlobalConfigurationManager.scala:28)
    at org.enso.editions.updater.EditionManager$.makeEditionProvider(EditionManager.scala:55)

if we remove io.circe.yaml, then we need to invoke Yaml(BaseConstructor constructor) ourselves and provide instance of PropertyUtils that would have setBeanAccess to FIELD.

JaroslavTulach commented 5 months ago

The pull request

comes with a workaround around current limitations of SnakeYaml. There is a PR updating the upstream project

once it is integrated we'll have to update to new version of SnakeYaml (probably 2.x). I am not sure whether io.circe can work with 2.x versions or if they require 1.x version. Should there be any incompatibilities, then the simplest solution to using latest SnakeYaml with the fix is to get rid of io.circe.

JaroslavTulach commented 5 months ago

SnakeYaml fix got merged:

It will be delivered in version 2.3 https://bitbucket.org/snakeyaml/snakeyaml/wiki/Changes

Please:

snakeyaml / snakeyaml / wiki / Changes — Bitbucket
hubertp commented 4 months ago

I've been doing more profiling and experimenting with replacing circe yaml parsing. But I guess for now we can stay with snakeYAML only that we will need to write encoders/decoders on our own. But circe-yaml has to go: Screenshot from 2024-06-13 14-31-13 Screenshot from 2024-06-14 09-42-17

As one can easily see, SnakeYAML takes only a fraction of the whole circe parsing.

enso-bot[bot] commented 4 months ago

Hubert Plociniczak reports a new STANDUP for the provided date (2024-06-17):

Progress: Investigation into using some out of the box solution for our YAML parsing that does not involve circe failed. Notice that SnakeYAML's API is really lacking in documentation. Will need to dig into source code of the one of the libraries based on it to write a custom parser. Issues on a local, under-powered machine seem to be resolved but cloud still non-functional. Filled #10302 as a blocker. It should be finished by 2024-06-20.

Next Day: Next day I will be working on the #9113 task. Write custom YAML parser.

enso-bot[bot] commented 4 months ago

Hubert Plociniczak reports a new STANDUP for the provided date (2024-06-18):

Progress: Implementing our own parser based on SnakeYAML data structures. Looks very promising. It should be finished by 2024-06-20.

Next Day: Next day I will be working on the #9113 task. Continue writing the parser.

enso-bot[bot] commented 4 months ago

Hubert Plociniczak reports a new STANDUP for yesterday (2024-06-19):

Progress: Continued working on the implementation. Managed to parse some trivial Configs. Should have a draft PR soon. Various meetings and syncs. It should be finished by 2024-06-20.

Next Day: Next day I will be working on the #9113 task. Continue writing the parser.

hubertp commented 4 months ago

I wlll create a separate ticket for 2.3 SnakeYAML upgrade.

enso-bot[bot] commented 4 months ago

Hubert Plociniczak reports a new 🔴 DELAY for yesterday (2024-06-24):

Summary: There is 8 days delay in implementation of the Use SnakeYAML for YAML parsing (instead of circe) (#9113) task. It will cause 0 days delay for the delivery of this weekly plan.

Delay Cause: Holiday in between, underestimated how much custom existing circe decoders & encoders were.

enso-bot[bot] commented 4 months ago

Hubert Plociniczak reports a new STANDUP for yesterday (2024-06-24):

Progress: Reduced time - testing newly added YAML decoders and fixing bugs. Catching up others work. It should be finished by 2024-06-28.

Next Day: Next day I will be working on the #9113 task. Continue replacing YAML parser

enso-bot[bot] commented 4 months ago

Hubert Plociniczak reports a new STANDUP for the provided date (2024-06-20):

Progress: Added most of decoders necessary for parsing Config YAML. Sent draft PR illustrating the new approach, needs more testing and better coverage of all classes. It should be finished by 2024-06-28.

Next Day: Next day I will be working on the #9113 task. Continue replacing YAML parser

enso-bot[bot] commented 4 months ago

Hubert Plociniczak reports a new STANDUP for the provided date (2024-06-25):

Progress: Full coverage of YAML decoders. Testing revealed some issues. Started implementing encoders, which are also needed. PR reviews It should be finished by 2024-06-28.

Next Day: Next day I will be working on the #9113 task. Continue replacing YAML parser

enso-bot[bot] commented 4 months ago

Hubert Plociniczak reports a new STANDUP for yesterday (2024-06-26):

Progress: Figured out a way to implement encoders more easily by using Java collections (Scala collections are problematic/not easily supported). Removing any remaining locations of io.circe.yaml. It should be finished by 2024-06-28.

Next Day: Next day I will be working on the #9113 task. Continue replacing YAML parser

JaroslavTulach commented 4 months ago

using Java collections (Scala collections are problematic/not easily supported).

+1

enso-bot[bot] commented 4 months ago

Hubert Plociniczak reports a new 🔴 DELAY for the provided date (2024-07-01):

Summary: There is 6 days delay in implementation of the Use SnakeYAML for YAML parsing (instead of circe) (#9113) task. It will cause 0 days delay for the delivery of this weekly plan.

Delay Cause: Extended my holiday, underestimated the presence of YAML parsing all over our codebase.

enso-bot[bot] commented 4 months ago

Hubert Plociniczak reports a new STANDUP for the provided date (2024-07-01):

Progress: Discovered more places where we need to remove YAML. Testing performance. Catching up after a short break. It should be finished by 2024-07-04.

Next Day: Next day I will be working on the #9113 task. Continue replacing YAML parser

enso-bot[bot] commented 4 months ago

Hubert Plociniczak reports a new STANDUP for the provided date (2024-07-02):

Progress: Removed all invocations of Circe's YAML parser. We still use circe's JSON library, which explains some "duplicate" decoders/encoders. Confirmed performance bump on startup. Final cleanup to the PR. It should be finished by 2024-07-04.

Next Day: Next day I will be working on the #9113 task. Continue replacing YAML parser

enso-bot[bot] commented 4 months ago

Hubert Plociniczak reports a new STANDUP for yesterday (2024-07-03):

Progress: Removed circe-yaml dependency. Undrafted PR. Trying to remove as much of circe's presence as possible but it is quite entangled. It should be finished by 2024-07-04.

Next Day: Next day I will be working on the #9113 task. Look into other performance problems, multithreaded execution.

enso-bot[bot] commented 4 months ago

Hubert Plociniczak reports a new STANDUP for the provided date (2024-07-04):

Progress: Address remaining PR comments. Extensive testing. PR is done. Continued profiling the rest of startup. It should be finished by 2024-07-04.

Next Day: Next day I will be working on the #9113 task. Continue investigating high-startup cost.