databricks / sjsonnet

Apache License 2.0
267 stars 55 forks source link

*New* Add parseYaml() Function #142

Closed agrawroh closed 1 year ago

agrawroh commented 2 years ago

Description

This PR adds a new parseYaml() function to sjsonnet.

JSONNET Reference Doc


Signed-off-by: Rohit Agrawal rohit.agrawal@databricks.com

lihaoyi-databricks commented 2 years ago

@szeiger could you review this? since you guys own sjsonnet these days :)

szeiger commented 2 years ago

Can we at least skip the intermediate JSON serialization and translate the SnakeYAML AST directly? If we add YAML parsing and use it in our codebase, I expect performance to be an issue. Ideally, we'd parse directly to an Sjsonnet AST (like in ValVisitor for parsing JSON) but I don't know if SnakeYAML supports that. I tried to look it up but the project appears to be inaccessible in a private BitBucket repository, which is a bit worrisome for a new dependency. Do you know more about the status of SnakeYAML?

agrawroh commented 2 years ago

Can we at least skip the intermediate JSON serialization and translate the SnakeYAML AST directly? If we add YAML parsing and use it in our codebase, I expect performance to be an issue. Ideally, we'd parse directly to an Sjsonnet AST (like in ValVisitor for parsing JSON) but I don't know if SnakeYAML supports that. I tried to look it up but the project appears to be inaccessible in a private BitBucket repository, which is a bit worrisome for a new dependency. Do you know more about the status of SnakeYAML?

@szeiger I don't know much about SnakeYAML except that it appears in almost all the threads I could find for parsing YAML in scala. It's also used at some other places in our universe. I could also find this other library which converts SnakeYAML AST to Circle AST (JSON) [Ref] My use-case was to parse and validate the GOC file(s) which are YAMLs inside my service's JSONNET template. I am not sure if we'd be using the YAML parsing at a lot of other places but the performance could be a concern.

JSONNET seems to be using this CPP library called RapidYAML for implementing the parseYaml() function.

Any suggestions on how we can implement this?

szeiger commented 2 years ago

@agrawroh I also found most recommendations pointing to SnakeYAML. I was surprised to see that it doesn't have a public repo or docs. Maybe this is a recent change?

Transforming the AST for at least decent performance is probably easy. It looks like SnakeYAML already uses normal Java maps for objects. The primitives are probably simple wrappers, similar to a JSON AST. This shouldn't take more than a few lines of code for a huge performance improvement.

The ability to easily skip the SnakeYAML AST entirely depends on the library's design (which is why I was looking for docs). uPickle makes this very easy (which allows us to implement a JSON to Sjsonnet AST parser with no intermediate JSON AST in 46 LOC). It may be harder to do and the expected pay-off is much smaller.

agrawroh commented 2 years ago

@szeiger BTW Did you try using this link to access SnakeYAML repository? https://bitbucket.org/snakeyaml/snakeyaml/src/master/

szeiger commented 2 years ago

Thanks, that link works. Every page I found so far referred to https://bitbucket.org/asomov/snakeyaml

agrawroh commented 1 year ago

Stamping to get this unblocked. My concerns about the performance still stand. We should at least skip the intermediate JSON serialization and deserialization. If this ends up being used with large YAML files, we're likely to run into performance issues with the current design.

Will try to get this in to unblock Cloud Team. I will have a few free cycles next month. Would definitely try to work on the performance optimizations for this as well as compression which started taking a lot of time.