cue-lang / cue

The home of the CUE language! Validate and define text-based and dynamic configuration
https://cuelang.org
Apache License 2.0
5.09k stars 290 forks source link

Storing and querying CUE in database #1435

Open codewithcheese opened 2 years ago

codewithcheese commented 2 years ago

At balena.io we make extensive use of manifest type files we call contracts, they are a mix of data and schema, and we have an internal database for storing and querying contracts. Contracts are stored in JSONB fields and schema is defined using JSONSchema. I opened a discussion about this see #1250.

We are keen to adopt CUE for contracts for a few reasons. Mixing data and schema with CUE is more natural and much less verbose than JSONSchema. CUE has some nice ergonomic improvements over plain JSON (see #130), like no requirement to quote property names and inline objects with a single property. Also, contracts can be evaluated together to validate compatibility, solve constraints and generate new output formats and CUE has some nice properties for evaluation vs JavaScript with JSON/JSONSchema.

As far as I know there is no way to encode and store complex CUE files include expressions and literals as data, so that concrete values can be queried and resulting files can be decoded back to CUE and expressions evaluated.

CUE to JSONSchema is an option, but it would limit encoding and decoding to what can be expressed as JSONSchema.

I propose an (en|de)coder that can convert any arbitrarily complex CUE file to a plain object format, with concrete data as literals and CUE expressions serialized to string and appended as a special properties for later decoding. Internally I am calling this format CUEdata but I am totally open to any name, such as those mentioned in #130, QSON or CUEL.

We do still have a need for JSONSchema, contracts can also include queries that are defined using a schema, we have a JSONSchema to SQL compiler. I propose an attribute could be used to control how a struct is encoded. When a format attribute is specified; CUE expressions are encoded to that format instead of being serialized to a syntax string.

cc @myitcv @mpvl

mpvl commented 2 years ago

Internally there is a definition of data-only CUE either fully expanded (data) and allowing references only (graph).

There is no file extension for these, but you can invoke them by specifying output or input types. For instance,

cue def foo.cue --out cue+data

is really like cue export.

Similarly

cue eval data: foo.cue

checks that the input has no references or expressions, while

cue eval graph: foo.cue

checks that the input has no expressions other than literals and references.

How is this different from what you want? Can you give examples of how this is different?

codewithcheese commented 2 years ago

@mpvl I see I was not clear enough. I have updated the problem statement:

As far as I know there is no way to encode and store complex CUE files include expressions and literals as data, so that concrete values can be queried and resulting files can be decoded back to CUE and expressions evaluated.

We want to store CUE expressions alongside literals, so when the data is retrieved it can be decoded back to CUE and evaluated. A JSONSchema encoder world allow for the encoding of a subset of expressions, the format I am proposing would allow for encoding all CUE expressions.

For example:

name: "example"
author: "codewithcheese"
uri: "\(envVars.HOSTNAME):\(envVars.PORT)"
envVars: {
    HOSTNAME: string
    PORT: number
    CONSTANT: 1
}

Would be encoded as follows, which can be stored in JSON field and queried for any literal like name, author, envVars.CONSTANT

name:   "example"
author: "codewithcheese"
envVars: {
    CONSTANT: 1
    $$cue:    "HOSTNAME: string\nPORT: number"
}
$$cue: "uri: \"\\(envVars.HOSTNAME):\\(envVars.PORT)\""

The result could be decoded and unified with other results that specify values for HOSTNAME and PORT and the uri expression evaluated.

JSONSchema encoder could encode envVars, but it cannot handle more complex CUE expressions like uri (if there is a way to do string interpolation that I am not aware of then imagine the expression is more complex).

See testdata in the PR for more examples.

verdverm commented 2 years ago

@codewithcheese what about storing the concrete values which can be queried for in a jsonb column and storing the original CUE source in a second, text column?

codewithcheese commented 2 years ago

@verdverm that's an option. I think that's a more complex solution than using an encoding. Both options (encoding or separate field) require a change to our data handling, however a separate field also requires a migration.

It would also mean the concrete values could get out of sync with the original source if an update was applied to one of the concrete values. Might require some extra logic. We're not likely to apply updates to the concrete values alone but its a potential pitfall down the track.

mpvl commented 2 years ago

The use case is still mostly unclear to me, but I see what your are proposing.

The semantics of this seems somewhat peculiar and specific to a certain job. There are also many aspects to consider, which would make this a massive design undertaking, as far as I can tell. For instance, why $$cue and why not attributes? Why not store it as JSON? If this is because of easy of parsing, why not have an interpreter that simply treats any non-number or non-string as an opaque expression, or drops it from the input? That would allow a relatively simple parser for regular CUE without any rewriting (note that the CUE lexer+parser is already considerably simpler than a YAML lexer+parser) . There seem to be too many specific design choices related to a custom case to warrant making this a standard thing.

Instead, it seems that users can fairly easily writing something themselves that converts an evaluated CUE value to such a format, tailored to their purpose. Alternatively, one could just store CUE as is, and have a small tool to strip any expressions from the AST before evaluation (see ast/astutil). This would just be a few lines of code and be a cleaner alternative, IMO.

So unless someone comes up with a good argument that this is a generally useful, with good justifications why a design should be exactly as proposed, and come up with an easy set of principles from which a design flows naturally, this does not seem to be something that belongs in core.

codewithcheese commented 2 years ago

@mpvl I think there is still a misunderstanding, or more likley there is something about CUE I am totally misunderstanding. I am keen to continue the discussion to get on the same page if that's ok.

I am fine for this not to be included in core, we can use it as a separate tool (if it still makes sense). My intention with the PR is to see if it something that is generally useful.

For instance, why $$cue and why not attributes? Why not store it as JSON?

$$cue is only mean to function as a field label, for a field with a syntax string value. It's not a new special identifier and only has meaning to the encoder. "CUEdata" is not really meant to be used as CUE but encoded as JSON or YAML. The goal is to make an arbitrary CUE file concrete by rewriting expressions as syntax strings while maintaining structs and their literal values, that way it can be exported as a data format like JSON and stored as structured data in a DB.

As far as I can see attributes are ignored when exporting to JSON, the purpose is to do the very opposite, to maintain the expressions, as syntax strings, in JSON. That way can they be decoded and evaluated later.

Alternatively, one could just store CUE as is, and have a small tool to strip any expressions from the AST before evaluation (see ast/astutil). This would just be a few lines of code and be a cleaner alternative, IMO.

As far as I understand CUE is not "code as data", unlike a Lisp for example. How would you store CUE file containing literals and expressions "as is" in a database so that the literal values may be queried?

verdverm commented 2 years ago

How would you store CUE file containing literals and expressions "as is" in a database so that the literal values may be queried?

The answer for CUE is probably similar to other unsupported formats like Yaml or Dhall.

codewithcheese commented 2 years ago

@verdverm seems you understand what I am getting at, what do you consider to be the pros and cons of encoding to JSON with embedded syntax strings vs 2 column solution?

verdverm commented 2 years ago

@codewithcheese the embedded idea seems arbitrary and hacky imho. I generally agree with @mpvl that something like this would require a lot more flushing out.

I would think there might be something around doing this more generally within a CUE value and then being able to support this in both Yaml and JSON. Still, this would seem better as a library than in core.

codewithcheese commented 2 years ago

@mpvl it seemed there may of been some misunderstanding of the design I proposed, I tried to clarify a few messages above. Is there anything else you would like to add before we close this issue?