PRQL / prql

PRQL is a modern language for transforming data — a simple, powerful, pipelined SQL replacement
https://prql-lang.org
Apache License 2.0

feat: CLI command to autogenerate JSON Schema for PL, RQ and lineage #4698

Closed · kgutwin closed 3 days ago

kgutwin commented 4 days ago

This PR adds the command prqlc debug json-schema --schema-type TYPE. When run, it dumps a JSON Schema document for the requested IR type (currently pl, rq, and lineage).

Example:

github/prql % target/debug/prqlc debug json-schema --schema-type pl | head -30
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "ModuleDef",
  "type": "object",
  "properties": {
    "name": {
      "type": "string"
    },
    "stmts": {
      "type": "array",
      "items": {
        "$ref": "#/$defs/Stmt"
      }
    }
  },
  "required": [
    "name",
    "stmts"
  ],
  "$defs": {
    "Annotation": {
      "type": "object",
      "properties": {
        "expr": {
          "$ref": "#/$defs/Expr"
        }
      },
      "required": [
        "expr"
      ]

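For reference, here is a minimal sketch of how a Rust type can be turned into a schema like this with the schemars crate. The ModuleDef and Stmt structs below are simplified stand-ins for the real IR definitions, and this is an illustration rather than this PR's actual implementation:

use schemars::{schema_for, JsonSchema};

// Simplified stand-ins for the real IR types, which live in
// prqlc's PL/RQ modules.
#[derive(JsonSchema)]
struct Stmt {
    annotations: Vec<String>, // placeholder field
}

#[derive(JsonSchema)]
struct ModuleDef {
    name: String,
    stmts: Vec<Stmt>,
}

fn main() {
    // schema_for! walks the type graph and emits a JSON Schema
    // document, collecting nested types under $defs.
    let schema = schema_for!(ModuleDef);
    println!("{}", serde_json::to_string_pretty(&schema).unwrap());
}
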
The longer-term goal behind adding this feature is to find a way to auto-generate type hints for library integrations (Python, TypeScript, etc.). However, it may also be useful as a debugging or documentation tool.

kgutwin commented 3 days ago

Great feedback! I put together a small test and renamed the flag to --ir-type.

As a related question (toward the underlying purpose of this PR): what do you think about adding automated code generation to this repo, to keep the type hints for the bindings in sync with the gradual evolution of PRQL's IRs? In my main project, we use pre-commit hooks to trigger a re-run of client code generation whenever our backend's OpenAPI spec is updated. This works really well: the hooks ensure that a commit with changes to the backend also includes the corresponding changes to the frontend and other client libraries.

If PRQL were to adopt the same general practice, it would involve adding one or two pre-commit hooks:

  1. Optionally, pre-commit could run prqlc debug json-schema any time there's a change to the IR definition, and the output could be stored in the repo. This would allow casual browsers of the repository to read the JSON Schema without needing to run the tool, and/or the schema could be included in the documentation/web site. But if that doesn't sound useful, this step can be bundled together with the next one.
  2. pre-commit would also then run any appropriate code generation tools (I'm currently looking at datamodel-code-generator for Python, for example) whenever the schema changes; a sketch of that flow follows this list. Code generation from a schema is valuable because it imposes no extra runtime overhead on the bindings, and it stays in lockstep with versioning, which is important for an actively evolving project like this one.
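
Concretely, the flow might look like the following (a sketch: the schema and output file names are hypothetical, and datamodel-code-generator emits pydantic models by default; --input-file-type jsonschema tells it to treat the input as JSON Schema rather than OpenAPI):

prqlc debug json-schema --ir-type pl > pl-schema.json
datamodel-codegen --input pl-schema.json --input-file-type jsonschema --output prql_pl_models.py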

If you think this approach would work for PRQL, I can put it together as a PR for review. If you have other opinions or thoughts, I'm happy to accommodate. Thanks!

max-sixty commented 1 day ago

(forgive the delay; just got back from vacation)

> Optionally, pre-commit could run prqlc debug json-schema any time there's a change to the IR definition, and the output could be stored in the repo. This would allow casual browsers of the repository to read the JSON Schema without needing to run the tool, and/or the schema could be included in the documentation/web site. But if that doesn't sound useful, this step can be bundled together with the next one.

Storing them in the repo is a great idea (one of the reasons we like snapshots a lot!)

pre-commit is a clever way of doing that. One constraint is that cargo run -p prqlc -- debug json-schema won't run in pre-commit-ci, because that doesn't have internet access, which is required for building the crates. So we could instead:

> pre-commit would also then run any appropriate code generation tools (I'm currently looking at datamodel-code-generator for Python, for example) whenever the schema changes. Code generation from a schema is valuable because it imposes no extra runtime overhead on the bindings, and it stays in lockstep with versioning, which is important for an actively evolving project like this one.

This sounds interesting. I don't have much experience with these tools, so I don't know how well they work. IIUC, most folks using the bindings aren't going deep into the AST (vs. just compiling to SQL), but if this would be helpful and the tools are good, and in particular if they don't impose a cost on those just compiling to SQL, I'd be very open to it.

Thanks @kgutwin !