PRQL / prql

PRQL is a modern language for transforming data — a simple, powerful, pipelined SQL replacement
https://prql-lang.org
Apache License 2.0
9.94k stars 218 forks source link

prql-compiler target (dialect) extension system (like Pandoc's extension) #3655

Open eitsupi opened 1 year ago

eitsupi commented 1 year ago

Pandoc has features that allow we to easily enable and disable individual features to support various Markdown flavors.

The behavior of some of the readers and writers can be adjusted by enabling or disabling various extensions.

An extension can be enabled by adding +EXTENSION to the format name and disabled by adding -EXTENSION. For example, --from markdown_strict+footnotes is strict Markdown with footnotes enabled, while --from markdown-footnotes-pipe_tables is pandoc’s Markdown without footnotes or pipe tables.

The markdown reader and writer make by far the most use of extensions. Extensions only used by them are therefore covered in the section Pandoc’s Markdown below (see Markdown variants for commonmark and gfm). In the following, extensions that also work for other formats are covered.

Note that markdown extensions added to the ipynb format affect Markdown cells in Jupyter notebooks (as do command-line options like --markdown-headings).

https://pandoc.org/MANUAL.html#extensions

Similarly, it would be great if prql-compiler had an interface to change the features of the target dialect.

For example, turn off converting to DISTINCT ON on sql.duckdb:

prql target:sql.duckdb-distinct_on

from employees
group city (
  take 1
)
select last_name

If we have a feature like this, we can easily define something like "Redshift dialect, that is similar to Postgres dialect but without DISTINCT ON". It also works as a workaround for issues #3111.

max-sixty commented 1 year ago

Yes, I think we could have something like a feature flag, a bit like rust's nightly features. This would also let us add experimental features, such as loop in its early days, with fewer stability guarantees (particularly as we want to offer more stability guarantees to the core language going forward).

It does increase the cardinality of the compiler, so I would vote to limit it to newer features and, unless we really find cases that require it, remove the features once we decide whether to stabilize them or not (again similar to rust).

I would probably not have it combined with the target, so instead it might look something like:

prql target:sql.duckdb features:["no-distinct-on"]
...
eitsupi commented 1 year ago

Feature flags for the compiler as also suggested in #3481 would definitely be useful, but I think that is a bit different from my requirements. In other words, the fact that Redshift does not adopt DISTINCT ON has nothing to do with the stability of prql-compiler functionality.

max-sixty commented 1 year ago

In other words, the fact that Redshift does not adopt DISTINCT ON has nothing to do with the stability of prql-compiler functionality.

Yes good point, I agree it's quite different.

Re having lots of flags to turn on/off features — I worry that it would cause high cardinality in the compiler. Instead of needing to be correct for n dialects, we'd need to be correct for 2^m options (and unlike with unstable flags, m would accumulate, and have stability guarantees (when we get those)).

How do we trade it off against adding more dialects? Ideally we could add these without too much work — can we add a redshift one by copy-pasting the postgres and changing distinct on? Or is there a long tail of these such that we do need to be able to turn on & off individual features?

eitsupi commented 1 year ago

How do we trade it off against adding more dialects? Ideally we could add these without too much work — can we add a redshift one by copy-pasting the postgres and changing distinct on? Or is there a long tail of these such that we do need to be able to turn on & off individual features?

Considering the flexibility, it would be nice to be able to define the dialect in a definition file, like a function module.

In other words, the following

https://github.com/PRQL/prql/blob/fb6dd86465304ac9ab53a46adbd05a394649a3f9/crates/prql-compiler/src/sql/dialect.rs#L187-L196

...It can be defined in a yml file like this:

postgres:
  requires-quotes-intervals: false
  supports-distinct-on: true

Maybe a dialect module like the function module?

max-sixty commented 1 year ago

I don't have a strong objection...

...but in my experience it often seems most attractive to templatize things when they're changing a lot, which they typically are at the beginning.

But then if they're not expected to change in the future, then the template infrastructure doesn't end up paying for itself — we have a thing that reads yaml and generates traits at compile time! — and it's simpler to have it in native code.

That said, no strong objection. And if we find that we need to make lots of changes to dialects, or we want to allow non-technical folks to create new dialects, then that weights on templatizing it...

eitsupi commented 1 year ago

I'm just writing an idea and I don't want this feature that much, there are not that many SQL dialects and it should be possible to modify the SQL generated by other solutions.

(And I just used YAML for illustrative purposes, not because I want YAML in the end.)