PRQL / prql

PRQL is a modern language for transforming data — a simple, powerful, pipelined SQL replacement
https://prql-lang.org
Apache License 2.0

Types #381

Open aljazerzen opened 2 years ago

aljazerzen commented 2 years ago

We have mentioned many times that a type system would be beneficial because it could detect problems with your query as you write it.

I think this is a 0.3 feature, but I'm opening this issue for an overview of all the problems that a type checker should detect.

max-sixty commented 2 years ago

I really think this would be excellent. I agree it's for post-0.2 but +1 on opening the issue and adding things as we see them.

I wrote in the PRQL Proposal initially that I thought type inference would be really helpful, but that I didn't think annotations in user code would fit well. I still think we should try and do as much as possible without annotations — generally PRQL code will be quite "close to the data" — people will predominantly be writing it to access a specific, known data source, rather than writing abstract transformation functions; albeit more than SQL allows.

Re how we can learn about types & transformations if not from annotated user code:

Notably annotating the stdlib will require a system and syntax for annotations, even if we don't encourage them in user code.

A type system is difficult both per se, and because of SQL. SQL does lots of implicit conversions — actually '2021-01-01' - '2020-01-01' is allowed, because the strings are implicitly converted to dates!

aljazerzen commented 2 years ago

Well, we can support type annotations on functions but not require them. And if they are provided, we do the type checking, using information from the DB connection when one is available.
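For illustration, a rough sketch of what an optional annotation could look like (the <type> annotation syntax is hypothetical, and prices / net are made-up names):

# Without an annotation: the type of `price` is inferred from usage,
# and from the DB schema when a connection is available.
func with_tax price -> price * 1.2

# With an (optional, hypothetical) annotation: the compiler can check the body
# and every call site against the declared type, even without a DB connection.
func with_tax price<number> -> price * 1.2

from prices
derive [gross = (with_tax net)]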

My opinion on implicit type conversions is quite negative. JavaScript is an example of how automatic type conversions can lead to strange corner cases. Instead, I would prefer the Python/Rust way of type checking: be strict about types, but support operator overloading and easy explicit type conversions.
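To illustrate the contrast (to_date is a hypothetical explicit conversion function, and events / start_day / end_day are made-up names):

# SQL-style implicit conversion, which we want to avoid:
#   '2021-01-01' - '2020-01-01'   -- the strings are silently coerced to dates
# The strict alternative: subtracting two strings is a type error,
# and the conversion has to be spelled out.
from events
derive [
  duration = (to_date end_day) - (to_date start_day),
]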

max-sixty commented 2 years ago

Great, I agree with you on what would be required. Definitely agree re implicit conversions!

I'm imagining the type experience as similar to writing Rust inside a function — the compiler already knows the types of the inputs (in Rust's case because they're declared in the function parameters, in PRQL's case because the DB can tell us) and can infer almost everything else without needing the user to provide annotations.

The experience I think we should try and avoid is Java — where the inference is (/ was?) very limited and users have to write lots of annotations themselves.


Another grouping on the "types of types":

max-sixty commented 2 years ago

(@aljazerzen I labeled this "semantic", though lmk if "types" isn't actually part of that in your mental model)

rslabbert commented 2 years ago

Heya, really interested in PRQL generally and this feature specifically. Working with a tool like dbt on large-scale data warehouse projects would be greatly improved by the addition of type checking. Sharing some thoughts I've had on that concept, though I realise some of this might be slightly orthogonal to PRQL itself:

1 - Keeping compilation/checking fast by using declaration files

This one is pretty basic, but a general issue with SQL type-checking implementations is that they require an active connection to the database, which can involve a number of slow API calls. This can be sped up by caching, but it can be hard for the user to know when the cache needs to be refreshed. Additionally, it requires PRQL to know how to connect to every database source, as opposed to just serving as a compiler.

A solution we can borrow from OCaml/TypeScript/others is the idea of interface/declaration files: effectively a stripped-down subset of the language that is used to declare that a table exists and what the types of its columns are. These files can easily be auto-generated by tools, and because they form part of the codebase (and most likely the VCS), it is clearer to the user when they are outdated (combined with a CI step that checks that the definitions are up to date).

While this approach can theoretically result in a situation where the code type-checks but fails at runtime, for most real-life projects that should be rare, since data tables shouldn't change in destructive ways, and the performance benefits (and the documentation of source tables) seem to outweigh the cost.
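As a sketch of the idea, a table declaration might look something like this (the declare table syntax and the .prqld extension are both hypothetical, as are the table and column names):

# employees.prqld: auto-generated from the warehouse and checked into the VCS
declare table employees [
  id: int64,
  name: string,
  department: string,
  salary: float64,
  hired_at: date,
]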

2 - Use declaration files for external functions

This flows as a simple extension of the previous item: instead of requiring PRQL to know about every Snowflake/BigQuery/etc. function natively, declaration files can be used to declare external functions, similar to Rust's C FFI. The compiler can still bundle a core set of external functions that are known to be supported by different databases, but this easily allows users to cover newly added functions themselves without having to wait for a PRQL update, and it also handles user-defined functions natively.
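A sketch of what such declarations could look like, again with hypothetical syntax (the Snowflake functions named are real, but the signature notation is made up):

# snowflake.prqld: external declarations with a type signature but no body,
# similar in spirit to Rust's extern blocks for C functions
declare func approx_count_distinct col<column> -> int64
declare func dateadd part<string> amount<int64> value<timestamp> -> timestamp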

3 - Imports

Once declaration files are incorporated, the ability to import them is required. A global include is viable (and would be the preference for, say, external functions), but allowing per-file imports has a couple of benefits.

This does seem to increase the scope of the project a bit so I would understand if there was hesitance, but the benefits do seem to be enormous.
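A sketch of what per-file imports might look like (the import syntax and file names are hypothetical):

# marts/revenue.prql
import "sources/payments.prqld"    # table declarations generated from the warehouse
import "lib/snowflake.prqld"       # external function declarations

from payments
filter status == "completed"
aggregate [total = sum amount]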

4 - Flow through type inference

If models can depend on other models and declarations, then in theory, beyond type casts, all types can be automatically inferred in transformation code. This should be possible since transformations are only able to select from tables or models that are already declared and use functions that are fully typed.

Functions defined in PRQL would potentially need to have type specifications, but it's also possible that these can be inferred. If a parameter is used with a function or operator that requires certain types, the parameter's type can be inferred from that usage. If it isn't, the parameter could be given a generic type, so that whatever type is passed in when the function is called is passed straight back out again.

Examples:

# func identity : 'a -> 'a
func identity x -> x

# func celsius_of_fahrenheit : number -> number
func celsius_of_fahrenheit temp -> ((temp - 32) / 1.8)

# table { a: text, b: number }
from table
derive [
 # a : text
 a = (identity a),

 # b_identity : number
 b_identity = (identity b),

 # b : number
 b = (celsius_of_fahrenheit b),

 # type checking error: celsius_of_fahrenheit expected parameter temp to be number but got text
 a_fails = (celsius_of_fahrenheit a),
]

I realise this is a bit of a deviation from what has been discussed previously, but I wanted to get my thoughts written down in case they are useful for this project; no harm if they're not. I'll join the Discord in case there's a desire to discuss this further.

max-sixty commented 2 years ago

Thanks for adding that, @rslabbert.

I very much agree with the "declaration files" — column names & types are slow-changing input that would benefit from an abstraction layer like that, as well as making development much easier.

aljazerzen commented 2 years ago

I pretty much agree with everything you wrote. Connecting to a variety of databases from the compiler could not result in a fast compiler. A database-specific stdlib is also a good idea. And I'm already working on type inference for expressions and functions!

It's great that you've detailed the whole concept here; until now it was scattered across many issues and chats.

snth commented 2 years ago

Thank you, @rslabbert. I also agree with everything you wrote. I am interested in the same use case that you describe, large dbt data warehouses, and you have laid it out beautifully.

I tried to express some similar thoughts here but they were not nearly as well thought out or clearly expressed as yours. In my conception of it one would describe the schemas of the source models (in our DWH we already do this in the sources.yml files for documentation purposes) and from that point on all the types are known and could flow through the rest of the models.

If I could vote on the next features to implement, declaration files for tables, declaration files for external functions, imports, and flow-through type inference would be top of my list.

Then, if we could get dbt to allow .prql files similarly to what they are discussing around dbt Python models, I could see getting by at work using purely PRQL without any need for Jinja (assuming the imports used are sent back to the dbt API so that it can track the lineage of the models).

vanillajonathan commented 1 year ago

I don't like the name varchar because it is a weird old name not used outside of SQL; I would much prefer the name string.

The same goes for names like BIGINT, TINYINT, SMALLINT, etc. I would much prefer int8, int16, int32, and int64.
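For illustration, the kind of mapping I have in mind, assuming the widths these names conventionally imply (one, two, four, and eight bytes), which may not hold for every database:

# VARCHAR   -> string
# TINYINT   -> int8
# SMALLINT  -> int16
# INT       -> int32
# BIGINT    -> int64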

snth commented 1 year ago

Is there a universal mapping between TINYINT, SMALLINT, and BIGINT and the corresponding number of bytes? From what I recall, the meaning can be database-specific.

vanillajonathan commented 1 year ago

I don't know, which is another reason I don't like them: the ambiguity.