easafe / purescript-droplet

Type-safe ORM for PureScript
https://droplet.asafe.dev
MIT License

Type level parsing of raw sql #17

Open mikesol opened 2 months ago

mikesol commented 2 months ago

I was looking for a PS postgres library and this seems like it could be a really good fit! Is the library still maintained? If so, would you be open to using VTAs instead of proxies in some places?

easafe commented 2 months ago

Hey there, yes, I still use this for https://mero.chat so it is maintained.

The library is older than visible type applications so yeah I am open to anything that could speed up compilation/simplify the API.

mikesol commented 2 months ago

Cool, I'll make a PR this week 👍

easafe commented 2 months ago

Nice!

mikesol commented 2 months ago

Started taking a look this morning 🌞

It's tricky because the proxies are used to build up lots of stuff. For example, there's no clean VTA way to turn u ... id into "u" ... "id" unless you do @("u" ... "id") and make ... an infix type operator.
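
For concreteness, a minimal sketch of that infix type operator idea (Path and the ... alias are hypothetical names, not anything Droplet defines today):

data Path (alias :: Symbol) (column :: Symbol)

infixr 7 type Path as ...

-- with that in scope, a qualified column could be passed as a VTA:
-- select @("u" ... "id")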

The VTAs would need to apply to even larger types, meaning you'd need @("u" ... "id" # As "id"). That gets thorny because As needs to pull double duty, meaning both @("u" ... "id" # As "foo") and @("id" # As "foo") are valid. While it's possible to construct these types, it means that the constraints need to move from ToAs on as to whatever is consuming the VTA, i.e. select. But then you have two problems:

  1. The IDE will underline the select instead of the as.
  2. The compilation error will happen deeper down a constraint hierarchy, which makes it difficult for the consumer to understand what went wrong. To remedy this, you need to do extensive bookkeeping and make a custom Fail (see the sketch after this list).
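
For reference, a minimal sketch of that custom Fail idea, using Prim.TypeError; CheckAlias and its instances are made up purely for illustration:

import Prim.TypeError (class Fail, Beside, Text)

-- hypothetical check: only the alias "u" is accepted, anything else fails
-- with a readable message instead of a deep unsolved-constraint error
class CheckAlias (alias :: Symbol)

instance CheckAlias "u"
else instance
  Fail (Beside (Text "Unknown table alias: ") (Text other)) =>
  CheckAlias other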

The endgame is that you wind up basically reconstructing an SQL parser at the typelevel, which is a stone's throw away from parsing a typelevel string containing SQL. The latter could be pretty fun - that's how, for example, I made pursx in Deku. I always copy/paste SQL from ChatGPT, so this'd be a lifesaver.

Anyway, I digress. While VTAs are cool, they create a yak that just gets hairier and hairier the more you try to shave it. What do you think?

easafe commented 2 months ago

So your plan was to use type applications instead of proxies for the selected columns?

While currently every column must have a name, the only place that actually requires aliases is when joining. This is a gigantic hack, so maybe it could work if you moved all the logic to the last-minute checks in ToQuery.

which is a stone's throw away from parsing a typelevel string containing SQL

I remember trying this at first and it being way too slow to compile (and harder to compose). Maybe now it is better? I had a re-evaluation of the current approach on my todo list for the semi-near future. Breaking changes are fine as long as it is still type safe and (somewhat) composable.

mikesol commented 2 months ago

The compiler has improved a lot; pursx compiles reasonably fast.

The nice thing about raw SQL these days is that models like GPT-4 and Claude are trained on bucketloads of it, so you go much faster if you can work with that.

easafe commented 2 months ago

I dunno about LLMs, but I'd take SQL strings just cos they're more familiar to write. I sketched an approach with type-level string queries and VTAs for both table definitions and parameters. Whenever I have the time I will try to implement it, but it might not be soon.

mikesol commented 2 months ago

Nice!

I thought a bit about it as well. Here are my initial musings...

You can define a db's tables like so:

type User = { id :: String, active :: Boolean, username :: String }
type Friendship = { id :: String, user1 :: From User "id", user2 :: From User "id" }
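
Here From is just an imagined foreign-key marker pointing at another table's column; nothing like it exists in Droplet today, but roughly:

data From (table :: Type) (column :: Symbol)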

And then construct a world that contains all of the tables as proof:

db = world @(User : Friendship : Nil) -- typelevel list

Then, at some point you ask GPT to construct your query: https://chatgpt.com/share/02e8d2b9-cf73-4ab6-9435-f09adfb1ebe3.

Paste it in:


res = query db @"""SELECT
    u1.id AS user_id,
    u1.username AS username,
    u2.id AS friend_id,
    u2.username AS friend_username
FROM
    Friendship f
JOIN
    User u1 ON f.user1 = u1.id
JOIN
    User u2 ON f.user2 = u2.id
WHERE
    u2.active = true;
"""

And `res` would automatically be of type `Array { user_id :: String, username :: String, friend_id :: String, friend_username :: String }` from introspecting the query.
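
The tail end of that introspection could look something like this sketch (SNil, SCons and SelectionToRow are all invented names): fold the parsed list of selected columns into a row with Prim.Row.Cons.

import Prim.Row (class Cons)

-- hypothetical output of the type-level parser: a list of (alias, type) pairs
data SNil
data SCons (alias :: Symbol) (ty :: Type) rest

-- e.g. SCons "user_id" String (SCons "username" String SNil)
--   ~> ( user_id :: String, username :: String )
class SelectionToRow (selection :: Type) (row :: Row Type) | selection -> row

instance SelectionToRow SNil ()

instance
  ( SelectionToRow rest tail
  , Cons alias ty tail row
  ) =>
  SelectionToRow (SCons alias ty rest) row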

GPT basically never gets these queries wrong, so it may be enough just to focus on the columns to construct the return type without needing to validate any of the other stuff.

mikesol commented 2 months ago

Actually, even what I'm proposing above may be overkill. If you do correct RAG, you don't even need that. Check out https://chatgpt.com/share/4289e5f6-72e0-450a-812e-e0fa8aa086e7 . It correctly generates:

query db @(Array { id1 :: String, id2 :: String }) @"""
  SELECT f.user1 as id1, f.user2 as id2
  FROM Friendship f
  JOIN User u ON f.user2 = u.id
  WHERE u.active = true
"""

At which point the PS framework can be super small.
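
For instance, the user-facing surface could shrink to a single entry point shaped roughly like this (Db, query and the stub body are all hypothetical; a real version would add a constraint that parses the SQL and checks it against the schema):

import Data.Symbol (class IsSymbol)
import Effect.Aff (Aff)
import Partial.Unsafe (unsafeCrashWith)

-- hypothetical handle carrying the schema as a phantom row of tables
data Db (tables :: Row Type) = Db

-- the result type and the SQL are both passed as visible type applications,
-- mirroring the snippet above
query
  :: forall @result @sql tables
   . IsSymbol sql
  => Db tables
  -> Aff result
query _ = unsafeCrashWith "sketch only: would reflect sql and run it here"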

easafe commented 2 months ago

If you just need something to automatically parse query results into objects, Droplet.Driver.Unsafe already exists and does exactly that.

I was thinking of how to keep the interface type safe but switch to type-level strings. The tables and columns and whatnot could stay as they are. I think the only new heavy lifting would be parsing the SQL query string.

mikesol commented 2 months ago

Yeah, exactly. The table stuff wouldn't change much. The typelevel parser would be annoying for the reason writing any parser by hand is annoying. If you get it into a typelevel ADT, validating it against a schema and generating the return type should be less painful.
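
The flavour of that hand-written work, at the type level, is peeling the Symbol apart one character at a time with Prim.Symbol.Cons; a made-up example (SkipSpaces is not part of any library):

import Prim.Symbol (class Cons)

-- drop leading spaces from a type-level string
class SkipSpaces (input :: Symbol) (rest :: Symbol) | input -> rest

instance SkipSpaces "" ""
else instance
  ( Cons head tail input
  , SkipSpacesHead head tail rest
  ) =>
  SkipSpaces input rest

class SkipSpacesHead (head :: Symbol) (tail :: Symbol) (rest :: Symbol) | head tail -> rest

instance SkipSpaces tail rest => SkipSpacesHead " " tail rest
else instance Cons head tail rest => SkipSpacesHead head tail rest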

easafe commented 2 months ago

validating it against a schema

Do you mean querying the database schema?

mikesol commented 2 months ago

Yeah, for example if you parse the typelevel String into a typelevel ADT that has a SELECT statement of columns foo and bar from table baz, the next step (still at the typelevel) would be to look at the schema and verify there is in fact a table baz with columns foo and bar (and also check their types if there are where statements or joins).
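
A rough sketch of that check (HasColumn and Schema are invented names): model the schema as a row of tables, each table a record, and look both labels up with Prim.Row.Cons.

import Prim.Row (class Cons)

-- holds when `schema` has a table `table` with a column `column` of type `ty`
class HasColumn (table :: Symbol) (column :: Symbol) (ty :: Type) (schema :: Row Type)

instance
  ( Cons table (Record columns) schemaRest schema
  , Cons column ty columnsRest columns
  ) =>
  HasColumn table column ty schema

-- e.g. with this schema, HasColumn "baz" "foo" Int Schema is solvable,
-- while HasColumn "baz" "nope" Int Schema is a compile error
type Schema = ( baz :: { foo :: Int, bar :: String } )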

easafe commented 2 months ago

Ah okay, so how it already works. I thought you meant asking the actual database, which is a thing I saw some Haskell library do.

mikesol commented 2 months ago

Today I started putting together a little project to rewrite the HTML parser in Deku. It's going to be a full-fledged typelevel parser. When I finish it (maybe over the weekend), take it for a spin and see if it works with a rudimentary SQL ADT. If so, it'd be great if a single typelevel parser could be used for both projects.

Here's the repo: https://github.com/mikesol/purescript-tldr

There are already a few tests of the combinators and matchers.

easafe commented 2 months ago

Nice! That looks great. I am a bit busy with https://github.com/easafe/purescript-flame/pull/87 but after that I will try it for sure

mikesol commented 2 months ago

Just finished the initial work on the repo. It's completely undocumented, so whenever you have time to dip into it, let me know and we can sync on how it works. That'll be a good time for me to add docs and format error messages a bit better. There are tests for everything, though.

A few things to know:

  1. The API follows purescript-parsing naming conventions as closely as possible.
  2. Recursive stuff is done with a Fix operator. That'll be important for the WHERE clause. There's an example in the tests.
  3. I made a really tiny SQL test just to show the very basics. Theoretically, the API covers anything that one can do in G4, including cobbling together regexes, so one can choose the useful bits from something like their Postgres spec and build up the relevant parts.

easafe commented 2 months ago

That was fast; incredible work. It should make it easier to write a more proper type checker too. How are you finding compilation times so far?

mikesol commented 2 months ago

The compilation times are manageable so far; the tests have lots of examples and compile reasonably fast in my IDE.

mikesol commented 2 months ago

Had a bit of time to stress test the lib with CREATE statements: https://github.com/mikesol/purescript-tldr/blob/a4d0b73c67ab7a093c9c6d1ba248263b97f06f11/test/Test/Sql1.purs#L77. It's quite fast; I didn't notice any lag in my IDE on recompilation. My hope is to establish the world of tables this way, which will make it easy to use tools like apgdiff for migrations. What'd be even better is if Vite could somehow inject a full .sql file into the type so that you could keep all the table definitions outside of PureScript.

If you hover over mytables0 in the IDE, it generates this type, which IMO is usable for downstream parsing:

Proxy (Success (Cons (TableDef (Proxy "foo") (Cons (ColumnDef (Column (Proxy "id")) SInt) (Cons (ColumnDef (Column (Proxy "lastname")) SText) (Cons (ColumnDef (Column (Proxy "firstname")) SText) Nil)))) (Cons (TableDef (Proxy "bar") (Cons (ColumnDef (Column (Proxy "id")) SInt) (Cons (ColumnDef (Column (Proxy "time_created")) SInt) (Cons (ColumnDef (Column (Proxy "nickname")) SText) Nil)))) (Cons (TableDef (Proxy "baz") (Cons (ColumnDef (Column (Proxy "id")) SInt) (Cons (ColumnDef (Column (Proxy "username")) SText) Nil))) (Cons (TableDef (Proxy "qux") (Cons (ColumnDef (Column (Proxy "id")) SInt) (Cons (ColumnDef (Column (Proxy "user1")) SText) (Cons (ColumnDef (Column (Proxy "user2")) SText) (Cons (ColumnDef (Column (Proxy "user3")) SText) (Cons (ColumnDef (Column (Proxy "user4")) SText) (Cons (ColumnDef (Column (Proxy "user5")) SText) Nil))))))) (Cons (TableDef (Proxy "goo") (Cons (ColumnDef (Column (Proxy "id")) SInt) (Cons (ColumnDef (Column (Proxy "last_updated")) SInt) Nil))) (Cons (TableDef (Proxy "fooA") (Cons (ColumnDef (Column (Proxy "id")) SInt) (Cons (ColumnDef (Column (Proxy "lastname")) SText) (Cons (ColumnDef (Column (Proxy "firstname")) SText) Nil)))) (Cons (TableDef (Proxy "barA") (Cons (ColumnDef (Column (Proxy "id")) SInt) (Cons (ColumnDef (Column (Proxy "time_created")) SInt) (Cons (ColumnDef (Column (Proxy "nickname")) SText) Nil)))) (Cons (TableDef (Proxy "bazA") (Cons (ColumnDef (Column (Proxy "id")) SInt) (Cons (ColumnDef (Column (Proxy "username")) SText) Nil))) (Cons (TableDef (Proxy "quxA") (Cons (ColumnDef (Column (Proxy "id")) SInt) (Cons (ColumnDef (Column (Proxy "user1")) SText) (Cons (ColumnDef (Column (Proxy "user2")) SText) Nil)))) (Cons (TableDef (Proxy "gooA") (Cons (ColumnDef (Column (Proxy "id")) SInt) (Cons (ColumnDef (Column (Proxy "last_updated")) SInt) Nil))) Nil)))))))))) "")

easafe commented 2 months ago

That's pretty good. I will take a look sometime this week.

easafe commented 1 month ago

What solution are you currently using @mikesol? I didn't forget about this, just haven't had the time yet.

mikesol commented 1 month ago

Do you mean for SQL? I'm using https://github.com/mikesol/oooooooooorrrrrrrmm .