facebookincubator / CG-SQL

CG/SQL is a compiler that converts a SQL Stored Procedure like language into C for SQLite. SQLite has no stored procedures of its own. CG/CQL can also generate other useful artifacts for testing and schema maintenance.
https://cgsql.dev/
MIT License
392 stars 34 forks source link

Simplified CQL Syntax #168

Open RaoulFoaleng opened 1 year ago

RaoulFoaleng commented 1 year ago

CQL current language syntax is generally not common to other popular languages used for mobile development: Swift, Kotlin, TypeScript, etc We’d like to integrate into the CQL language a set of common syntax features that people are familiar with. In the goal of making the language more common and easy for people to understand it quickly and use it efficiently.

Style: Stop using SHOUT CASE

CREATE PROCEDURE <name>(

This is a minor stylistic issue but something we could lint for.

create procedure <name>(

Change “create procedure” to “function”

CREATE PROCEDURE <name>(

Use a more modern style

function <name>(

Allow “private” keyword on functions

@attribute(cql:private)
CREATE PROCEDURE <name>

Use language primitives more similar to class-based languages.

private function <name>

Updated type annotations

CREATE PROCEDURE tester(
  thread_pk_ LONG INT NOT NULL,
  contact_pk_ LONG INT
)

Use more modern style annotations that match an existing popular language.

CREATE PROCEDURE test_function(
  thread_pk_: long,
  contact_pk_: long?

Begin/End → {}

CREATE PROCEDURE tester()
BEGIN
  ...
END;

Use curly braces brackets.

CREATE PROCEDURE tester() {
  ...
}
function tester()
{
  ...
}

Allow Trailing Commas

CREATE PROCEDURE tester(
  thread_pk_ LONG INT NOT NULL,
  contact_pk_ LONG INT
)

Allowing trailing commas is a small change but it allows for more consistency in typing and produces cleaner diffs because adding a parameter does not require editing the comma above.

CREATE PROCEDURE tester(
  thread_pk_ LONG INT NOT NULL,
  contact_pk_ LONG INT,
)

Remove “Call”

CALL subfunction(a, b, c);

Using CALL is non-standard compared to other languages. Instead just allow functions to be called directly.

subfunction(a, b, c);

Get rid of “SET” and “:=”

LET count := 5;
SET count := count + 1;

These are also not similar to how modern languages work.

let count = 5;
count += 1;

Support Interpolation Syntax

SELECT a
FROM foo
WHERE bar = bar_;

The fact that we insert variables directly into SQL does not require underscores, but it necessitates it to make the code readable. We should consider some other way of denoting variables inside of SQL, perhaps something that matches how modern languages do string interpolation.

SELECT a
FROM foo
WHERE bar = ${bar};

Example:

CQL:

@attribute(cql:private)
CREATE PROCEDURE tester (
 thread_pk_ LONG INT NOT NULL @sensitive,
 contact_pk_ LONG INT
)
BEGIN
 LET count := 5;
 SET count := count + 1;

 CALL subfunction(count);

 SELECT a
   FROM foo
   WHERE bar = bar_;
END;

Simplified CQL:

private function tester(
  thread_pk_: long @sensitive,
  contact_pk_: long?,
) 
{
  let count = 5;
  count += 1;

  subfunction(count);

  SELECT a
    FROM foo
    WHERE bar = ${bar_};
}

Both codes are semantically the same and produce the exact same output except that the latter is using commonly known syntax features (=, ?, +=, etc) that are already used in a bunch of languages which make it easy for developers to understand CQL language. To support Simplified CQL syntax we don’t really need to make deeper changes in the CQL compiler. Just adding new grammar rules and a new AST transform step before semantic analysis step (to transform the AST of the simplified CQL code to existing CQL code) are enough to achieve this.

cc @ricomariani , @toddkrabach

mingodad commented 1 year ago

Of course you can do it in a fork and try to see if people will prefer it. I personally think that the actual syntax is fine in general.

ricomariani commented 1 year ago

I've never been a big fan of going in this direction, not because I fear syntax changes but more because I just don't think it's the best use of limited resources to make superficial changes. We've done a few in the last year but all of those were 10 minute jobs. Some of these have more profound consequences and materially change the feel of the langauge. Not necessarily in a good way.

When comparing syntax used by CQL the most important point of comparison is a normal stored procedure language. The ones that have been in this space the longest and had time to consider the problems unique to this space. The ones users might be familiar with and might want to migrate to. CQL already has signficant extensions but it still feels very SQLish. It's also a bit close-minded to consider only the languages with C-like syntax. Modern languages also include Python, Ruby, Haskell, F#, and Lua. Even if you shun the old guard COBOL, Fortran, Algol, Basic... There are many ways to do this well.

But the main observation is that most of these things do not materially contribute to anyone's actual pain and there are much more serious problems that do. For instance, a CQL beautifier would go a million miles towards making the dev experience better. CQL REPL (maybe incorporating the @mingodad work) again would be awesome. These things are big quality of life improvers. But notwithstanding all that, let's go over the list and discuss each one.

Style: Stop using SHOUT CASE

I'd fix this with an auto beautifier. But many people seem to like CREATE PROC even though create proc works. Personally I dislike the all caps but when in Rome.

Change “create procedure” to “function”

CQL distinguishes functions from procedures. So you already have declare function. You'd need to do something about that. This ends badly. IF you want to make this easier make the create optional. We did a bunch of "Carbon like" forms.

PROC name is short and conflicts with nothing. FUNCTION just confuses the issue. In SQL procs are different beasts.

Private functions

private isn't going to make a great new keyword. It will conflict. But maybe you'd like simplified attributes? Something like [private, cool(5), unit_test] proc foo would break nothing and means the same as @attribute(cql:private) etc. That one is a 10 minute job :D

Use more modern style annotations that match an existing popular language.

You will never succeed at changing LONG to mean LONG NOT NULL -- and if you did that stuff it would break all kinds of schema. You'd no longer be able to parse SQLites own output. Consider instead maybe long! to mean long not null and long? is the same as just long but explicit. You could add a strict mode requiring ? if you really wanted to. This one would be a crazy uphill battle and if you pulled it off then reading CQL schema would be unlike reading any other schema. Just ! to avoid typing not null a lot might go a long way and that's easy.

In this example:

PROC test_function(
  thread_pk_: long,
  contact_pk_: long?
)

The : adds nothing, it could be just whitespace and : already has meaning in CQL.

Trailing , already works for args I think (though its an accident) supporting it everywhere wouldn't hurt anything.

Use curly braces

I can't really support this at all:

proc tester()
{
  ...
}

There are lots of reasons for this, the first is that the braces are not in any way "better" but the most important reason is this: You are not in C. You are not in Typescript. You are not in Java. Those languages all have expression binding order that is very similar (even the wierd bits). Giving people clear cues that this is not a C context is really important because the binding strength must be the SQL strength (if it wasn't then the same expressions would evaluate differently in SQL contexts vs. non SQL contexts (unless you plan to rewrite the SQL to a different paren shape? That's a horrible idea don't do that.)

Expressions like a | b & c are not the same. But there are other bigger differences, like how NULL is handled for instance. You really do not want people thinking it's just like typescript. It isn't. It's like SQL.

Allow Trailing Commas

Trailing , already works for args I think (though its an accident) supporting it everywhere wouldn't hurt anything.

Remove CALL

In a language with statements not the same as expressions and a yacc parser you get exactly one chance to put an arbitrary identrifier at the start of a statement. After that you will start to get shift reduce conflicts. Think this one through carefully. At the moment the left side of an assignment is always super simple and has := so you can have both but not for long. You close off a lot when you do this. Similar warning for SET below. That one has more utility.

Get rid of “SET” and “:=”

In SQL = is equality not assignment. If you get rid of := you will make tons of code be erroneous. On the other hand you can get rid of SET and maybe be ok. But only because presently the left hand side of := is limited. If you ever wanted to do properties like

x.y := 5 then you have shift/reduce conflicts without the set. To fix that you would have to make assignment an expression type which would cause all manner of other issues because assignment is only a top level thing right now. The codegen knows this and makes better use of temporaries because of it.

Someday you might want to do x[1] := 5 -- adding build in arrays to the language, or even dictionaries is a big win. No problems if SET stays. Take it out and you get a major grammar refactor.

LET count := 5; SET count := count + 1;

These are also not similar to how modern languages work.

Looks just like Python, or Lua to me.

Language Syntax to increment an integer
Lua num = num + 1 or num += 1
Haskell let num' = num + 1
Python num += 1 or num = num + 1
Ruby num += 1 or num = num + 1
JavaScript num++ or num += 1 or num = num + 1
Java num++ or num += 1 or num = num + 1
C num++ or num += 1 or num = num + 1
Swift num += 1 ornum = num + 1
T-SQL SET @num := @num + 1
Basic num = num + 1 or num += 1

Support Interpolation Syntax

SELECT a FROM foo WHERE bar = bar_; The fact that we insert variables directly into SQL does not require underscores, but it necessitates it to make the code readable.

This isn't at all about readbility. The name is ambiguous. The first bar refers to the column. The second refers to a local variable. This isn't just triviality. bar = bar is trivially true for all rows regardless of which bar you pick. So they must be different. This isn't at all about Interpolation. The engine is not confused what is an expression and what isn't, the ambiguity is genuine.

If you like you could use $bar to mean "for sure the local variable" and then you could make it so that bar means "for sure the column" except what about when bar is used in some other context? Some SQLs take the position that ALL locals must begin with @.

You could implement $bar pretty quickly but you would have to make it so that the resolution of the ambiguity is clear to the reader in all other cases. I guess you could make it so that in SQL contexts you MUST use $foo. What about cursors? C.foo? My head hurts.

Importantly, making parallels to string interpolation is unhelpful. This isn't an interpoloation problem at all. There is no problem with "I don't want to escape foo", the compiler knows the identifiers, they are just ambiguous. If did get the $ business working you would basically have changed all the identifiers from _foo to $foo. I'm not sure how that helps... ${foo} is just worse. Lots of languages have this problem. E.g., people write m_foo = foo for the same reasons.

Summary:

With simple stuff you could probably do this:

CQL:

@attribute(cql:private)
CREATE PROCEDURE tester (
 thread_pk_ LONG INT NOT NULL @sensitive,
 contact_pk_ LONG INT
)
BEGIN
 LET bar_ := 5;
 SET bar_ := bar_ + 1;

 CALL subfunction(bar_);

 SELECT a
   FROM foo
   WHERE bar = bar_;
END;

Becomes:

[private] proc tester (
 thread_pk_ long! @sensitive,
 contact_pk_ long?
)
BEGIN
 LET bar_ := 5;
 SET bar_ += 1;  -- not mentioned but doable

 CALL subfunction(bar_);

 SELECT a
   FROM foo
   WHERE bar = bar_;
END;

Be careful, some of this stuff has profound consequences and you could easily waste weeks of work to get nothing. To say nothing of what it would take to fix the codebase. I would really consider that beautifier project. It will also give you leverage for doing big refactors later!

ricomariani commented 1 year ago

Trivia: the table for how languages increment an integer was generated by ChatGPT

ricomariani commented 1 year ago

The most important advice I can give you is this: if you do any of this stuff have a care for the future. You need to think about stuff you want to do in the future to decide grammar today. If you add some of these constructs you will cut off a lot of future paths; you want to do that carefully.

joshuaevenson commented 1 year ago

I think there is a lot of common ground to be had here that could meaningfully improve the general feel of the language. I think that it is true Rico that these are not earth shattering changes for productivity, but I do think that code reading becomes easier when code has more similarity with more modern popular languages.

Amending your example slightly with more of the suggestions from above, the code below looks very legible to me, and this is mostly minor stylistic changes:

[private] proc tester (
 $thread_pk long! @sensitive,
 $contact_pk long?
)
begin
 let $bar := 5;

 call subfunction($bar);

 select a
   from foo
   where bar = $bar;
end;
ricomariani commented 1 year ago

The same above I think is very doable without much risk of forward breakage. $foo is really not any better than _foo except you know for sure​ it can't confict with a column. You could just allow leading $ in variable names (only) and call it good on that front. Cursors don't need this but you could also let $C be a cursor leading to $C.foo


From: joshuaevenson @.> Sent: Wednesday, February 22, 2023 9:38 AM To: facebookincubator/CG-SQL @.> Cc: Rico Mariani @.>; Mention @.> Subject: Re: [facebookincubator/CG-SQL] Simplified CQL Syntax (Issue #168)

I think there is a lot of common ground to be had here that could meaningfully improve the general feel of the language. I think that it is true Rico that these are not earth shattering changes for productivity, but I do think that code reading becomes easier when code has more similarity with more modern popular languages.

Amending your example slightly with more of the suggestions from above, the code below looks very legible to me, and this is mostly minor stylistic changes:

[private] proc tester ( $thread_pk long! @sensitive, $contact_pk long? ) begin let $bar := 5;

call subfunction($bar);

select a from foo where bar = $bar; end;

— Reply to this email directly, view it on GitHubhttps://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Ffacebookincubator%2FCG-SQL%2Fissues%2F168%23issuecomment-1440487414&data=05%7C01%7C%7Cfbbf5cf0a82249f3bb6408db14fba6ca%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638126843285822720%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=JBdqAfW8AVW2CaZQ5XFZ7C3S49ZUPSKoZKdVGSWcLPs%3D&reserved=0, or unsubscribehttps://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAHC3V77XUVPF3N6CW37DPPLWYZFKNANCNFSM6AAAAAAVDQAI4I&data=05%7C01%7C%7Cfbbf5cf0a82249f3bb6408db14fba6ca%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638126843285822720%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=NpCsTpedPdHVG6FSBUTKTZGW68TTn7r8UFsLUhNwcQo%3D&reserved=0. You are receiving this because you were mentioned.Message ID: @.***>

ricomariani commented 1 year ago

Oh I guess one subtle thing. You probably don't want to allow variables named bar and $bar in any one scope. And if you do that then you can still use just bar in the C/lua codegen for debugging. Strip the leading $. Otherwise it pretty much just works. You might want to change the argument forms to this as well.

So like proc foo(like tableshape) creates arguments with $ in the name rather than leading . That would require some fixing but it's probably worth the pain.

But I still think investing in a beautifier would give you more value for the dollar.

If you make it so that $ is a valid leading character in identifiers and then make it an error if it appears in column names in schema etc. that's a pretty easy path to getting the support you want. At some point you might even add a strict rule that forces all locals/args to start with $.


From: joshuaevenson @.> Sent: Wednesday, February 22, 2023 9:38 AM To: facebookincubator/CG-SQL @.> Cc: Rico Mariani @.>; Mention @.> Subject: Re: [facebookincubator/CG-SQL] Simplified CQL Syntax (Issue #168)

I think there is a lot of common ground to be had here that could meaningfully improve the general feel of the language. I think that it is true Rico that these are not earth shattering changes for productivity, but I do think that code reading becomes easier when code has more similarity with more modern popular languages.

Amending your example slightly with more of the suggestions from above, the code below looks very legible to me, and this is mostly minor stylistic changes:

[private] proc tester ( $thread_pk long! @sensitive, $contact_pk long? ) begin let $bar := 5;

call subfunction($bar);

select a from foo where bar = $bar; end;

— Reply to this email directly, view it on GitHubhttps://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Ffacebookincubator%2FCG-SQL%2Fissues%2F168%23issuecomment-1440487414&data=05%7C01%7C%7Cfbbf5cf0a82249f3bb6408db14fba6ca%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638126843285822720%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=JBdqAfW8AVW2CaZQ5XFZ7C3S49ZUPSKoZKdVGSWcLPs%3D&reserved=0, or unsubscribehttps://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAHC3V77XUVPF3N6CW37DPPLWYZFKNANCNFSM6AAAAAAVDQAI4I&data=05%7C01%7C%7Cfbbf5cf0a82249f3bb6408db14fba6ca%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638126843285822720%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=NpCsTpedPdHVG6FSBUTKTZGW68TTn7r8UFsLUhNwcQo%3D&reserved=0. You are receiving this because you were mentioned.Message ID: @.***>

RaoulFoaleng commented 1 year ago

Hey @ricomariani,

Thanks for the feedbacks, they're really valuable. I think we have a good compromise here from your example and @joshuaevenson one.

[private] proc tester (
 $thread_pk long! @sensitive,
 $contact_pk long?,
)
begin
 let $bar := 5;
 set $bar += 1;

 call subfunction($bar);

 select a
   from foo
   where bar = $bar;
end;

I'd like get back a bit on the other stuff I proposed that you pushed back on to get some clarity and better understanding. Mainly the one that you said are going to break the CQL grammar or can't work.

Begin/End → {}

There are lots of reasons for this, the first is that the braces are not in any way "better" but the most important reason is this: You are not in C. You are not in Typescript. You are not in Java. ...

I make a distinction between the SQL syntax and CQL syntax that is not part of the standard SQL. People generally refer to online doc to learn SQL syntax and features and likely expect that online syntax to work in CQL which is the case. Now all the extra syntax (begin/end, etc) CQL is adding could be made more common to modern language for people to understand it better because there is no common standardized documentation out there for people to learn it from. SQL stays SQL but the rest of the syntax could be more common to modern languages.

[private] proc tester() → private proc tester()

private isn't going to make a great new keyword. ...

Seems like this is your main pushback against this. What I can say is that It'll feel more common to people and more align with modern languages. [...] seems to make thing even worse from readability standpoint that the attribution actually feel better.

Remove CALL

Yeah you're correct on this one and shift reduced might explode on this. I'll carefully look into this

Get rid of “SET” and “:=”

Yeah, has the same issue as removing CALL.

One thing to callout is that this proposal is to extends and NOT replace existing syntax. Basically the AST coming out of this simplified syntax is transformed to existing AST before semantic analysis step. Same as the existing rewrite feature in CQL but the difference will that this happens before the semantic analysis start.

The AST from:

private proc tester (
 a: long @sensitive,
 b: long?,
)
{
  let $bar := 5;
  set $bar += 1;

  call subfunction($bar);

  select a
    from foo
    where bar = $bar;
}

is turned into the AST for:

@attribute(cql:private)
create proc tester (
 a: LONG NOT NULL @sensitive,
 b: LONG
)
{
  let bar_ := 5;
  set bar_ := bar_ + 1;

  call subfunction(bar_);

  select a
    from foo
    where bar = bar_;
}

We want to make these changes upstream (as much as possible) without affect CQL downstream toolchains. By toolchains I mean: semantic analysis step, codegen step, other runtimes and all the tools that depend on CQL compiler outputs . In case we have to touch them, the change should be minor.

RaoulFoaleng commented 1 year ago

@ricomariani, @joshuaevenson here is the draft #1 that allow you this syntax. We haven't agreed on the curly braces {} but i wanted @ricomariani to take a second fresh look at. Also putting "private" in from work actually. I tried it in my previous draft. I'll put it in draft #2 to get your feedbacks.

procedure f (
  a: text,
  b: long?,
  c: int,
  d: real?,
  e: blob @sensitive,
  f: my_type?
) {
  select a, b
  union all
  select "a" a, 1L b;
}

https://github.com/facebookincubator/CG-SQL/commit/3671a73e2b0dca541cf89ff627b164f7a77ee889