benjie opened this issue 6 years ago
Worth a look is the Postgres BISON rules:
https://github.com/postgres/postgres/blob/master/src/backend/parser/gram.y
Also of interest, particularly these lines of the lexer config:

Definition of comment types:
https://github.com/postgres/postgres/blob/f033462d8f77c40b7d6b33c5116e50118fb4699d/src/backend/parser/scan.l#L327-L328
https://github.com/postgres/postgres/blob/f033462d8f77c40b7d6b33c5116e50118fb4699d/src/backend/parser/scan.l#L202

Parsing C-style comments; absorbing SQL comments into whitespace.
I'm not a db guy, but if comments are just `--` and `/* */` it should be simple enough to write a comment parser that runs in tandem with the actual language parser; you'd then attach the comments to `root.comments` on the AST and it should be printable.
cc @vjeux who knows a lot more about Prettier comments.
@azz is correct. Note that you just need a tokenizer of the language to extract comments. For example, you need to make sure that comments inside of strings are not treated as comments. But writing a tokenizer that extracts comments is trivial compared to a full parser.
Once you have all the comments and location, you can feed it to prettier and you’ll get a relatively good heuristic out of the box to print them
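For example, a comment-extracting scanner might look roughly like this. This is a hypothetical, simplified sketch (`extractComments` and the comment object shape are made up for illustration): it handles `--` line comments, nested `/* */` block comments, and single-quoted strings, but not dollar-quoting or `E''` escapes, which a real SQL scanner would also need.

```javascript
// Walk the source once, skipping string literals so their contents are
// never mistaken for comments, and record each comment with its range.
function extractComments(src) {
  const comments = [];
  let i = 0;
  while (i < src.length) {
    const two = src.slice(i, i + 2);
    if (src[i] === "'") {
      // Skip a single-quoted string ('' is an escaped quote).
      i++;
      while (i < src.length) {
        if (src[i] === "'" && src[i + 1] === "'") i += 2;
        else if (src[i] === "'") { i++; break; }
        else i++;
      }
    } else if (two === "--") {
      // Line comment: runs to end of line.
      const start = i;
      while (i < src.length && src[i] !== "\n") i++;
      comments.push({ type: "line", value: src.slice(start, i), start, end: i });
    } else if (two === "/*") {
      // Block comment: SQL block comments nest, so track depth.
      const start = i;
      let depth = 0;
      while (i < src.length) {
        if (src.slice(i, i + 2) === "/*") { depth++; i += 2; }
        else if (src.slice(i, i + 2) === "*/") { depth--; i += 2; if (depth === 0) break; }
        else i++;
      }
      comments.push({ type: "block", value: src.slice(start, i), start, end: i });
    } else {
      i++;
    }
  }
  return comments;
}
```

The `start`/`end` offsets are what prettier needs to locate each comment relative to the AST nodes.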
Awesome; can you point me at a simple-ish example? I’m also interested to know how the comments might work mid-statement - do we just attach them to the end of the statement or is there a way to maintain their association (e.g. with a specific column in a big “create table” statement)?
The logic to figure out where to attach nodes is here: https://github.com/prettier/prettier/blob/master/src/main/comments.js#L183
Each comment must be attached to an AST node with a mode of either "before", "after", or "inside". Then when we print each node, we automatically print the associated comments in the right place. So the print code is mostly free of comment handling.
In order to find where to attach comments, we use heuristics. For example if the comment is on its own line, we attach it "before" the next ast node. If it's the last token of the line and there's something before, we attach it "after" the previous ast node, etc...
Those simple heuristics work most of the time, but people are imaginative in where they put comments, so we have custom logic to figure out where to put them based on the ast structure.
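The own-line/trailing heuristic described above might look roughly like this. This is an illustrative sketch, not prettier's actual implementation; `attachComment` and the node/comment shapes are invented for the example.

```javascript
// Decide where a comment attaches, given the nodes immediately before and
// after it in source order. A comment alone on its line goes "before" the
// next node; a comment trailing other text goes "after" the previous node.
function attachComment(src, comment, precedingNode, followingNode) {
  const lineStart = src.lastIndexOf("\n", comment.start) + 1;
  const ownLine = src.slice(lineStart, comment.start).trim() === "";
  if (ownLine && followingNode) {
    return { node: followingNode, placement: "before" };
  }
  if (!ownLine && precedingNode) {
    return { node: precedingNode, placement: "after" };
  }
  // Fall back to a "dangling" attachment on the enclosing node.
  return { node: null, placement: "dangling" };
}
```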
That's really cool! I suspect it's going to take me a while to figure this all out, but at least I don't have to go writing hard-to-maintain patches against C code!
What you have to do is include a `comments` field in the object returned from the parser: an array of comments with their positions, and it'll Just Work™. You can see how I did it for Python: https://github.com/prettier/prettier-python/pull/9
When you find cases where it's not adding comments quite right, you can dig into it then.
Is there a particular format that the AST needs to adhere to? I'm trying to figure out how prettier knows what the comments relate to. In prettier-python there seem to be `start`/`end` values on the comments at least (possibly on every node?); in the GraphQL parser it refers to `line` and `column`. In the PG AST parser we get location for very few AST nodes, in the form of a byte offset from the beginning of the file; e.g.

select 'hi';

gives the location of the value `'hi'` only:
[ { SelectStmt:
{ targetList:
[ { ResTarget:
{ val: { A_Const: { val: { String: { str: 'hi' } }, location: 7 } },
location: 7 } } ], // < Location here
op: 0 } } ]
The more complex document:
create function foo() returns text as $$ select 'hi'; $$ language sql;
create function foo() returns text language plv8 as $$ var a = 1; var b = 2; return a + " + " + b+"="+(((a + b))); $$;
create function foo() returns text language plv8 as $$
var a = 1;
var b = 2;
return a + " + " +
b+"="+(((a + b)));
$$;
gives the location of the return types (`text`) and that is all.
[ { CreateFunctionStmt:
{ funcname: [ { String: { str: 'foo' } } ],
returnType:
{ TypeName:
{ names: [ { String: { str: 'text' } } ],
typemod: -1,
location: 30 } }, // < Location here
options:
[ { DefElem:
{ defname: 'as',
arg: [ { String: { str: ' select \'hi\'; ' } } ],
defaction: 0 } },
{ DefElem:
{ defname: 'language',
arg: { String: { str: 'sql' } },
defaction: 0 } } ] } },
{ CreateFunctionStmt:
{ funcname: [ { String: { str: 'foo' } } ],
returnType:
{ TypeName:
{ names: [ { String: { str: 'text' } } ],
typemod: -1,
location: 101 } }, // < Location here
options:
[ { DefElem:
{ defname: 'language',
arg: { String: { str: 'plv8' } },
defaction: 0 } },
{ DefElem:
{ defname: 'as',
arg:
[ { String:
{ str: ' var a = 1; var b = 2; return a + " + " + b+"="+(((a + b))); ' } } ],
defaction: 0 } } ] } },
{ CreateFunctionStmt:
{ funcname: [ { String: { str: 'foo' } } ],
returnType:
{ TypeName:
{ names: [ { String: { str: 'text' } } ],
typemod: -1,
location: 220 } }, // < Location here
options:
[ { DefElem:
{ defname: 'language',
arg: { String: { str: 'plv8' } },
defaction: 0 } },
{ DefElem:
{ defname: 'as',
arg:
[ { String:
{ str: '\nvar a = 1;\n var b = 2;\n return a + " + " +\nb+"="+(((a + b)));\n' } } ],
defaction: 0 } } ] } } ]
I'm assuming I'm going to have to mod the parser to add location information to more nodes in order to figure out where to attach the comments?
Right now the location extraction code is hardcoded here: https://github.com/prettier/prettier/blob/master/src/common/util.js#L241-L298
We need to move it to the plugin system, if you are interested in doing it that would be great!
Exciting news! I did some digging and it turns out that although the current version of `pg-query-native` does not output statement position/length, the latest versions of `libpg_query` do! And adding support to `pg-query-native` is just a case of:

1. Check out `libpg_query` and then run `MACOSX_DEPLOYMENT_TARGET="10.7" make` to build the latest binary for OS X (we also need to make one for Linux)
2. Check out `node-pg-query-native` and overwrite `libpg_query/linux/libpg_query.a` and `libpg_query/include/pg_query.h` with the `libpg_query.a` and `pg_query.h` that were generated in step 1
3. Run `npm install` to install the deps and link against the new library

Note that running `npm test` at this point will fail because there's an extra level of indirection now. Lines like:
assert.equal(typeof query.parse("select 1").query[0].SelectStmt, "object");
must now have `RawStmt.stmt` injected, like this:
assert.equal(typeof query.parse("select 1").query[0].RawStmt.stmt.SelectStmt, "object");
Then the tests will pass again, and voilà!
Note that statement positions in Postgres are determined by the locations of the semicolons, so `select 1` won't have any, but `select 1;` will. Note also that the first statement does not get a `stmt_location`; you can assume that it's zero.
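A plugin could turn those fields into absolute statement ranges along these lines (a hypothetical helper, assuming the `RawStmt` shape the parser emits, with the first statement's missing `stmt_location` defaulted to zero as described):

```javascript
// Compute an absolute [start, end) byte range for each RawStmt in the
// parse output, so comments can later be matched to statements.
function statementRanges(rawStmts) {
  return rawStmts.map(({ RawStmt }) => {
    const start = RawStmt.stmt_location || 0; // first statement omits it
    return { start, end: start + RawStmt.stmt_len };
  });
}
```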
Here's an example of the generated AST for `"\n\n\nselect 1; select 2;\nSelect 3;"`:
{ query:
[ { RawStmt:
{ stmt:
{ SelectStmt:
{ targetList:
[ { ResTarget:
{ val: { A_Const: { val: { Integer: { ival: 1 } }, location: 8 } },
location: 8 } } ],
op: 0 } },
stmt_len: 9 } },
{ RawStmt:
{ stmt:
{ SelectStmt:
{ targetList:
[ { ResTarget:
{ val: { A_Const: { val: { Integer: { ival: 2 } }, location: 18 } },
location: 18 } } ],
op: 0 } },
stmt_location: 10,
stmt_len: 9 } },
{ RawStmt:
{ stmt:
{ SelectStmt:
{ targetList:
[ { ResTarget:
{ val: { A_Const: { val: { Integer: { ival: 3 } }, location: 28 } },
location: 28 } } ],
op: 0 } },
stmt_location: 20,
stmt_len: 9 } } ],
stderr: '' }
Given this, we just need to write a comment scanner and try to attach the comments to the relevant places.
Here's the AST for
create table foo (
id int, -- no primary key because I just want a reason for a comment
bar text, -- nullable
baz text not null -- Baz never goes away
);
{ query:
[ { RawStmt:
{ stmt:
{ CreateStmt:
{ relation:
{ RangeVar: { relname: 'foo', inh: true, relpersistence: 'p', location: 13 } },
tableElts:
[ { ColumnDef:
{ colname: 'id',
typeName:
{ TypeName:
{ names:
[ { String: { str: 'pg_catalog' } },
{ String: { str: 'int4' } } ],
typemod: -1,
location: 24 } },
is_local: true,
location: 21 } },
{ ColumnDef:
{ colname: 'bar',
typeName:
{ TypeName:
{ names: [ { String: { str: 'text' } } ],
typemod: -1,
location: 96 } },
is_local: true,
location: 92 } },
{ ColumnDef:
{ colname: 'baz',
typeName:
{ TypeName:
{ names: [ { String: { str: 'text' } } ],
typemod: -1,
location: 120 } },
is_local: true,
constraints: [ { Constraint: { contype: 1, location: 125 } } ],
location: 116 } } ],
oncommit: 0 } },
stmt_len: 160 } } ],
stderr: '' }
Did you get comments working, by chance?
@benjie you may want to use my parser, which uses a Postgres 10 libpg_query out of the box! It's quite a way ahead of the main fork, as you can see here (currently 128 commits): https://github.com/pyramation/pg-query-parser/network
I've added a TON of statement types and also completely updated the testing system and made it easy to test different use cases. It has a series of SQL files to test against, and compares the results using ASTs, so it's very extensible as things are needed by different users.
I was looking at that only a couple days ago as it happens @pyramation - great work!
I'm thinking at the moment that we may need to handle comments via a separate parser, as azz/vjeux commented; but I need to analyse what that parser needs to understand. E.g. `select '/* something */';` and `select $$ something /* */$$;` should not parse those string contents as comments, so it needs a basic understanding of some of the PostgreSQL syntax. Determining exactly how much of the syntax needs this is a challenge. It seems that the PG query tokeniser has no interest in representing comments (even though it counts comment depth) so we're going to have to roll our own. Adding it to PostgreSQL's parser seems attractive, but I fear maintaining the patch is going to be a big pain.
This would be more of a stopgap for comments, but what if a separate comment parser roughly extracted comments, then scanned the AST for partials that matched existing code? It’s not perfect, but it could be a start.
Yeah; that's what @azz suggested above: effectively running two parsers, one for comments and one for the rest of the language. We'll have to be careful that the comment parser can also parse SQL strings (`$$`, `$_$`, etc.) so that it doesn't try to read the comments inside of function bodies (which may be in a different language), but I think it's quite viable. I've not looked into actually doing it yet though.
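The dollar-quoting concern could be handled with something like the following. This is a hypothetical, simplified sketch (`skipDollarQuoted` is invented for the example, and the tag pattern here omits digits, which PostgreSQL actually allows after a tag's first character):

```javascript
// Return the index just past a dollar-quoted string ($$...$$ or
// $tag$...$tag$) starting at position i, or i itself if src[i] does not
// begin one. A comment scanner would call this before treating "--" or
// "/*" as a comment opener, so function bodies are skipped wholesale.
function skipDollarQuoted(src, i) {
  const m = /^\$[A-Za-z_]*\$/.exec(src.slice(i));
  if (!m) return i;
  const tag = m[0];
  const close = src.indexOf(tag, i + tag.length);
  // An unterminated dollar quote swallows the rest of the input.
  return close === -1 ? src.length : close + tag.length;
}
```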
Currently all comments are dropped; e.g.
becomes:
If anyone feels like taking this on, one interesting fact about SQL comments is that, unlike JS (and C) comments, they nest. i.e. the following is completely ignored by the parser as a comment:
/*
outer comment
/* inner comment */
end of outer comment
*/
If the parser is able to drop comments, this means that it can parse them just fine. All prettier needs is the list of the comments and their location. It’s probably easier to tweak the parser to record them instead of dropping them fwiw.
I looked at having the PostgreSQL parser record the comments before, but it's way beyond me. If anyone wants to dig in I did a bunch of research last year that's right at the top of this thread (with links into the BISON stuff and the parser code). I even considered re-implementing their parser ourselves in one of the many JS parser implementations but decided that was going to be too much of a pain to maintain.
I think if we can maintain a patch on their parser that actually passes the comments through that would be our best bet, but I don't grok enough of the code to do this myself without sinking in significant time.
If there is any format issues you have that pgformatter doesn't do, I'd recommend filing issues here: https://github.com/darold/pgFormatter/issues as Darold is super fast! He knocked out about 7-8 issues I had with the formatter, and now I'm using it finally.
after that, I reached out to the vscode plugin maintainer, and he added a flag which pretty much makes the pgformat usable now https://github.com/bradymholt/vscode-pgFormatter/issues/13
Does it format the function bodies (plv8, plpython, etc?)
I think he made it so it doesn't format those and leaves them intact... which leads me to think there are two potential solutions.
I don't really have the expertise on this one to chip in (in a meaningful way), but I can explain why my use case would depend on this;
Many 'modern' [1], [2] SQL migration tools use comments to delineate the 'up' portion of a migration and the 'down' portion, meaning that if comments were to be stripped we'd lose the identifiers the tools need to run. You might all be aware of this anyway, but I suspect a lot of users of these sorts of tools would be interested in adopting this into their workflow when complete!
This isn't to rush you guys; this is open source and clearly a complex issue. I'm just calling out why it might be an important one for many people. Thanks for keeping up the effort!
Yeah, comments are vital - that’s the thing to solve before we can continue work on this project.
:grin:
👀
I spent the weekend experimenting with different approaches to prettifying postgres, then realized: comments! And thus found my way here.
I (and, in fact, the rest of the company I work for!) am very interested in figuring this out. I don't think I know enough C to maintain a major parser rewrite, but I'm happy to dig into the pg scan source and maybe learn some BISON. A cursory glance makes me hopeful: all of the symbols for comments and extended comments are already there, so it seems like the patch would be a matter of returning a proper symbol for comments and updating the grammar to match. (I realize, on rereading this thread, that you linked to all of this above! Whoops.)
While I'm here: I'd love for this to be TypeScript, too :)
That would certainly be welcome! A postgres parser that supports comments would be an excellent resource outside of this project, there are many uses for it 😁
Status update; I spent a few hours working on this this weekend, I've achieved:
I decided before writing a comment parser I'd like to see what prettier needs, so I tried adding a fake node like the reverse of this commit:
https://github.com/benjie/prettier-plugin-pg/commit/348f139c20665c352f64a1485cf3627d3aa5034c
This produces the error:
FAIL test tests/BEGIN/jsfmt.spec.js
● Test suite failed to run
Comment "-- Hello!" was not printed. Please report this error!
109 |
110 | function prettyprint(src, filename, options) {
> 111 | return prettier.format(
| ^
112 | src,
113 | Object.assign(
114 | {
at node_modules/prettier/index.js:14755:13
at Array.forEach (<anonymous>)
at ensureAllCommentsPrinted (node_modules/prettier/index.js:14753:15)
at coreFormat (node_modules/prettier/index.js:14801:3)
at format (node_modules/prettier/index.js:15019:75)
at formatWithCursor (node_modules/prettier/index.js:15035:12)
at node_modules/prettier/index.js:51620:12
at Object.format (node_modules/prettier/index.js:51640:12)
at prettyprint (tests_config/run_spec.js:111:19)
at tests_config/run_spec.js:37:22
at Array.forEach (<anonymous>)
at run_spec (tests_config/run_spec.js:24:27)
at Object.<anonymous> (tests/BEGIN/jsfmt.spec.js:1:90)
I'm not sure what I need to do next to have this comment output in the results.
Thank you for the update! I'm really looking forward to SQL support in Prettier. Inline SQL (the `sql` tag in JavaScript/TypeScript) is the only part of my code base that is formatted manually.
Here are some ways you could help, if you were interested:
@vjeux
(I've tried to write this so you don't need to read the above.)
If you were to check out master and git revert 348f139c20665c352f64a1485cf3627d3aa5034c
(https://github.com/benjie/prettier-plugin-pg/commit/348f139c20665c352f64a1485cf3627d3aa5034c) you'd get the following test failure when you run the tests:
Comment "-- Hello!" was not printed. Please report this error!
109 |
110 | function prettyprint(src, filename, options) {
> 111 | return prettier.format(
| ^
112 | src,
113 | Object.assign(
114 | {
at node_modules/prettier/index.js:14755:13
at Array.forEach (<anonymous>)
at ensureAllCommentsPrinted (node_modules/prettier/index.js:14753:15)
at coreFormat (node_modules/prettier/index.js:14801:3)
at format (node_modules/prettier/index.js:15019:75)
at formatWithCursor (node_modules/prettier/index.js:15035:12)
at node_modules/prettier/index.js:51620:12
at Object.format (node_modules/prettier/index.js:51640:12)
at prettyprint (tests_config/run_spec.js:111:19)
at tests_config/run_spec.js:37:22
at Array.forEach (<anonymous>)
at run_spec (tests_config/run_spec.js:24:27)
at Object.<anonymous> (tests/BEGIN/jsfmt.spec.js:1:90)
You mentioned above that adding to `ast.comments` should Just Work™; are there some other prerequisites for this? Should I be doing something in my `print` function to actually output the comments?
Please note:

- `Document`, the root node type, contains a list of `RawStmt`s, which then contain other AST nodes
- those deeper AST nodes don't have location info, though the outermost node (`RawStmt`) does
- the nodes use `{ [nodeType]: { ...details } }` as a format; this is just how they come from Postgres (it seems prettier generally expects a flatter `{ type: nodeType, ...details }` shape)

I tried source diving the language-graphql support, but it wasn't obvious to me what I was missing for PostgreSQL. Any light you can shed would be very helpful :pray:
You need to export a `printComment` function like this:
https://github.com/prettier/prettier/blob/master/src/language-graphql/printer-graphql.js#L701
And likely a `canAttachComment` function, defined just below.
@vjeux I have those: https://github.com/benjie/prettier-plugin-pg/blob/master/src/index.ts#L96
Perhaps my canAttachComment returns true for too few nodes? The RawStmt nodes should span the document and I have location data for them so I figured they would be a good choice.
After a little more digging, I think my suspicion about lack of location information on deeper nodes being the issue is likely; prettier determines if a comment is leading, trailing or dangling like this:
} else if (precedingNode) {
  addTrailingComment(precedingNode, comment);
} else if (followingNode) {
  addLeadingComment(followingNode, comment);
} else if (enclosingNode) {
  addDanglingComment(enclosingNode, comment);
} else {
  // There are no nodes, let's attach it to the root of the ast
  /* istanbul ignore next */
  addDanglingComment(ast, comment);
}
All of our comments are effectively detected as "dangling" since they're always within the bounds of a RawStmt. And these dangling comments are not output, presumably because prettier cannot figure out where to attach them. If I manually edit the location data for the RawStmt to start after the comment then the comment can be detected as a "leading" comment, and prettier then outputs it correctly.
I don't know how to solve this since we don't have sufficient location data for each AST node, and we don't want to just move all comments to the start/end of a statement as this would lose context, e.g.:
create table my_table (
id uuid primary key, -- Make it hard to guess the number of records
...
username text, -- Optional!
...
);
would become
-- Make it hard to guess the number of records
-- Optional!
create table my_table (
id uuid primary key,
...
username text,
...
);
I'm not sure if this is solvable with the PostgreSQL parser as it is. We might have to write our own, and that's a significant undertaking!
Normally BISON gives you first line/first column, last line/last column:
https://www.gnu.org/software/bison/manual/bison.html#Location-Type
But unfortunately PostgreSQL saw fit to throw this information away:
This is frustrating, because PostgreSQL parses `create table foo (` and `create table "foo" (` identically, so we cannot know which the user entered, nor determine the length of the node for ourselves.
Potentially we could look at modifying this in the PostgreSQL parser, but I think that might be a significant undertaking, and keeping it up to date with new PostgreSQL versions is going to be a pain.
If Postgres parses `create table foo` the same as `create table "foo"`, shouldn't prettier convert the latter to the former anyway?
@rattrayalex The issue relates to knowing the length of the node (so that we know where to insert the comments) rather than choosing what the output should be.
Right, of course – my apologies. Yeah, hard to imagine working through this without some modifications to the parser. I wonder if the postgres community would be willing to help out. Perhaps they could provide a flag to output more detailed information, including comments and node locations.
It's a big ask with the parser as it is; I think a custom parser is the way forward.
It seems that the native PostgreSQL query parser simply treats comments as whitespace (and drops them). See: https://github.com/lfittl/libpg_query/issues/15
To support comments we're either going to have to do something ugly with regexps (ick), write our own AST parser (ugh), or dive into the C code of libpg_query, maybe starting around here: https://github.com/lfittl/libpg_query/blob/cd9e4b33cef948461c0354a81a23082b16c6a8d1/src/postgres/src_backend_parser_parser.c#L43-L70
This is a major issue. Without comments, we shouldn't really consider progressing.