benjie opened this issue 6 years ago
Worth a look is the Postgres BISON rules:
https://github.com/postgres/postgres/blob/master/src/backend/parser/gram.y
Also of interest, particularly these lines of the lexer config:

Definition of comment types:
https://github.com/postgres/postgres/blob/f033462d8f77c40b7d6b33c5116e50118fb4699d/src/backend/parser/scan.l#L327-L328
https://github.com/postgres/postgres/blob/f033462d8f77c40b7d6b33c5116e50118fb4699d/src/backend/parser/scan.l#L202

Parsing C-style comments; absorbing SQL comments into whitespace.
I'm not a db guy, but if comments are just `--` and `/* */` it should be simple enough to write a comment parser that runs in tandem with the actual language parser; you'd then attach the comments to `root.comments` on the AST and it should be printable.
cc @vjeux who knows a lot more about Prettier comments.
@azz is correct. Note that you just need a tokenizer of the language to extract comments. For example, you need to make sure that comments inside of strings are not treated as comments. But writing a tokenizer that extracts comments is trivial compared to a full parser.
Once you have all the comments and location, you can feed it to prettier and you’ll get a relatively good heuristic out of the box to print them
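For example, a comment-extracting scanner might look roughly like this. This is a hypothetical, simplified sketch (`extractComments` and the comment object shape are made up for illustration): it handles `--` line comments, nested `/* */` block comments, and single-quoted strings, but not dollar-quoting or `E''` escapes, which a real SQL scanner would also need.

```javascript
// Walk the source once, skipping string literals so their contents are
// never mistaken for comments, and record each comment with its range.
function extractComments(src) {
  const comments = [];
  let i = 0;
  while (i < src.length) {
    const two = src.slice(i, i + 2);
    if (src[i] === "'") {
      // Skip a single-quoted string ('' is an escaped quote).
      i++;
      while (i < src.length) {
        if (src[i] === "'" && src[i + 1] === "'") i += 2;
        else if (src[i] === "'") { i++; break; }
        else i++;
      }
    } else if (two === "--") {
      // Line comment: runs to end of line.
      const start = i;
      while (i < src.length && src[i] !== "\n") i++;
      comments.push({ type: "line", value: src.slice(start, i), start, end: i });
    } else if (two === "/*") {
      // Block comment: SQL block comments nest, so track depth.
      const start = i;
      let depth = 0;
      while (i < src.length) {
        if (src.slice(i, i + 2) === "/*") { depth++; i += 2; }
        else if (src.slice(i, i + 2) === "*/") { depth--; i += 2; if (depth === 0) break; }
        else i++;
      }
      comments.push({ type: "block", value: src.slice(start, i), start, end: i });
    } else {
      i++;
    }
  }
  return comments;
}
```

The `start`/`end` offsets are what prettier needs to locate each comment relative to the AST nodes.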
Awesome; can you point me at a simple-ish example? I’m also interested to know how the comments might work mid-statement - do we just attach them to the end of the statement or is there a way to maintain their association (e.g. with a specific column in a big “create table” statement)?
The logic to figure out where to attach nodes is here: https://github.com/prettier/prettier/blob/master/src/main/comments.js#L183
Each comment must be attached to an AST node with a mode of either "before", "after", or "inside". Then when we print each node, we automatically print the associated comments in the right place. So the print code is mostly free of comment handling.
In order to find where to attach comments, we use heuristics. For example if the comment is on its own line, we attach it "before" the next ast node. If it's the last token of the line and there's something before, we attach it "after" the previous ast node, etc...
Those simple heuristics work most of the time, but people are imaginative in where they put comments, so we have custom logic to figure out where to put them based on the ast structure.
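The own-line/trailing heuristic described above might look roughly like this. This is an illustrative sketch, not prettier's actual implementation; `attachComment` and the node/comment shapes are invented for the example.

```javascript
// Decide where a comment attaches, given the nodes immediately before and
// after it in source order. A comment alone on its line goes "before" the
// next node; a comment trailing other text goes "after" the previous node.
function attachComment(src, comment, precedingNode, followingNode) {
  const lineStart = src.lastIndexOf("\n", comment.start) + 1;
  const ownLine = src.slice(lineStart, comment.start).trim() === "";
  if (ownLine && followingNode) {
    return { node: followingNode, placement: "before" };
  }
  if (!ownLine && precedingNode) {
    return { node: precedingNode, placement: "after" };
  }
  // Fall back to a "dangling" attachment on the enclosing node.
  return { node: null, placement: "dangling" };
}
```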
That's really cool! I suspect it's going to take me a while to figure this all out, but at least I don't have to go writing hard-to-maintain patches against C code!
What you have to do is include a `comments` field in the object returned from the parser: an array of comments with their positions, and it'll Just Work™. You can see how I did it for Python: https://github.com/prettier/prettier-python/pull/9
When you find cases where it's not adding comments quite right, you can dig into it then.
Is there a particular format that the AST needs to adhere to? I'm trying to figure out how prettier knows what the comments relate to. In prettier-python there seem to be `start`/`end` values on the comments at least (possibly on every node?); in the GraphQL parser it refers to `line` and `column`. In the PG AST parser we get location for very few AST nodes, in the form of a byte offset from the beginning of the file; e.g.

select 'hi';

gives the location of the value `'hi'` only:
[ { SelectStmt:
{ targetList:
[ { ResTarget:
{ val: { A_Const: { val: { String: { str: 'hi' } }, location: 7 } },
location: 7 } } ], // < Location here
op: 0 } } ]
The more complex document:
create function foo() returns text as $$ select 'hi'; $$ language sql;
create function foo() returns text language plv8 as $$ var a = 1; var b = 2; return a + " + " + b+"="+(((a + b))); $$;
create function foo() returns text language plv8 as $$
var a = 1;
var b = 2;
return a + " + " +
b+"="+(((a + b)));
$$;
gives the location of the return types (`text`) and that is all.
[ { CreateFunctionStmt:
{ funcname: [ { String: { str: 'foo' } } ],
returnType:
{ TypeName:
{ names: [ { String: { str: 'text' } } ],
typemod: -1,
location: 30 } }, // < Location here
options:
[ { DefElem:
{ defname: 'as',
arg: [ { String: { str: ' select \'hi\'; ' } } ],
defaction: 0 } },
{ DefElem:
{ defname: 'language',
arg: { String: { str: 'sql' } },
defaction: 0 } } ] } },
{ CreateFunctionStmt:
{ funcname: [ { String: { str: 'foo' } } ],
returnType:
{ TypeName:
{ names: [ { String: { str: 'text' } } ],
typemod: -1,
location: 101 } }, // < Location here
options:
[ { DefElem:
{ defname: 'language',
arg: { String: { str: 'plv8' } },
defaction: 0 } },
{ DefElem:
{ defname: 'as',
arg:
[ { String:
{ str: ' var a = 1; var b = 2; return a + " + " + b+"="+(((a + b))); ' } } ],
defaction: 0 } } ] } },
{ CreateFunctionStmt:
{ funcname: [ { String: { str: 'foo' } } ],
returnType:
{ TypeName:
{ names: [ { String: { str: 'text' } } ],
typemod: -1,
location: 220 } }, // < Location here
options:
[ { DefElem:
{ defname: 'language',
arg: { String: { str: 'plv8' } },
defaction: 0 } },
{ DefElem:
{ defname: 'as',
arg:
[ { String:
{ str: '\nvar a = 1;\n var b = 2;\n return a + " + " +\nb+"="+(((a + b)));\n' } } ],
defaction: 0 } } ] } } ]
I'm assuming I'm going to have to mod the parser to add location information to more nodes in order to figure out where to attach the comments?
Right now the location extraction code is hardcoded here: https://github.com/prettier/prettier/blob/master/src/common/util.js#L241-L298
We need to move it to the plugin system, if you are interested in doing it that would be great!
Exciting news! I did some digging and it turns out that although the current version of `pg-query-native` does not output statement position/length, the latest versions of `libpg_query` do! And adding support to `pg-query-native` is just a case of:

1. Check out `libpg_query` and then run `MACOSX_DEPLOYMENT_TARGET="10.7" make` to build the latest binary for OS X (we also need to make one for Linux)
2. Check out `node-pg-query-native` and overwrite `libpg_query/linux/libpg_query.a` and `libpg_query/include/pg_query.h` with the `libpg_query.a` and `pg_query.h` that were generated in step 1
3. Run `npm install` to install the deps and link against the new library

Note that running `npm test` at this point will fail because there's an extra level of indirection now. Lines like:
assert.equal(typeof query.parse("select 1").query[0].SelectStmt, "object");
must now have `RawStmt.stmt` injected, like this:
assert.equal(typeof query.parse("select 1").query[0].RawStmt.stmt.SelectStmt, "object");
Then the tests will pass again, and voilà!
Note that statement positions in Postgres are determined by the locations of the semicolons, so `select 1` won't have any, but `select 1;` will. Note also that the first statement does not get a `stmt_location`; you can assume that it's zero.
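A plugin could turn those fields into absolute statement ranges along these lines (a hypothetical helper, assuming the `RawStmt` shape the parser emits, with the first statement's missing `stmt_location` defaulted to zero as described):

```javascript
// Compute an absolute [start, end) byte range for each RawStmt in the
// parse output, so comments can later be matched to statements.
function statementRanges(rawStmts) {
  return rawStmts.map(({ RawStmt }) => {
    const start = RawStmt.stmt_location || 0; // first statement omits it
    return { start, end: start + RawStmt.stmt_len };
  });
}
```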
Here's an example of the generated AST for `"\n\n\nselect 1; select 2;\nSelect 3;"`:
{ query:
[ { RawStmt:
{ stmt:
{ SelectStmt:
{ targetList:
[ { ResTarget:
{ val: { A_Const: { val: { Integer: { ival: 1 } }, location: 8 } },
location: 8 } } ],
op: 0 } },
stmt_len: 9 } },
{ RawStmt:
{ stmt:
{ SelectStmt:
{ targetList:
[ { ResTarget:
{ val: { A_Const: { val: { Integer: { ival: 2 } }, location: 18 } },
location: 18 } } ],
op: 0 } },
stmt_location: 10,
stmt_len: 9 } },
{ RawStmt:
{ stmt:
{ SelectStmt:
{ targetList:
[ { ResTarget:
{ val: { A_Const: { val: { Integer: { ival: 3 } }, location: 28 } },
location: 28 } } ],
op: 0 } },
stmt_location: 20,
stmt_len: 9 } } ],
stderr: '' }
Given this, we just need to write a comment scanner and try to attach the comments to the relevant places.
Here's the AST for
create table foo (
id int, -- no primary key because I just want a reason for a comment
bar text, -- nullable
baz text not null -- Baz never goes away
);
{ query:
[ { RawStmt:
{ stmt:
{ CreateStmt:
{ relation:
{ RangeVar: { relname: 'foo', inh: true, relpersistence: 'p', location: 13 } },
tableElts:
[ { ColumnDef:
{ colname: 'id',
typeName:
{ TypeName:
{ names:
[ { String: { str: 'pg_catalog' } },
{ String: { str: 'int4' } } ],
typemod: -1,
location: 24 } },
is_local: true,
location: 21 } },
{ ColumnDef:
{ colname: 'bar',
typeName:
{ TypeName:
{ names: [ { String: { str: 'text' } } ],
typemod: -1,
location: 96 } },
is_local: true,
location: 92 } },
{ ColumnDef:
{ colname: 'baz',
typeName:
{ TypeName:
{ names: [ { String: { str: 'text' } } ],
typemod: -1,
location: 120 } },
is_local: true,
constraints: [ { Constraint: { contype: 1, location: 125 } } ],
location: 116 } } ],
oncommit: 0 } },
stmt_len: 160 } } ],
stderr: '' }
Did you get comments working, by chance?
@benjie you may want to use my parser, which uses a Postgres 10 libpg_query out of the box! It's quite a way ahead of the main fork, as you can see here (currently 128 commits): https://github.com/pyramation/pg-query-parser/network
I've added a TON of statement types and also completely updated the testing system and made it easy to test different use cases. It has a series of SQL files to test against, and compares the results using ASTs, so it's very extensible as things are needed by different users.
I was looking at that only a couple days ago as it happens @pyramation - great work!
I'm thinking at the moment that we may need to handle comments via a separate parser, as azz/vjeux commented; but I need to analyse what that parser needs to understand. E.g. `select '/* something */';` and `select $$ something /* */$$;` should not parse those string contents as comments, so it needs a basic understanding of some of the PostgreSQL syntax. Determining exactly how much of the syntax needs this is a challenge. It seems that the PG query tokeniser has no interest in representing comments (even though it counts comment depth) so we're going to have to roll our own. Adding it to PostgreSQL's parser seems attractive, but I fear maintaining the patch is going to be a big pain.
This would be more of a stopgap for comments, but what if a separate comment parser roughly extracted comments, then scanned the AST for partials that matched existing code? It’s not perfect, but it could be a start.
Yeah; that's what @azz suggested above: effectively running two parsers, one for comments and one for the rest of the language. We'll have to be careful that the comment parser can also parse SQL strings (`$$`, `$_$`, etc.) so that it doesn't try to read the comments inside of function bodies (which may be in a different language), but I think it's quite viable. I've not looked into actually doing it yet though.
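The dollar-quoting concern could be handled with something like the following. This is a hypothetical, simplified sketch (`skipDollarQuoted` is invented for the example, and the tag pattern here omits digits, which PostgreSQL actually allows after a tag's first character):

```javascript
// Return the index just past a dollar-quoted string ($$...$$ or
// $tag$...$tag$) starting at position i, or i itself if src[i] does not
// begin one. A comment scanner would call this before treating "--" or
// "/*" as a comment opener, so function bodies are skipped wholesale.
function skipDollarQuoted(src, i) {
  const m = /^\$[A-Za-z_]*\$/.exec(src.slice(i));
  if (!m) return i;
  const tag = m[0];
  const close = src.indexOf(tag, i + tag.length);
  // An unterminated dollar quote swallows the rest of the input.
  return close === -1 ? src.length : close + tag.length;
}
```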
Currently all comments are dropped; e.g.
becomes:
If anyone feels like taking this on, one interesting fact about SQL comments is that, unlike JS (and C) comments, they nest. i.e. the following is completely ignored by the parser as a comment:
/*
outer comment
/* inner comment */
end of outer comment
*/
If the parser is able to drop comments, this means that it can parse them just fine. All prettier needs is the list of the comments and their location. It’s probably easier to tweak the parser to record them instead of dropping them fwiw.
I looked at having the PostgreSQL parser record the comments before, but it's way beyond me. If anyone wants to dig in I did a bunch of research last year that's right at the top of this thread (with links into the BISON stuff and the parser code). I even considered re-implementing their parser ourselves in one of the many JS parser implementations but decided that was going to be too much of a pain to maintain.
I think if we can maintain a patch on their parser that actually passes the comments through that would be our best bet, but I don't grok enough of the code to do this myself without sinking in significant time.
If there is any format issues you have that pgformatter doesn't do, I'd recommend filing issues here: https://github.com/darold/pgFormatter/issues as Darold is super fast! He knocked out about 7-8 issues I had with the formatter, and now I'm using it finally.
after that, I reached out to the vscode plugin maintainer, and he added a flag which pretty much makes the pgformat usable now https://github.com/bradymholt/vscode-pgFormatter/issues/13
Does it format the function bodies (plv8, plpython, etc?)
I think he made it so it doesn't format those and leaves them intact... which leads me to think there are two potential solutions.
I don't really have the expertise on this one to chip in (in a meaningful way), but I can explain why my use case would depend on this;
Many 'modern' [1], [2] SQL migration tools use comments to delineate the 'up' portion of a migration and the 'down' portion, meaning that if comments were to be stripped we'd lose the identifiers the tools need to run. You might all be aware of this anyway, but I suspect a lot of users of these sorts of tools would be interested in adopting this into their workflow when complete!
This isn't to rush you guys; this is open source and clearly a complex issue. I'm just calling out why it might be an important one for many people. Thanks for keeping up the effort!
Yeah, comments are vital - that’s the thing to solve before we can continue work on this project.
:grin:
👀
I spent the weekend experimenting with different approaches to prettifying postgres, then realized: comments! And thus found my way here.
I (and, in fact, the rest of the company I work for!) am very interested in figuring this out. I don't think I know enough C to maintain a major parser rewrite, but I'm happy to dig into the pg scan source and maybe learn some BISON. A cursory glance makes me hopeful: all of the symbols for comments and extended comments are already there, so it seems like the patch would be a matter of returning a proper symbol for comments and updating the grammar to match. (I realize, on rereading this thread, that you linked to all of this above! Whoops.)
While I'm here: I'd love for this to be TypeScript, too :)
That would certainly be welcome! A postgres parser that supports comments would be an excellent resource outside of this project, there are many uses for it 😁
Status update; I spent a few hours working on this this weekend, I've achieved:
I decided before writing a comment parser I'd like to see what prettier needs, so I tried adding a fake node like the reverse of this commit:
https://github.com/benjie/prettier-plugin-pg/commit/348f139c20665c352f64a1485cf3627d3aa5034c
This produces the error:
FAIL test tests/BEGIN/jsfmt.spec.js
● Test suite failed to run
Comment "-- Hello!" was not printed. Please report this error!
109 |
110 | function prettyprint(src, filename, options) {
> 111 | return prettier.format(
| ^
112 | src,
113 | Object.assign(
114 | {
at node_modules/prettier/index.js:14755:13
at Array.forEach (<anonymous>)
at ensureAllCommentsPrinted (node_modules/prettier/index.js:14753:15)
at coreFormat (node_modules/prettier/index.js:14801:3)
at format (node_modules/prettier/index.js:15019:75)
at formatWithCursor (node_modules/prettier/index.js:15035:12)
at node_modules/prettier/index.js:51620:12
at Object.format (node_modules/prettier/index.js:51640:12)
at prettyprint (tests_config/run_spec.js:111:19)
at tests_config/run_spec.js:37:22
at Array.forEach (<anonymous>)
at run_spec (tests_config/run_spec.js:24:27)
at Object.<anonymous> (tests/BEGIN/jsfmt.spec.js:1:90)
I'm not sure what I need to do next to have this comment output in the results.
Thank you for the update! I'm really looking forward to SQL support in Prettier. Inline SQL (the `sql` tag in JavaScript/TypeScript) is the only part of my code base that is formatted manually.
Here are some ways you could help, if you were interested:
@vjeux
(I've tried to write this so you don't need to read the above.)
If you were to check out master and git revert 348f139c20665c352f64a1485cf3627d3aa5034c
(https://github.com/benjie/prettier-plugin-pg/commit/348f139c20665c352f64a1485cf3627d3aa5034c) you'd get the following test failure when you run the tests:
Comment "-- Hello!" was not printed. Please report this error!
109 |
110 | function prettyprint(src, filename, options) {
> 111 | return prettier.format(
| ^
112 | src,
113 | Object.assign(
114 | {
at node_modules/prettier/index.js:14755:13
at Array.forEach (<anonymous>)
at ensureAllCommentsPrinted (node_modules/prettier/index.js:14753:15)
at coreFormat (node_modules/prettier/index.js:14801:3)
at format (node_modules/prettier/index.js:15019:75)
at formatWithCursor (node_modules/prettier/index.js:15035:12)
at node_modules/prettier/index.js:51620:12
at Object.format (node_modules/prettier/index.js:51640:12)
at prettyprint (tests_config/run_spec.js:111:19)
at tests_config/run_spec.js:37:22
at Array.forEach (<anonymous>)
at run_spec (tests_config/run_spec.js:24:27)
at Object.<anonymous> (tests/BEGIN/jsfmt.spec.js:1:90)
You mentioned above that adding to `ast.comments` should Just Work™; are there some other prerequisites for this? Should I be doing something in my `print` function to actually output the comments?
Please note:

- `Document`, the root node type, contains a list of `RawStmt`s, which then contain other AST nodes
- those deeper AST nodes don't have location info, though the outermost node (`RawStmt`) does
- the nodes use `{ [nodeType]: { ...details } }` as a format; this is just how they come from Postgres (it seems prettier generally expects a flatter `{ type: nodeType, ...details }` shape)

I tried source diving the language-graphql support, but it wasn't obvious to me what I was missing for PostgreSQL. Any light you can shed would be very helpful :pray:
You need to export a `printComment` function like this:
https://github.com/prettier/prettier/blob/master/src/language-graphql/printer-graphql.js#L701
And likely a `canAttachComment` function, defined just below.
@vjeux I have those: https://github.com/benjie/prettier-plugin-pg/blob/master/src/index.ts#L96
Perhaps my canAttachComment returns true for too few nodes? The RawStmt nodes should span the document and I have location data for them so I figured they would be a good choice.
After a little more digging, I think my suspicion about lack of location information on deeper nodes being the issue is likely; prettier determines if a comment is leading, trailing or dangling like this:
} else if (precedingNode) {
  addTrailingComment(precedingNode, comment);
} else if (followingNode) {
  addLeadingComment(followingNode, comment);
} else if (enclosingNode) {
  addDanglingComment(enclosingNode, comment);
} else {
  // There are no nodes, let's attach it to the root of the ast
  /* istanbul ignore next */
  addDanglingComment(ast, comment);
}
All of our comments are effectively detected as "dangling" since they're always within the bounds of a RawStmt. And these dangling comments are not output, presumably because prettier cannot figure out where to attach them. If I manually edit the location data for the RawStmt to start after the comment then the comment can be detected as a "leading" comment, and prettier then outputs it correctly.
I don't know how to solve this since we don't have sufficient location data for each AST node, and we don't want to just move all comments to the start/end of a statement as this would lose context, e.g.:
create table my_table (
id uuid primary key, -- Make it hard to guess the number of records
...
username text, -- Optional!
...
);
would become
-- Make it hard to guess the number of records
-- Optional!
create table my_table (
id uuid primary key,
...
username text,
...
);
I'm not sure if this is solvable with the PostgreSQL parser as it is. We might have to write our own, and that's a significant undertaking!
Normally BISON gives you first line/first column, last line/last column:
https://www.gnu.org/software/bison/manual/bison.html#Location-Type
But unfortunately PostgreSQL saw fit to throw this information away:
This is frustrating, because PostgreSQL parses `create table foo (` and `create table "foo" (` identically, so we cannot know which the user entered, nor determine the length of the node for ourselves.
Potentially we could look at modifying this in the PostgreSQL parser, but I think that might be a significant undertaking, and keeping it up to date with new PostgreSQL versions is going to be a pain.
If Postgres parses `create table foo` the same as `create table "foo"`, shouldn't prettier convert the latter to the former anyway?
@rattrayalex The issue relates to knowing the length of the node (so that we know where to insert the comments) rather than choosing what the output should be.
Right, of course – my apologies. Yeah, hard to imagine working through this without some modifications to the parser. I wonder if the postgres community would be willing to help out. Perhaps they could provide a flag to output more detailed information, including comments and node locations.
It's a big ask with the parser as it is; I think a custom parser is the way forward.
It seems that the native PostgreSQL query parser simply treats comments as whitespace (and drops them). See: https://github.com/lfittl/libpg_query/issues/15
To support comments we're either going to have to do something ugly with regexps (ick), write our own AST parser (ugh), or dive into the C code of libpg_query, maybe starting around here: https://github.com/lfittl/libpg_query/blob/cd9e4b33cef948461c0354a81a23082b16c6a8d1/src/postgres/src_backend_parser_parser.c#L43-L70
This is a major issue. Without comments, we shouldn't really consider progressing.