BNFC / bnfc

BNF Converter
http://bnfc.digitalgrammars.com/
586 stars 165 forks source link

Keywords `token` and `layout` are apparently not compatible #372

Open MatthewDaggitt opened 3 years ago

MatthewDaggitt commented 3 years ago

When using the parser generated by the grammar fragment

position token TokLet      {"let"};
position token TokIn       {"in"};

ELet.       Expr  ::= TokLet "{" [Expr] "}" TokIn Expr;

layout "let";
layout stop "in";

to parse:

test =
  let y : Real
      y = f x ! 0
  in y

I get syntax error at line 2, column 8 before 'y'.

If instead I use the grammar:

ELet.       Expr  ::= "let" "{" [Expr] "}" "in" Expr;

layout "let";
layout stop "in";

then everything parses fine. The main problem with this is that I think(?) it means that I'm unable to get position information for the "let" and "in" tokens.

andreasabel commented 3 years ago

The BNFC philosophy is that tokens have some content you are interested in, whereas keywords like "let" have none. So the latter are not represented in the AST, and consequently have no position information.

One could boost the layout mechanism by allowing also declared tokens as layout keywords, i.e.,

layout TokLet;
layout stop TokIn;

Beyond that, one could detect choice-less tokens such as TokLet and represent them as unit types in the AST...

In general, it would be nice to have BNFC produce a concrete syntax tree (CST) and a procedure to extract the AST from the CST. But this is a quite radical redesign.

MatthewDaggitt commented 3 years ago

Thanks for the reply @andreasabel. So, without making keywords such as let a token, what is the procedure to extract the location of the let expression in the source code for error reporting purposes whilst adhering to the BNFC philosophy? At the moment my main motivating example is trying to work around #371, i.e. throw a meaningful error with source code locations if the list of let bindings is empty.

However, I think I must be misunderstanding about how to go about extracting positions generally. For example, if one were implementing a language server one would need to get positions for parse errors to highlight the location where the parsing fell over. But the pX functions generated in the Par module return Either String a with the positions already converted to the String. Am I misunderstanding something or is it something that BNFC does not support?

andreasabel commented 3 years ago

But the pX functions generated in the Par module return Either String a with the positions already converted to the String.

I think the error reporting API also needs a overhaul. The error location shouldn't be mangled into the error message.

In general, BNFC in its present state should be thought of as a wizard generating a parser and printer for prototyping rather than professional use. In the worst case, you can get a happy parser from BNFC that sort of does the thing and then hand-edit it to your satisfaction.

That said, in general I am looking to make BNFC more professional without breaking backwards-compatibility if possible, so a better API to the parser and better position information in the syntax tree are definitely a way to go forward.

andreasabel commented 3 years ago

One could boost the layout mechanism by allowing also declared tokens as layout keywords, i.e.,

layout TokLet;
layout stop TokIn;

I am getting doubts whether this would be a common use case or just feature bloat. It seems that the OP wants something like that only for getting position information, but providing more position information should rather be addressed more systematically.

wenkokke commented 3 years ago

Perhaps a BNFC flag which adds the full range of input tokens parsed to create this node to each CST node?

andreasabel commented 3 years ago

Perhaps a BNFC flag which adds the full range of input tokens parsed to create this node to each CST node?

Atm, we do not produce CSTs. We currently only attach the starting position to each node (with --functor).

I understand your proposal in this way: instead of just attaching a single position to each node, attach a list of start positions of each production item (be it terminal or nonterminal). E.g., a node for

ELet.  Expr ::= "let" Bind "in" Expr;

would look like

ELet :: [Position] -> Bind -> Expr -> Expr

example = ELet [posLet, posBind, posIn, posExpr] bind expr

This could be happening under a flag that addresses the parser.

wenkokke commented 3 years ago

More or less, though instead of only providing the first position of each token it'd be better to provide the entire range, either as a pair of Position or using Data.Range.

The reason is that you don't know the length of every token, so this information is difficult to reconstruct.

MatthewDaggitt commented 3 years ago

That would certainly address our current problems. I can't help but wonder though if it's a bit hacky though, for example you would have to pattern match on the list and there's no static guarantee that it would have the right number of elements. I guess it could be made a tuple of the correct arity rather than a list... I don't know, offering the full CST seems like a much more principled way to go, although I realise it's a lot more effort.