Open MatthewDaggitt opened 3 years ago
The BNFC philosophy is that token
s have some content you are interested in, whereas keywords like "let"
have none. So the latter are not represented in the AST, and consequently have no position information.
One could boost the layout mechanism by allowing also declared tokens as layout keywords, i.e.,
layout TokLet;
layout stop TokIn;
Beyond that, one could detect choice-less tokens such as TokLet
and represent them as unit types in the AST...
In general, it would be nice to have BNFC produce a concrete syntax tree (CST) and a procedure to extract the AST from the CST. But this is a quite radical redesign.
Thanks for the reply @andreasabel. So, without making keywords such as let
a token, what is the procedure to extract the location of the let
expression in the source code for error reporting purposes whilst adhering to the BNFC philosophy? At the moment my main motivating example is trying to work around #371, i.e. throw a meaningful error with source code locations if the list of let bindings is empty.
However, I think I must be misunderstanding about how to go about extracting positions generally. For example, if one were implementing a language server one would need to get positions for parse errors to highlight the location where the parsing fell over. But the pX
functions generated in the Par
module return Either String a
with the positions already converted to the String
. Am I misunderstanding something or is it something that BNFC does not support?
But the
pX
functions generated in thePar
module returnEither String a
with the positions already converted to theString
.
I think the error reporting API also needs a overhaul. The error location shouldn't be mangled into the error message.
In general, BNFC in its present state should be thought of as a wizard generating a parser and printer for prototyping rather than professional use. In the worst case, you can get a happy parser from BNFC that sort of does the thing and then hand-edit it to your satisfaction.
That said, in general I am looking to make BNFC more professional without breaking backwards-compatibility if possible, so a better API to the parser and better position information in the syntax tree are definitely a way to go forward.
One could boost the layout mechanism by allowing also declared tokens as layout keywords, i.e.,
layout TokLet; layout stop TokIn;
I am getting doubts whether this would be a common use case or just feature bloat. It seems that the OP wants something like that only for getting position information, but providing more position information should rather be addressed more systematically.
Perhaps a BNFC flag which adds the full range of input tokens parsed to create this node to each CST node?
Perhaps a BNFC flag which adds the full range of input tokens parsed to create this node to each CST node?
Atm, we do not produce CSTs. We currently only attach the starting position to each node (with --functor
).
I understand your proposal in this way: instead of just attaching a single position to each node, attach a list of start positions of each production item (be it terminal or nonterminal). E.g., a node for
ELet. Expr ::= "let" Bind "in" Expr;
would look like
ELet :: [Position] -> Bind -> Expr -> Expr
example = ELet [posLet, posBind, posIn, posExpr] bind expr
This could be happening under a flag that addresses the parser.
More or less, though instead of only providing the first position of each token it'd be better to provide the entire range, either as a pair of Position
or using Data.Range
.
The reason is that you don't know the length of every token, so this information is difficult to reconstruct.
That would certainly address our current problems. I can't help but wonder though if it's a bit hacky though, for example you would have to pattern match on the list and there's no static guarantee that it would have the right number of elements. I guess it could be made a tuple of the correct arity rather than a list... I don't know, offering the full CST seems like a much more principled way to go, although I realise it's a lot more effort.
When using the parser generated by the grammar fragment
to parse:
I get
syntax error at line 2, column 8 before 'y'
.If instead I use the grammar:
then everything parses fine. The main problem with this is that I think(?) it means that I'm unable to get position information for the "let" and "in" tokens.