Open nikomatsakis opened 2 years ago
I've started on this.
I was thinking we could do it using some kind of parser generator tool that supports LL(k) or perhaps (better) GLL/GLR. Looking around at what I see on crates.io, parol looks like the best fit, and most actively maintained.
Of these two, LL(k) would be better, as it enforces non-ambiguity.
https://github.com/brson/dada/blob/grammar/components/dada-reference-grammar/src/grammar.par
It passes all but 28 of 107 tests. I'm struggling with it though since I don't understand how to write grammars well. So far parol says what I've written is LL2.
parol has a neat feature that it can emit random valid syntax, which could be useful for testing.
Now it's only 9 tests failing, all related to format strings (which I haven't tried to parse yet); and return without value and unit/tuples. The latter two both seem ambiguous to me.
Return without expression because if the expression is optional then it is not possible to have any expression-statements after a return statement
e.g.
fn foo() {
return
dead_code_here // can't decide what this is?
}
Unit/tuples, at least in statement position, are ambiguous with function calls per https://github.com/dada-lang/dada/issues/117
I am though very happy for you to correct me and fix my grammar definition. I don't understand this subject well.
Nice! I'll make some time to check this out.
The parser has to be generated manually right now and the command for that is here:
https://github.com/brson/dada/blob/grammar/components/dada-reference-grammar/src/lib.rs#L3
I added two xtasks for grammar so you don't have to type the full commands:
cargo xtask grammar build
cargo xtask grammar decidable
which shows the maximum lookahead33 tests failing now. I'll give it another look soon. (edit: looks to be mostly unary ops, that i haven't implemented yet).
Well I fixed the problem with return-without-expr by making it only possible to appear as the final expression in a block.
@brson whoops, forgot about the return
question. To be honest, I hadn't thought about it, but that rule (must be last in block) makes sense. Presumably that would apply to break
and continue
? I would probably have expected to use newline sensitivity -- i.e., there must be some token on the same line as return
(could be an opening (
).
If we're going to be newline-sensitive for disambiguation elsewhere then we should probably do it for return/break/continue as well.
Unasigning myself. Busy with other things lately.
Regarding
fn foo() {
return
bar
}
the current parser parses it as return with value bar
, which I think it makes sense, because you can do:
fn foo() {
return,
bar
}
if you want a return without value.
I've been experimenting with tree-sitter here: https://github.com/vemoo/dada/blob/tree-sitter/tree-sitter-dada/grammar.js
I'm not sure if tree-sitter is a good idea for defining a reference grammar, but it wasn't to difficult to create the grammar and it's parsing most of the dada_tests files but with some differences.
I'm now taking a look at the work done by @brson and experimenting with it, but I've found that some changes to the grammar file make parol hang, but I'm not sure if its an issue in the grammar or a bug in parol.
Thinking some more about this, it is a bit weird that newlines sometimes are interpreted as separators (fields, parameters, tuples, sometimes expressions) and other times as trivia.
Maybe newlines should not be automatically skipped in the parser, and should only be skipped where they are not ambiguous?
Maybe -- I was hoping to make newlines kind of magically "dwim" at all the right times, but I wouldn't be surprised if it doesn't always do what is expected.
I've been trying to complete the parol grammar here: https://github.com/vemoo/dada/tree/grammar but I'm finding it much harder to work with than tree-sitter. I think it's because a lot of the changes I try cause the grammar to no longer be LL(k), so that may be a LL(k) vs GLR difference.
Another thing, I've noticed that there are a few issues that affect the grammar: #177, #156, #134, #117 Should we create a label for them? To know that they could affect the reference grammmar.
https://github.com/brson/dada/blob/grammar/components/dada-reference-grammar/src/grammar.par
It passes all but 28 of 107 tests. I'm struggling with it though since I don't understand how to write grammars well. So far parol says what I've written is LL2.
Our parser is hand-written and I'm ok with that, but it'd be nice to have a reference grammar, particularly one that enforces non-ambiguity.
I was thinking we could do it using some kind of parser generator tool that supports LL(k) or perhaps (better) GLL/GLR. Looking around at what I see on crates.io, parol looks like the best fit, and most actively maintained.
I definitely want something on crates.io so that we can make a
dada-reference-grammar
crate, runcargo test
on it, and so forth, without having to install any external tools.