BNFC / bnfc

BNF Converter
http://bnfc.digitalgrammars.com/
578 stars 163 forks source link

Removal of zero-width matches in generated tree-sitter grammar files, preliminary implementation #475

Open chaserhkj opened 5 months ago

chaserhkj commented 5 months ago

Tree-sitter has a known limitation of not handling symbols that can match zero-width (a.k.a. "empty") strings. (Documented here) The only exception seems to be the start symbol.

However, in BNFC, zero-width matching symbols are supported and used in many examples and test cases. (e.g. The list macros in BNFC) It would be impractical to not support them in the tree-sitter backend.

It is possible to eliminate all zero-width matching symbols from a grammar: First, all symbols that could match zero-width need to be identified. Then these symbols can be converted to valid tree-sitter symbols by removing the RHS empty branches from its rules. Then all rules of the entire grammar need to be processed, wrapping all of the references to the aforementioned symbol with optional(..). Finally since the addition of optional(..) can populate zero-width-matching to other symbols, this needs to be done repeatedly until converging on a fixed point or the zero-width-matching is populated to the start symbol.

This PR implements a preliminary version of the above-described algorithm. The converging process is not done yet and currently, it only handles simple cases of zero-width-matching where there is only one layer of matching. It should be sufficient enough for most cases like the list-related internal macros provided in BNFC.

andreasabel commented 1 month ago

Thanks, @chaserhkj! Sorry for the delay, I was busy teaching in early Spring, and then lost track of this PR.

I gave the testing a try but it seems like most (all?) of the tree-sitter tests fail.

What would be the way to use this backend? Could you write some backend documentation at https://github.com/BNFC/bnfc/blob/master/docs/user_guide.rst ? Proposed content:

Also, we need some CHANGELOG entry.

andreasabel commented 1 month ago

So it seems that the fix here is not complete, e.g. with CF file

separator Integer "  "

and the grammar.js generated by bnfc --tree-sitter ...

$ tree-sitter generate ./grammar.js 
The rule `list_token_Integer` matches the empty string.

Tree-sitter does not support syntactic rules that match the empty string
unless they are used only as the grammar's start rule
chaserhkj commented 1 month ago

Sorry I wasn't able to look at this again in the past few months and I kinda forgot where I was.

I think I at least implemented the simplest cases, but I might be wrong. I might not have the immediate time to work on this, but I'll see if I could get back to this (and the documentations) in one week or two.