Accepting a broader a range of grammars and generating an AST with closely resembling structure

I've spent some time familiarising myself with Fling and I am positively impressed by your work. Naturally, I am starting to hit some limitations as I explore further. So the goal of this issue is to discuss one particular limitation. I'd like to know what you think and whether it is possible to overcome this limitation in Fling.

The limitation comes from the restriction on the class of accepted grammars, which must be LL. Here are some challenges that stem from it:

Any sophisticated grammar needs to be rewritten to fit the class of LL grammars. This is mostly a matter of programming ergonomics, and I can see how the translation could be implemented.
Generated ASTs directly correspond to their LL grammars. This challenge is the most significant one as it makes the generated AST difficult to work with: it is hard to reason about in the same way as LL grammars are hard to read. Moreover, there is an issue of evolvability. Certain changes that induce little semantic variation, such as operator precedence, might require significant changes to the LL grammar resulting in proportionate changes to the resulting AST.

For example, let's consider a context-free grammar for a language that has predicates and means of combining them:

Condition = Predicate | Condition 'and' Condition | Condition 'or' Condition | '(' Condition ')'
Predicate = ...

Notably, this grammar is not LL as it involves left recursion. To provide this grammar to Fling, it has first to be transformed:

Condition = Predicate Condition$$ | CompoundCondition
CompoundCondition = '(' Condition ')' Condition$$
Condition$ = 'or' Condition | 'and' Condition
Condition$$ = Condition$ Condition$$ | ε

Now in Fling:

derive(Condition).to(Predicate, Condition$$).or(CompoundCondition).
derive(CompoundCondition).to(begin, Condition, end, Condition$$).
derive(Condition$).to(or, Condition).or(and, Condition).
derive(Condition$$).to(Condition$, Condition$$).orNone().

And the resulting AST:

interface Condition {}
record Condition1(Predicate predicate, Condition$$ condition$$) implements Condition {}
record CompoundCondition(Condition condition, Condition$$ condition$$) implements Condition {}

interface Condition$ {}
record Condition$1(Condition condition) implements Condition$ {}
record Condition$2(Condition condition) implements Condition$ {}

interface Condition$$ {}
record Condition$$1(Condition$ condition$, Condition$$ condition$$) implements Condition$$ {}
record Condition$$2 implements Condition$$ {}

Clearly, such an AST is difficult to reason about because this is the level of LL grammars.

Ideally, an AST corresponding to the original grammar would have the following form:

interface Condition {}
record AndCondition(Condition left, Condition right) implements Condition {}
record OrCondition(Condition left, Condition right) implements Condition {}
record CompoundCondition(Condition cond) implements Condition {}
record Predicate(...) implements Condition {}

And the original grammar would be specified as follows:

specialize(Condition).into(AndCondition, OrCondition, CompoundCondition, Predicate)
derive(AndCondition).to(Condition, and, Condition).
derive(OrCondition).to(Condition, or, Condition).
derive(CompoundCondition).to(begin, Condition, end).
derive(Predicate).to(...).

Overall, it would be great to be able to:

Specify a context-free grammar as input to Fling (given that it can be transformed into LL).
Have the generated AST resemble the input grammar as closely as possible.

As I already mentioned, I know that CFG to LL translation is possible, so point 1 could be achieved. The requirement outlined in point 2, however, seems vastly more difficult to achieve. It would probably also require changes to the parsing algorithm. A mapping between the original and transformed grammars would need to be maintained and used to construct the desirable AST. I can only guess at possible approaches, so I strongly wish to hear what the authors of Fling think.

I am positively impressed by your work.

I'm happy to hear that.

The limitation comes from the restriction on the class of accepted grammars, which must be LL.

Take a look at the accompanied paper, where we presented a general solution for all deterministic context-free languages or, equivalently, all LR grammars. A few years have passed and there's now much more to be said about our algorithm. Especially, we completely missed the connection between our "tree embedding" and real-time pushdown tree automata a la Guessarian 1984. We chose to implement LL grammars because it's much easier. Yamazaki et al. 2019's embedding also supports LR grammars but implemented for LALR grammars.

I know that CFG to LL translation is possible, so point 1 could be achieved.

That's incorrect. LL grammars can express a strict subset of deterministic CFLs which is in turn a strict subset of CFLs.

Have the generated AST resemble the input grammar as closely as possible.

You get $$ variables because we use a naive namer (NaiveNamer) which doesn't make any smart attempts at guessing intermediate variable names. You could specify them explicitly though:

A := x y | w z

A := B | C

B := x y C := w z

That should work I think.

A better namer (Namer) could possibly infer better names automatically.

@OriRoth, I am familiar with the paper - it is through the paper that I found out about your work. However, I must say that I am yet to grasp the tree encoding you described.

Regarding my statement about the translation of CFG to LL. While it is technically incorrect, I rather meant that it is possible to translate a subset of CFLs that are expressible in an LL grammar, such as the one exemplified (predicates and combinations), not all CFLs of course.

You mention the issue with picking appropriate names for intermediate variables. However, that is not the central issue I was trying to raise, so let me be more clear about it. I consider the existence of AST classes corresponding to intermediate variables itself problematic. In the above example, I could explicitly name intermediate variables in my grammar or, as you suggest, provide a smarter Namer, but that will still lead to an obscure AST being generated. Therefore, the focus should not be on names but on the structure of the AST. That is what I tried to convey by providing an example of the desired AST, which I shall repeat below:

interface Condition {}
record AndCondition(Condition left, Condition right) implements Condition {}
record OrCondition(Condition left, Condition right) implements Condition {}
record CompoundCondition(Condition cond) implements Condition {}
record Predicate(...) implements Condition {}

This AST does not contain any intermediate variables. It directly corresponds to the original CFG (also exemplified in the issue description but not repeated in this comment) which is not LL, but the underlying language can be expressed in LL (as it was shown in the issue description), hence the need for grammar transformation. This poses a challenge for parsing and, ultimately, AST generation. The goal is for the AST to have a structure that closely resembles the original grammar; the LL grammar (a result of grammar transformation) would then remain an implementation detail.

Ok, I think I get it now. Quite interesting. Wouldn't supporting a different class of grammars, such as LR or LALR, alleviate the problem? It's possible in theory. It might also be interesting to see how this issue is approached in classical parsers that support LL grammars, such as ANTLR. If they came up with creative solutions, these may be applicable here as well.

OriRoth / fling

Accepting a broader a range of grammars and generating an AST with closely resembling structure #24