dotnet / vblang

The home for design of the Visual Basic .NET programming language and runtime library.

Spec clarification: downloadable grammar cannot be used by ANTLR to generate a parser #281

Open zspitz opened 6 years ago

zspitz commented 6 years ago

The grammar available from here (link at bottom of page) is supposed to be in ANTLR format. However, ANTLR reports many warnings and errors:

warning(184): d:\Zev\Projects\vbnet-grammer\vb.g4:187:0: One of the token Keyword values unreachable. REM is always overlapped by token CommentMarker
warning(184): d:\Zev\Projects\vbnet-grammer\vb.g4:485:0: One of the token CCBinaryOperator values unreachable. & is always overlapped by token LongTypeCharacter
warning(184): d:\Zev\Projects\vbnet-grammer\vb.g4:1386:0: One of the token OverloadableOperator values unreachable. & is always overlapped by token LongTypeCharacter
warning(184): d:\Zev\Projects\vbnet-grammer\vb.g4:238:0: One of the token BooleanLiteral values unreachable. False is always overlapped by token Keyword
warning(184): d:\Zev\Projects\vbnet-grammer\vb.g4:238:0: One of the token BooleanLiteral values unreachable. True is always overlapped by token Keyword
...
error(119): d:\Zev\Projects\vbnet-grammer\vb.g4::: The following sets of rules are mutually left-recursive [CCExpression, CCOperatorExpression] and [Expression, IsExpression, MemberAccessBase, DictionaryAccessExpression, InvocationExpression, IndexExpression, AdditionOperatorExpression, SubtractionOperatorExpression, MultiplicationOperatorExpression, FPDivisionOperatorExpression, IntegerDivisionOperatorExpression, ModuloOperatorExpression, ExponentOperatorExpression, RelationalOperatorExpression, LikeOperatorExpression, ConcatenationOperatorExpression, ShortCircuitLogicalOperatorExpression, LogicalOperatorExpression, ShiftOperatorExpression, XMLMemberAccessExpression, TypeExpression, MemberAccessExpression, OperatorExpression, ArithmeticOperatorExpression, DivisionOperatorExpression]
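For readers unfamiliar with warning 184: an ANTLR 4 lexer picks the longest match, and on a tie the rule defined first wins, so a literal that an earlier rule also matches can never be emitted by the later rule. The Python sketch below imitates that tie-break on a toy rule list. The rule names come from the warnings, but the patterns and their order are assumptions for illustration, not the contents of vb.g4.

```python
import re

# Toy lexer rules in definition order. The patterns are illustrative
# assumptions, not the rules from the downloadable vb.g4.
RULES = [
    ("CommentMarker", r"'|REM"),
    ("Keyword", r"REM|True|False|Dim"),
    ("BooleanLiteral", r"True|False"),
]


def winning_rule(text):
    """Return the rule that wins at the start of `text`.

    Longest match first; on a tie, the earliest-defined rule.
    """
    best_name, best_len = None, 0
    for name, pattern in RULES:
        m = re.match(pattern, text, re.IGNORECASE)
        # Strict '>' keeps the earlier rule when two matches have equal length.
        if m and m.end() > best_len:
            best_name, best_len = name, m.end()
    return best_name


def unreachable_literals():
    """List (rule, literal) pairs where another rule always wins the literal."""
    shadowed = []
    for name, pattern in RULES:
        for literal in pattern.split("|"):
            if winning_rule(literal) != name:
                shadowed.append((name, literal))
    return shadowed


for rule, literal in unreachable_literals():
    print(f"{literal} in {rule} is always overlapped by {winning_rule(literal)}")
```

With this toy ordering the sketch reports REM, True and False as shadowed, which mirrors the shape of the warnings above.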

Since ANTLR is a standard format for defining language grammars, I suggest modifying the grammar to make it ANTLR4 compatible.

Alternatively, the format should not be described as ANTLR.

sharwell commented 6 years ago

:bulb: My fork of ANTLR 4 may be able to unravel the left recursion error without modifying the grammar. It has initial support for automatic indirect left-recursion elimination.

zspitz commented 6 years ago

@sharwell Perhaps, but it actually seems as though the document author meant to say "The grammar makes use of ANTLR syntax in some places", as there are a number of other issues with the grammar besides the mutual left-recursion, such as not using a different lexical mode to describe XML literals, or rules like the following:

Character:
    '<Any Unicode character except a LineTerminator>'
    ;
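To spell out the problem with this rule: to a parser generator, a single-quoted string is a literal, so the rule matches only that exact text rather than the class of characters the prose describes. A minimal Python sketch of the difference follows. The set of line terminators used here (CR, LF, U+2028, U+2029) is an assumption and should be checked against the spec.

```python
# Assumption: LineTerminator covers CR, LF, U+2028 and U+2029.
LINE_TERMINATORS = {"\r", "\n", "\u2028", "\u2029"}

# What the quoted rule body means to a tool: one fixed literal string.
PLACEHOLDER = "<Any Unicode character except a LineTerminator>"


def matches_as_literal(text):
    """Match the rule as written: only the literal placeholder text."""
    return text == PLACEHOLDER


def matches_as_intended(text):
    """Match what the prose describes: one character that is not a line terminator."""
    return len(text) == 1 and text not in LINE_TERMINATORS


print(matches_as_literal("x"))    # the rule as written rejects an ordinary character
print(matches_as_intended("x"))   # the intended reading accepts it
print(matches_as_intended("\n"))  # and rejects a line terminator
```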

ljw1004 commented 6 years ago

ANTLR means two things: (1) a syntax for writing grammars, along with tools and editors that do simple well-formedness checks on grammars written in that syntax, (2) a parser-generator which takes grammars written in that syntax and produces a parser for it.

My goal in writing in ANTLR format was to get assistance from (1) to ensure that the grammar as written in the C# and VB specifications was correct. As for (2) using it to automatically produce a parser? That was very definitely a non-goal. There's indeed no point in it -- doing so would produce a parser that's inferior to the one in Roslyn, and isn't quite as accurate, and has worse error recovery and worse error messages, and lags behind it.

If we tried to conform to the rules of ANTLR's parser-generator, it would actually harm the primary goal. That's because C# and VB have grammars that are purposefully ambiguous in order to optimize for human readability. A human can read a simple three-line production, and refer to the accompanying prose to figure out how it applies. But usually, if we tried to encode the same disambiguation into the grammar itself, then those simple three lines would mushroom into fifty confusing lines.

Similarly with mutual left-recursion and overlapping tokens. Sure, they're a problem for the ANTLR parser-generator. But they're not a problem for a human who reads the spec. If we turned the grammar into something that ANTLR's parser-generator could use, then we'd get a longer and more confusing spec.
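As background on the mutual left-recursion point: ANTLR 4 rewrites a rule that refers to itself in leftmost position, but it rejects a cycle that passes through several rules, which is what error 119 reports. The Python sketch below finds such cycles in a toy grammar. The rule names are taken from the error message, while the productions are simplified assumptions and not the spec's own. It also ignores empty alternatives, so it is only a rough approximation of the real check.

```python
# Toy grammar: each rule maps to its alternatives, each a list of symbols.
# Rule names come from the error message; the productions are assumptions.
GRAMMAR = {
    "Expression": [["Literal"], ["OperatorExpression"], ["InvocationExpression"]],
    "OperatorExpression": [["AdditionOperatorExpression"]],
    "AdditionOperatorExpression": [["Expression", "'+'", "Expression"]],
    "InvocationExpression": [["Expression", "'('", "')'"]],
    "Literal": [["IntegerLiteral"]],
}


def leftmost_rules(rule):
    """Rules that appear as the first symbol of some alternative of `rule`."""
    return {alt[0] for alt in GRAMMAR[rule] if alt and alt[0] in GRAMMAR}


def reachable_leftmost(start):
    """All rules reachable from `start` by following leftmost symbols."""
    seen, stack = set(), [start]
    while stack:
        for nxt in leftmost_rules(stack.pop()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def mutually_left_recursive_sets():
    """Group rules that reach each other through leftmost positions."""
    reach = {rule: reachable_leftmost(rule) for rule in GRAMMAR}
    groups = []
    for rule in GRAMMAR:
        group = sorted(r for r in reach[rule] if rule in reach[r])
        # A group of one is plain direct left recursion, which is not reported here.
        if len(group) > 1 and group not in groups:
            groups.append(group)
    return groups


print(mutually_left_recursive_sets())
```

In this toy grammar every rule except Literal ends up in one group, because each can reach Expression in leftmost position and be reached from it. Flattening such a group into a single directly left-recursive rule is the kind of rewrite that would make the spec grammar longer.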

In practical terms, these goals boil down to just "The goal is that ANTLRWorks should show no squiggles when editing the grammars as used in the C# and VB language specification documents. It is a non-goal that GenerateParser should have no errors." So we get benefits like these...

I do accept that it's really useful to have a parser for VB and C#. I just don't think that we get benefit from having that parser be generated by ANTLR. We're better off with the one that ships with Roslyn.

zspitz commented 6 years ago

@ljw1004

(I would like to apologize if I came across rather harshly.)

I understand and agree with what you are saying. I therefore suggest the following additions to the spec document:

I think that just saying "Here is an ANTLR grammar" leaves things open to confusion. I know I fully expected to download the grammar and be able to use all the features in my ANTLR environment of choice (VS Code with this extension, which is still in active development, unlike ANTLRWorks; and I highly recommend it on its own merits). In hindsight, many of these unavailable features are specific to the task of building a parser; but a disclaimer would have been very useful.