[From Neal Gafter in private mail to Rex on 2020-11-18, after having reviewed Rex's paper.]
We should not attempt to produce a grammar that will validate using the ANTLR toolkit. There are some parts of the syntax that cannot be neatly explained in the grammar, and would result in a grammar that definitely does not validate (i.e. it is ambiguous). https://github.com/ECMA-TC49-TG2/conversion-to-markdown/blob/v6-draft/lexical-structure.md#725-grammar-ambiguities is a good example. Interpolated strings are another (separating the syntactic and lexical parts might make things less clear). See also https://github.com/ECMA-TC49-TG2/conversion-to-markdown/blob/v6-draft/expressions.md#1288-cast-expressions. I think we should use the formalism of ANTLR to document our syntax with as much rigor as we have historically used.
If we wanted to do this [validate, that is], we probably would use parser modes for async (different set of keywords inside an async method), LINQ (different set of keywords inside the query), and possibly interpolated strings.
As for `unsafe`, it can be written

```ANTLR
class_modifier
    : 'unsafe'
    ;
```

because (if I'm not mistaken) you are allowed to have multiple rules for the same nonterminal, and they are merged. But the surrounding text should make it clear that this is in addition to other alternatives specified elsewhere.
We specify associativity and precedence by spelling out the structure of the expression grammar explicitly. We don't need to specify precedence or associativity using ANTLR features.
[Rex drafted the following text.]
Assuming we go this route, we'll need to add some appropriate text to 7.2, Grammars.
If we buy into Proposal 1 (https://github.com/ECMA-TC49-TG2/csharpstandard/issues/37#issuecomment-731419531), we avoid the following (the numbers in parens refer to sections in Rex's paper):
- the `'<...>'` notation (3), such as the following:

```ANTLR
Not_Slash_Or_Asterisk
    : '<Any Unicode character except / or *>'
    ;

Letter_Character
    : '<A Unicode Character of classes Lu, Ll, Lt, Lm, Lo, or Nl>'
    | '<A Unicode_Escape_Sequence representing a Char of classes Lu, Ll, …>'
    ;
```

- the `fragment` rule name prefix (6) for `Not_Slash_Or_Asterisk` and `Letter_Character`.

The ANTLR convention is that a lexer rule name begins with an uppercase letter, and may be followed by uppercase and/or lowercase letters, digits, and underscores.
We do not currently begin lexer rule names with an uppercase letter, and we should. However, I think that using all-caps is distracting. And as we're used to having a dash separator in v5 and prior (and now underscore), it seems useful to continue having separators. The remaining question then is how to use case. Whatever decisions are made, it's a matter of running a set of mechanical conversions on the existing spec grammar fragments and rule names in the narrative to make these changes.
If we adopt this change, it will involve a lot of edits to the Lexical structure clause and a few in other clauses that refer to lexer rules. We can't simply use a find-and-replace-text tool as we have some lexer rule names that are commonly used English words in the spec (such as whitespace, comment, identifier, and keyword). So, we'll want to convert all lexer rule names inside a grammar block and between italic fences, as in rule_name, but not elsewhere.
Recommendation: We change to starting lexer rule names with an uppercase letter. Whether we also up-case the first letter of subsequent words is debatable. Here are the obvious styles:
• Integer_literal // first word leading cap
• Integer_Literal // all words leading cap [Rex's first choice]
• IntegerLiteral // all words leading cap, no separators
Before locking in a decision, we should look at parser rule names, and be consistent or deliberately different. Perhaps the simplest rule to follow here is “all lowercase except the initial letter of each word.” That is the style I use throughout this paper.
Resolved: During the telcon on 2021-01-13, we chose Recommendation 2, Integer_Literal, with all words having a leading uppercase letter, and words separated by underscore.
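For what it's worth, the agreed rename is mechanical enough to script. Here is a hedged Python sketch (the function name is mine, not part of any spec tooling) of the conversion from a lowercase rule name to the chosen `Integer_Literal` style:

```python
# Illustrative sketch (not part of the spec tooling) of the mechanical
# rename agreed above: each underscore-separated word of a lexer rule
# name gets a leading uppercase letter, e.g. integer_literal -> Integer_Literal.
def to_lexer_rule_style(name: str) -> str:
    return "_".join(w[:1].upper() + w[1:] for w in name.split("_"))

print(to_lexer_rule_style("integer_literal"))          # Integer_Literal
print(to_lexer_rule_style("unicode_escape_sequence"))  # Unicode_Escape_Sequence
```

The real conversion would of course be restricted to names inside grammar blocks and italic fences, per the discussion above.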
I would prefer it if the ANTLR grammar were written so it would validate; I would have thought this was a key reason to switch to ANTLR notation. I haven't looked into whether an ANTLR grammar that validates is possible, hence the "prefer" rather than "should".
That said, if the grammar does not/cannot validate, then I think the Standard must state this. Expecting an ANTLR grammar to validate is, at minimum, reasonable given the purpose of ANTLR; so the Standard should make it clear that ANTLR is used as a notation only – and then leave it at that, as it does beg the question of why the Standard uses a notation that cannot describe the language's syntax, and we probably shouldn't raise that in the Standard text!
Note that if the limitations of ANTLR are restricted to such things as specifying “any character in Unicode class X” then that looks like a reasonable extension to ANTLR that could be defined (think of the various extended BNFs that exist, I’ve defined one myself when the need arose – though I’ve long since forgotten what that need was just that I had to write a document on the grammar notation! ;-)), but the Standard may not want to go there.
Proposal 1: We should use ANTLR notation with common conventions. However, we should not spend any effort on getting the grammar to validate with the ANTLR toolkit.
After the previous comment I glanced quickly at the ANTLR docs thinking it surely must handle Unicode character classes and it does, see https://github.com/antlr/antlr4/blob/master/doc/lexer-rules.md.
I've taken a brief look at ANTLR and I believe the rules Rex quotes above can be written:
```ANTLR
Not_Slash_Or_Asterisk
    : ~[*/]   // Any Unicode character except / or *
    ;

Letter_Character
    : [\p{Lu}\p{Ll}\p{Lt}\p{Lm}\p{Lo}\p{Nl}]   // A Unicode character of classes Lu, Ll, Lt, Lm, Lo, or Nl
    | Unicode_Escape_Sequence { IsLetterCharacter() }?   // as previous but using an escape sequence
    ;
```
The last rule includes a semantic predicate, `{ IsLetterCharacter() }?`; this is a way context-sensitive or other semantic checks can be included in the syntactic grammar. If the function `IsLetterCharacter()` returns false, the alternative is "unviable". Semantic predicates can be included in both parser and lexer rules. The predicates themselves can be defined in the same file as the grammar; are written in Java, C++, JavaScript, or any other language ANTLR supports; and have access to the source being parsed/lexed.
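To make the predicate idea concrete, here is a hedged Python sketch (Python rather than an ANTLR target language; the function name follows Nigel's example but the body is my assumption) of the check an `IsLetterCharacter()` predicate would perform on a `\uXXXX` escape:

```python
import unicodedata

# Hypothetical sketch of the test behind an IsLetterCharacter() predicate:
# decode the \uXXXX escape text and check the denoted character's Unicode
# general category against Lu, Ll, Lt, Lm, Lo, and Nl.
def is_letter_character(escape: str) -> bool:
    ch = chr(int(escape[2:], 16))  # escape is the 6-character source text \uXXXX
    return unicodedata.category(ch) in {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}

print(is_letter_character(r"\u0041"))  # 'A' is class Lu, so True
print(is_letter_character(r"\u002A"))  # '*' is class Po, so False
```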
The above doesn't tell us that all the context-sensitive/semantic needs of the C# grammar can be supported, but I think we should find out – I have a certain unease in publishing a language Standard with a known invalid/incomplete grammar.
I'm definitely fine with making it clear that the grammar doesn't validate. If there's no way we can make it validate, for reasons expressed before (e.g. interpolated string literals and the grammar ambiguities in 7.2.5) I would be interested to know how feasible it would be to make it "nearly validate" - is there anything we can do to ensure that we don't create invalid rules by mistake? If ANTLR can list the ways in which it's invalid, and if that's a small-but-sensible list, we could perform validation within a GitHub action and compare the result against the expected set of failures.
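The "compare against an expected set of failures" idea might be sketched like this (the warning strings and function are purely illustrative; real ANTLR diagnostics will differ):

```python
# Illustrative sketch of the proposed GitHub-action check: the grammar is
# allowed to produce exactly a known, reviewed set of validation warnings,
# and anything more (or less) fails the build. The entries are made up.
EXPECTED_WARNINGS = {
    "ambiguity in rule interpolated_string",
    "ambiguity in rule cast_expression",
}

def validation_ok(actual_warnings) -> bool:
    return set(actual_warnings) == EXPECTED_WARNINGS

print(validation_ok(["ambiguity in rule interpolated_string",
                     "ambiguity in rule cast_expression"]))  # True
print(validation_ok(["ambiguity in rule cast_expression"]))  # False: one missing
```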
Resolutions from Dec 16, 2020 meeting:
During the Dec 2020 teleconference, I took an action item to identify the ANTLR convention changes we could/should use. As a result, I have updated the first entry in this issue (https://github.com/ECMA-TC49-TG2/csharpstandard/issues/37#issue-747792336). My recommendations are shown with TBD in the table's Status column.
@RexJaeschke: I've just removed the stale meeting labels... do we want to come back to this in the January meeting, in which case I'll reapply them?
@jskeet: I'd very much like us to decide on Proposal 2 (https://github.com/ECMA-TC49-TG2/csharpstandard/issues/37#issuecomment-731433543) on the next call. Assuming we agree to make this change, I can then get started on implementing it in the base spec as well as in the V6 PRs that are impacted..
@RexJaeschke: Thanks - I've added the discuss label. Hopefully if we limit the scope to just that topic, it won't take much time.
We have decided to accept proposal 2, going with the `Integer_Literal` style (so PascalCasing, but with underscores as well).
For anyone who wants to play with the ANTLR toolkit, I have created the following document: ANTLR Grammar Validation.docx.
In 5.3 of my paper (mentioned at the very top of this issue), I discussed the possibility of using some extra lexer rules as synonyms for individual tokens. I now argue in favor of us going that route. Specifically, by adding the following:
```ANTLR
DEFAULT  : 'default' ;
NULL     : 'null' ;
TRUE     : 'true' ;
FALSE    : 'false' ;
ASTERISK : '*' ;
SLASH    : '/' ;
```
[Note the naming convention: Such lexer rules are all-caps to distinguish them from "ordinary" lexer rules, which start with a single cap on each word.]
The multiple places where the expanded literal is used are then replaced with the name. For example:
```ANTLR
Boolean_Literal
    : TRUE
    | FALSE
    ;

Keyword
    : ...
    | TRUE
    | ...
    ;

Pp_Primary_Expression
    : TRUE
    | FALSE
    | ...
    ;
```
As I described in Item 15 of my paper, this issue is a major blocking factor both for the lexer and syntactic grammar. In the next month or so, it would be good to have someone assigned to looking at this.
Currently, we define most of these extensions in unsafe-code.md using the following approach:
```ANTLR
class_modifier
    : …
    | 'unsafe'
    ;
```
where that rule is defined in classes.md, as
```ANTLR
class_modifier
    : 'new'
    | 'public'
    | 'protected'
    | 'internal'
    | 'private'
    | 'abstract'
    | 'sealed'
    | 'static'
    ;
```
As such, we cannot simply include in the grammar we input to the ANTLR toolkit the grammar from unsafe-code.md in a verbatim manner.
BTW, while most extensions involve the addition of an alternative line to an existing rule, several require changes to existing lines. And the extension grammar modifies rules in a number of md files, not just classes.md.
The challenge here is how to deal with this, long term, so we can extract two separate grammars, one core, the other, extended.
Some thoughts to consider:
- Add `//`-style ANTLR comments to the unsafe extensions, so they are easily discernible in the grammar blocks.

Proposal 4: Probably ambivalent on this one myself; for punctuation it often makes sense. I don't know offhand how often `true`/`false` occur in the grammar, but there is no real objection to any key/reserved word getting its own lexical rule.
Proposal 5: No comment ;-) That said I'll probably look into this sometime...
Proposal 6: My first thought was "I wonder how hard hacking ANTLR is to support continuation productions..." :-)
At the moment the language is in terms of "unsafe extensions"; making the proposed change might shift it more towards "unsafe support is optional", and if so, the wording changes suggested by the proposed grammar presentation should perhaps be considered as part of it.
C# effectively has two grammars, one with unsafe productions and one without, and that is how it is currently presented. When checking the grammar, don't both grammars need to be checked independently?
At the moment we have to combine the "continuation" productions to get the grammar with optional parts, which is easily enough done by searching for ellipses. If we present a single grammar complete with optional parts, ripping the latter out to check the grammar without optional parts might end up being more work.
On balance, at the moment I'd keep the continuation productions, but it's not an unchangeable view!
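Combining a base rule with a "..."-style continuation production really is mechanical. Here is a hedged Python sketch (names and the splicing strategy are mine, not any existing tooling) of what "searching for ellipses" amounts to:

```python
# Illustrative sketch of merging an unsafe "continuation" production into
# its base rule: copy the extension's extra alternatives (the lines that
# start with '|', which skips the '...' placeholder line) in before the ';'.
BASE = """class_modifier
    : 'new'
    | 'public'
    ;"""

EXTENSION = """class_modifier
    : ...
    | 'unsafe'
    ;"""

def merge(base_rule: str, ext_rule: str) -> str:
    extra = [ln for ln in ext_rule.splitlines() if ln.strip().startswith("|")]
    lines = base_rule.splitlines()
    return "\n".join(lines[:-1] + extra + [lines[-1]])

print(merge(BASE, EXTENSION))
```

Going the other way (ripping the optional alternatives out of a single combined grammar) would need the optional lines to be marked, e.g. with a trailing comment, which is part of the trade-off discussed above.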
Meeting 2021-02-10:
Regarding proposal 5: my fork of ANTLR supports indirect left recursion elimination.
I need to review the other proposals above to see if I have other feedback on them.
Notes:
Proposal 1: Seems we decided to try and validate the result.
Proposal 2: Standard convention for lexer rules is PascalCase, but anything starting with uppercase will work.
Proposal 3: I'd need to see more specific examples here.
Proposal 4: This may have a negative impact on readability in some edge cases. If this is something we want to pursue, we should consider having a naming convention specifically for these keywords.
Proposal 5: I recommend using my fork of ANTLR for validation. This is the approach we used for some private validation of the Java language a few years back. The grammar might need work before it can be used in some targets (manual refactorings, etc.), but it will make it much easier to validate the grammar against the wording in the language specification. Put another way: we don't want a limitation in a parser generator to have a negative impact on overall clarity in the language specification, so if we have a way to avoid such a limitation we should leverage it to the degree it helps with readability.
Proposal 6: I agree with merging the rules back into the main chapters. The main chapters can call out portions as necessary using a statement that support for the form is optional. The grammar will inevitably be more relaxed than the language semantics, and this is often a desirable characteristic (e.g. for unsafe, the grammar would always parse the keyword and expect a validation pass after parsing to report an error if the keyword is not supported by that compiler).
@MadsTorgersen , @sharwell , @Nigel-Ecma please review this.
Based on our decision in https://github.com/ECMA-TC49-TG2/csharpstandard/issues/37#issuecomment-777008557, here's how I understand we want the unsafe support to be spec'd.
§15.2.2 Class modifiers|15.2.2.1 General
Add to the end of the following grammar rule a forward reference (with comment) to the unsafe modifier rule defined in Clause 23:

```ANTLR
class_modifier
    : 'new'
    | 'public'
    | 'protected'
    | 'internal'
    | 'private'
    | 'abstract'
    | 'sealed'
    | 'static'
    | unsafe_modifier   // unsafe support
    ;
```
After this rule, add the following:
The rule unsafe_modifier is defined in §23.2.
Do likewise for struct_modifier, interface_modifier, delegate_modifier, field_modifier, method_modifier, property_modifier, event_modifier, indexer_modifier, operator_modifier, and constructor_modifier, in their respective clauses.
In each of those clauses, review any forward mention of the unsafe clause in narrative to see if it belongs there or in Clause 23 instead. (§15 Classes has no such references, which is good.)
§23.2 Unsafe contexts
Replace the following:
The associated grammar extensions are shown below. For brevity, ellipses (...) are used to represent productions that appear in preceding clauses.
```ANTLR
class_modifier
    : ...
    | 'unsafe'
    ;

...

constructor_modifier
    : ...
    | 'unsafe'
    ;
```
with
The associated grammar extensions are shown below and in subsequent subclauses.
```ANTLR
unsafe_modifier
    : 'unsafe'
    ;
```
§15.13 Finalizers
Change
```ANTLR
finalizer_declaration
    : attributes? 'extern'? '~' Identifier '(' ')' finalizer_body
    ;
```
to
```ANTLR
finalizer_declaration
    : attributes? 'extern'? unsafe_modifier? '~' identifier '(' ')' finalizer_body
    | attributes? unsafe_modifier? 'extern'? '~' identifier '(' ')' finalizer_body
    ;
```
After this rule, add the following:
The rule unsafe_modifier is defined in §23.2.
Do likewise for static_constructor_modifiers.
§23.2 Unsafe contexts
Delete the extended rules finalizer_declaration and static_constructor_modifiers, as these have been moved to 15.13.
Merge the following rules from Clause 23 into their corresponding chapters, adding a forward pointer to Clause 23, and remove those extended rules from Clause 23:
unmanaged_type
When I created Issue https://github.com/ECMA-TC49-TG2/csharpstandard/issues/211, I was confused by the need to refer to the unsafe clause to understand sizeof's operand, unmanaged_type. That need seemed out-of-place to me then, as it still does now. After having studied §23.3, "Pointer types," at length, I am no longer confused!
Here's my rationale. The concept of an unmanaged type does not depend on the existence of unsafe code; it's part of the core language, so it shouldn't be introduced in Clause 23. Rather, in an unsafe context, the ability to have pointers simply extends the set of unmanaged types already allowed in the core language.
Here is the restructuring needed to achieve this:
Move the (following) grammar for unmanaged_type from unsafe-code.md to types.md:
```ANTLR
unmanaged_type
    : type
    ;
```
Move the (following) description and constraints from unsafe-code.md to types.md, making the inclusion of the "Any pointer_type" bullet conditional, as noted (and moving it to the end of the list):
An unmanaged_type is any type that isn't a reference_type, a type_parameter, or a constructed type, and contains no fields whose type is not an unmanaged_type. In other words, an unmanaged_type is one of the following:
I see two potential homes in types.md for the destination of both these moves: at the end of §9 Types|9.1 General, or in a new subclause, §9.x, "Unmanaged types," at the very end of the clause. I propose the latter, as its contents would be entirely about that topic and a better destination for all redirected references to that term rather than a reference to 9.1, which has other (and unrelated) stuff as well.
In §6, "Conformance", we state,
Conditionally normative text specifies a feature and its requirements where the feature is optional. However, if that feature is provided, its syntax and semantics shall be exactly as specified.
However, the only conditionally normative feature in the spec is the unsafe code clause. Of course, we might add other conditionally normative features in the future, so it's probably best we state implementer expectations re unsafe in the unsafe code clause.
In §23 Unsafe code|23.1 General, I propose we replace
An implementation that does not support unsafe code is required to diagnose any usage of the keyword unsafe.
with
An implementation that does not support unsafe code is required to diagnose any usage of the syntactic rules defined in this clause.
After all, it's not just the unsafe keyword; it's the extensions to types, expressions, statements, and such, as covered later in Clause 23.
The edits proposed above have minimal impact on the spec: they are small and localized, and except for references to unmanaged_type, all existing references to Clause 23 stay the same. The edits also take care of my concerns in Issue 211.
One final tweak needed is that §12.7.11.7, "Anonymous object creation expressions," states,
The expression used in a member-declarator shall have a type. Thus, it is a compile-time error for an expression in a member-declarator to be null or an anonymous function. It is also a compile-time error for the expression to have an [[[unsafe type]]].
As we have no definition for "unsafe type," that will need to be changed. Perhaps it means it can't have a pointer type. Any thoughts?
PR #225 defines the following named lexer tokens, as proposed, and then references those names elsewhere in the lexer grammar, as needed:
```ANTLR
DEFAULT  : 'default' ;
NULL     : 'null' ;
TRUE     : 'true' ;
FALSE    : 'false' ;
ASTERISK : '*' ;
SLASH    : '/' ;
```
As it happens, 5 of these 6 (underlying) tokens are also used as terminals in the syntactic grammar, as follows:

- 'default' is used as an operator, as a switch label, and as the destination of goto
- 'true' and 'false' are overloadable_unary_operators
- '*' is an overloadable operator, the multiplication symbol, and part of pointer notation
- '/' is an overloadable operator, and the division symbol

As the ANTLR validator does not complain/warn about such duplicate terminals (like it does with those in the lexical grammar), and to avoid "polluting" the narrative of the document with the all-uppercase names, I did not replace any of these terminals with the newly defined uppercase names. Holler if you disagree with that.
@sharwell, you wrote, "Regarding proposal 5: my fork of ANTLR supports indirect left recursion elimination." From where can I get this version? I've found various places that talk about this fork, but none that shows me how to download/install it.
@sharwell
Regarding proposal 5: my fork of ANTLR supports indirect left recursion elimination.
Where can I easily get the jar file for your optimized fork? I'm happy to plug that in. Here's the code we use to run the validator now
<…>
'<…>'-style Lexer Rule Alternatives

@sharwell, @Nigel-Ecma please review this.
(See Nigel’s comment re this at https://github.com/ECMA-TC49-TG2/csharpstandard/issues/37#issuecomment-731862227.) Also see Issue https://github.com/ECMA-TC49-TG2/csharpstandard/issues/230.
We have about a dozen occurrences of alternatives consisting of one, two, or “any but one” character. The following recommendations show the current syntax followed by the proposed syntax:
Recommendation 1A: For printable characters, use that character, possibly with a comment if multiple glyphs might look similar, as follows:
```ANTLR
'<_ the underscore character (U+005F)>'
'_'   // the underscore character (U+005F)
```
Recommendation 1B: For non-printable characters having an ANTLR escape, optionally with comment, use that escape, as follows:
```ANTLR
'<Line feed character (U+000A)>'
'\n'   // Line feed
```
Recommendation 1C: For all other non-printable characters, use a Unicode sequence with comment, as follows:
```ANTLR
'<Paragraph separator character (U+2029)>'
'\u2029'   // Paragraph separator
```
BTW, ANTLR has extended its support for Unicode escapes; `\uXXXX` has been augmented by `\u{XXXXXX}`.
Recommendation 1D: For “any but” characters, use an ANTLR set, possibly with comment, as follows:
```ANTLR
'<any character except ">'
~["]   // any character except "
```
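As a sanity check, the ANTLR "any but" set `~["]` behaves like the negated character class `[^"]` familiar from regular expressions (the regex analogy is mine; ANTLR's own set semantics apply in the grammar):

```python
import re

# The ANTLR set ~["] corresponds to the regular-expression negated
# class [^"]: it matches any single character except the double quote.
any_but_quote = re.compile(r'[^"]')

print(bool(any_but_quote.fullmatch("a")))   # True: 'a' is not a quote
print(bool(any_but_quote.fullmatch('"')))   # False: the quote is excluded
```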
A number of alternatives reference other rules, including New_Line_Character, SLASH, or ASTERISK. For example, in the case of

```ANTLR
'<Any Unicode character except a New_Line_Character>'
```

the ideal replacement would seem to be

```ANTLR
~New_Line_Character
```

Unfortunately, the operand of the not-operator (~) is considered to be a sub-rule, and sub-rules are not permitted to contain rule names (such as New_Line_Character), so our only recourse is to use the expanded rule.
Recommendation 2: Use one of the following approaches instead:
```ANTLR
~[\r\n\u0085\u2028\u2029]   // anything but New_Line_Character
~('\r' | '\n' | '\u0085' | '\u2028' | '\u2029')   // anything but New_Line_Character
```
A number of alternatives refer to Unicode class rules. (As reported by Nigel) ANTLR now has direct support for these. The following recommendations show the current syntax followed by the proposed syntax:
```ANTLR
'<Any character with Unicode class Zs>'
[\p{Zs}]   // Any character with Unicode class Zs
```
Zs is the short name for the long name Space_Separator (which may be used instead).
Note: A `\p{…}` sequence can only appear in a set, even though this instance involves a set of 1! FYI, uppercase `\P` negates the selection.
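A quick way to convince oneself about the class names (Python's `unicodedata` module reports the same short category names that `\p{…}` uses):

```python
import unicodedata

# Both the ordinary space and the no-break space are in class Zs
# (Space_Separator); an uppercase letter is in Lu.
print(unicodedata.category("\u0020"))  # Zs
print(unicodedata.category("\u00A0"))  # Zs
print(unicodedata.category("A"))       # Lu
```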
Similarly,
```ANTLR
'<A Unicode character of classes Mn or Mc>'
[\p{Mn}\p{Mc}]   // category Mark, subcategories non-spacing and spacing combining
```
The following introduces general categories, which are groupings of related subcategories; the final alternative is the shortest:
```ANTLR
'<A Unicode character of classes Lu, Ll, Lt, Lm, Lo, or Nl>'
[\p{Lu}\p{Ll}\p{Lt}\p{Lm}\p{Lo}\p{Nl}]   // …
[\p{L}\p{Nl}]   // category Letter, all subcategories; category Number, subcategory letter
```
Recommendation 3: Use the shortest possible set of `\p{…}` sequences with Unicode short names, possibly with a comment, as shown above.
Unicode_Escape_Sequence
Here, we're dealing with the subtle difference between things like `'\\u005F'` and `'\u005F'`, the first being a 6-character string, the second being a single character.
Let’s begin with a simple example of the general problem; how can we rewrite the following?:
```ANTLR
Underscore_Character
    : …
    | '<A Unicode_Escape_Sequence representing the character U+005F>'
    ;
```
A rewrite is
```ANTLR
Underscore_Character
    : …
    | '\\u005F'   // underscore string
    | '\\u005f'   // underscore string
    ;
```
We need to allow for either case for the hex digits a–f. Of course, if an allowed letter, decimal digit, or separator contains multiple hex digits a–f, we'd need four or more alternatives, which gets unwieldy. Fortunately, the Underscore_Character rule is only used in one place, by Identifier_Start_Character. (Underscores in other positions are handled as part of Connecting_Character.)
Recommendation 4A: Show both spellings of the hex sequence for this case.
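The reason both spellings are needed is only that the source text differs; the denoted character is identical, as a quick Python check shows:

```python
# Hexadecimal digits are case-insensitive, so the escapes \u005F and
# \u005f both denote U+005F, the underscore character.
print(int("005F", 16) == int("005f", 16))  # True
print(chr(0x005F))                         # _
```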
Here is the more general case: How to deal with the following?
```ANTLR
'<A Unicode_Escape_Sequence representing a character of classes Lu, Ll, Lt, Lm, Lo, or Nl>'
```
To address this, Nigel proposed using the following
```ANTLR
Letter_Character
    : …
    | Unicode_Escape_Sequence { IsLetterCharacter() }?   // …
    ;
```

The last rule includes a semantic predicate, `{ IsLetterCharacter() }?`; this is a way context-sensitive or other semantic checks can be included in the syntactic grammar. If the function `IsLetterCharacter()` returns false, the alternative is "unviable". Semantic predicates can be included in both parser and lexer rules. The predicates themselves can be defined in the same file as the grammar; are written in Java, C++, JavaScript, or any other language ANTLR supports; and have access to the source being parsed/lexed.
After a brief read about semantic predicates, and some testing, I found that Nigel’s approach validates.
This approach would also need corresponding semantic predicate alternatives, like the following:
```ANTLR
Combining_Character
    : …
    | Unicode_Escape_Sequence { IsCombiningCharacter() }?
    ;

Decimal_Digit_Character
    : …
    | Unicode_Escape_Sequence { IsDecimalDigitCharacter() }?
    ;

Connecting_Character
    : …
    | Unicode_Escape_Sequence { IsConnectingCharacter() }?
    ;

Formatting_Character
    : …
    | Unicode_Escape_Sequence { IsFormattingCharacter() }?
    ;
```
While this approach validates, we’d probably have to add a general note up front re grammars as well as one in the identifier lexer rule section, saying what these things are. Although we don’t have a goal of actually using ANTLR to create a lexer or parser, for testing purposes, I do find it useful to do so, at least on a subset of the language. However, to do that, I’d have to add (at least skeletal) definitions for these methods.
I see two ways to go here:

<...>, in general)

Recommendation 4B: What approach do we want?
We have two alternatives that cannot (easily) be expressed in a grammar:
```ANTLR
'<An Identifier_Or_Keyword that is not a Keyword>'
'<Any Identifier_Or_Keyword except true or false>'
```
My first exposure to grammars was with the C Standard, in which each section having grammar puts that in a Syntax subsection, which is optionally followed by a Constraints subsection. So, I think of each exception in the examples above as being a constraint.
I see three ways to go here:
These are illustrated below, in order:
```ANTLR
'<Any Identifier_Or_Keyword except true or false>'
Identifier_Or_Keyword   // But NOT true or false
Identifier_Or_Keyword   // see constraint below
```

Constraint: The Identifier_Or_Keyword in Conditional_Symbol shall not be true or false.
Recommendation 5: What approach do we want?
Following comments on #225 & #230:
Recommendation 1B: skip using ANTLR escapes and go straight to Unicode
Recommendation 1D: allow sets or alternates as in Recommendation 2, the subjective choice based on readability
Recommendation 2: drop the ANTLR escapes but otherwise allow either
Recommendation 3: Yes use the short names, I think they are in more common usage
Recommendation 4A: Either both spellings or follow 4B and use semantic predicates – choose subjectively
Recommendation 4B: Semantic predicates, so we have a grammar which works. We can provide C# code for the predicates as I understand ANTLR can output C# (I haven't confirmed that and was wondering whether adding Java code for them would be a good look or not ;-))
Recommendation 5: I'd say semantic predicates again. Wherever we use a semantic predicate we can name it and/or add a comment to indicate its purpose – maybe `{ ButNotTrueOrFalse() }?` or `{ IsIdentifierOrKeywordOtherThanTrueAndFalse() }?`.
(Note to myself as much as anyone else.) Presumably in 4A we could use `[fF]` to avoid showing two entirely separate options?
Other than that note, I agree with Nigel. Presumably for recommendation 5, we really need to use something more than textual constraints if we want the grammar to validate correctly?
Where can I easily get the jar file for your optimized fork? I'm happy to plug that in.
It's on Maven Central here: https://search.maven.org/artifact/com.tunnelvisionlabs/antlr4/4.9.0/jar
It looks like the direct download link is: https://repo1.maven.org/maven2/com/tunnelvisionlabs/antlr4/4.9.0/antlr4-4.9.0-complete.jar
Items 1–3 seem fine to me.
4A: `'\\u005F'` is easy, but there is a combinatorics problem for each additional letter. I recommend using `'\\u005' [fF]` instead.
4B: A semantic predicate is the least ugly option here.
5: At this point we're talking more about parser rules than lexer rules.
An Identifier_Or_Keyword that is not a Keyword
I would recommend rewriting this as:
```ANTLR
availableIdentifier
    : Identifier
    | contextualKeyword
    ;

contextualKeyword
    : // itemized list of all keyword-like tokens that are not keywords
    ;
```
Any Identifier_Or_Keyword except true or false
Handling this depends on whether we treat pre-processing as part of the normal lexer pass or as a separate pass before the lexer. If it's all one pass, we could use this:
```ANTLR
conditionalKeyword
    : 'true' | 'false'
    ;

keyword
    : conditionalKeyword
    | // all the keywords except true and false
    ;
```
Re Proposal 6 (unsafe grammar), on this week's call we agreed that "unsafe type" should be "pointer" type. With that change, Rex's proposal was accepted, and he'll turn that into a PR.
Regarding Proposal 3, "<...>" (https://github.com/dotnet/csharpstandard/issues/37#issuecomment-789755202), I've taken into account input from Nigel (https://github.com/dotnet/csharpstandard/issues/37#issuecomment-793154408) and Sam (https://github.com/dotnet/csharpstandard/issues/37#issuecomment-796165854).
The handling of the Unicode sequences in identifiers results in the following grammar, with current alternatives to be replaced commented out and followed by their replacement(s):
Letter_Character
// : '<A Unicode character of classes Lu, Ll, Lt, Lm, Lo, or Nl>'
// | '<A Unicode_Escape_Sequence representing a character of classes Lu, Ll, Lt, Lm, Lo, or Nl>'
: [\p{L}\p{Nl}] // category Letter, all subcategories; category Number, subcategory letter
| Unicode_Escape_Sequence { IsLetterCharacter() }?
;
Combining_Character
// : '<A Unicode character of classes Mn or Mc>'
// | '<A Unicode_Escape_Sequence representing a character of classes Mn or Mc>'
: [\p{Mn}\p{Mc}] // category Mark, subcategories non-spacing and spacing combining
| Unicode_Escape_Sequence { IsCombiningCharacter() }?
;
Decimal_Digit_Character
// : '<A Unicode character of the class Nd>'
// | '<A Unicode_Escape_Sequence representing a character of the class Nd>'
: [\p{Nd}] // category Number, subcategory decimal digit
| Unicode_Escape_Sequence { IsDecimalDigitCharacter() }?
;
Connecting_Character
// : '<A Unicode character of the class Pc>'
// | '<A Unicode_Escape_Sequence representing a character of the class Pc>'
: [\p{Pc}] // category Punctuation, subcategory connector
| Unicode_Escape_Sequence { IsConnectingCharacter() }?
;
Formatting_Character
// : '<A Unicode character of the class Cf>'
// | '<A Unicode_Escape_Sequence representing a character of the class Cf>'
: [\p{Cf}] // category Other, subcategory format
| Unicode_Escape_Sequence { IsFormattingCharacter() }?
;
While the definitions of the Is* predicate methods need to be in the grammar file input to ANTLR when validation occurs (as part of Bill's actions), I see a way to smuggle them in without having to include them in the spec.
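As a rough sketch of what one such predicate might look like (hypothetical helper name; assuming Python's unicodedata categories correspond to the Unicode classes the spec names), a check for Letter_Character over a \uXXXX escape could be:

```python
import unicodedata

# The Unicode general categories named by the Letter_Character rule.
LETTER_CLASSES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}

def is_letter_character(escape: str) -> bool:
    """Return True if a \\uXXXX escape denotes a character of
    classes Lu, Ll, Lt, Lm, Lo, or Nl (per Letter_Character)."""
    code_point = int(escape[2:], 16)  # strip the leading "\u"
    return unicodedata.category(chr(code_point)) in LETTER_CLASSES

assert is_letter_character(r"\u0041")       # 'A', class Lu
assert not is_letter_character(r"\u0030")   # '0', class Nd
```

The other Is* predicates (IsCombiningCharacter, IsDecimalDigitCharacter, and so on) would differ only in the category set they test against.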
This leaves the final two uses of the '<...>' notation:
Available_Identifier
// : '<An Identifier_Or_Keyword that is not a Keyword>'
: Identifier_Or_Keyword { IsNotAKeyword() }?
;
Conditional_Symbol
// : '<Any Identifier_Or_Keyword except true or false>'
: Identifier_Or_Keyword { IsNotTrueOrFalse() }?
;
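These two predicates reduce to set-membership tests over the matched token text; a minimal sketch in Python (hypothetical function names; the keyword set shown is an illustrative subset, not the full C# list):

```python
# Illustrative subset only -- the real C# keyword list is much longer.
KEYWORDS = {"abstract", "bool", "class", "false", "new", "true", "void"}

def is_not_a_keyword(text: str) -> bool:
    """Predicate for Available_Identifier: the matched
    Identifier_Or_Keyword must not be a keyword."""
    return text not in KEYWORDS

def is_not_true_or_false(text: str) -> bool:
    """Predicate for Conditional_Symbol: any Identifier_Or_Keyword
    except the literals 'true' and 'false' is allowed."""
    return text not in {"true", "false"}

assert is_not_a_keyword("myVariable")
assert not is_not_a_keyword("class")
assert is_not_true_or_false("DEBUG")
assert not is_not_true_or_false("true")
```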
Nigel and I support the use of semantic predicates here as well, which I have shown above. Sam proposed some rearrangement of the grammar instead. @sharwell, is the semantic predicate approach OK with you?
How does this all look to you, @Nigel-Ecma?
@MadsTorgersen, @sharwell, @Nigel-Ecma
There are three lexer rules that have this problem, all in the preprocessor grammar, w.r.t. Pp_Expressions involving ||, &&, ==, and !=.
On a whim, I rewrote those rules as follows, with each commented-out line being replaced by the one following it:
Pp_Or_Expression
: Pp_And_Expression
// | Pp_Or_Expression Whitespace? '||' Whitespace? Pp_And_Expression
| Whitespace? Pp_And_Expression '||' Pp_Or_Expression Whitespace?
;
Pp_And_Expression
: Pp_Equality_Expression
// | Pp_And_Expression Whitespace? '&&' Whitespace? Pp_Equality_Expression
| Whitespace? Pp_Equality_Expression '&&' Pp_And_Expression Whitespace?
;
Pp_Equality_Expression
: Pp_Unary_Expression
// | Pp_Equality_Expression Whitespace? '==' Whitespace? Pp_Unary_Expression
| Whitespace? Pp_Unary_Expression '==' Pp_Equality_Expression Whitespace?
// | Pp_Equality_Expression Whitespace? '!=' Whitespace? Pp_Unary_Expression
| Whitespace? Pp_Unary_Expression '!=' Pp_Equality_Expression Whitespace?
;
All I did was simply swap the two operands over, and it validated! But are the new forms equivalent/correct?
It changes the grouping/associativity, e.g. given a || b || c, one version parses it as (a || b) || c, the other as a || (b || c).
Does it matter in this case (obviously it can matter in some cases)?
(Sorry no answer right now, I’m distracting myself clearing email when I should be doing something else which I must get back to…)
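On the "does it matter" question: for #if expressions the operands are booleans, and ||, &&, ==, and != all happen to be associative over booleans, so either grouping yields the same value. A brute-force check (a sketch about values only, not a claim about the parse trees, which do differ):

```python
from itertools import product

# Exhaustively verify that left and right grouping agree for all
# boolean operand triples, for each operator used in Pp_Expressions.
ops = {
    "||": lambda x, y: x or y,
    "&&": lambda x, y: x and y,
    "==": lambda x, y: x == y,
    "!=": lambda x, y: x != y,
}

for name, op in ops.items():
    for a, b, c in product([False, True], repeat=3):
        left = op(op(a, b), c)   # (a OP b) OP c
        right = op(a, op(b, c))  # a OP (b OP c)
        assert left == right, f"{name} grouping differs for {a}, {b}, {c}"

# Mixed chains like a == b != c also agree under either grouping,
# since boolean == is just != with an inverted bit.
for a, b, c in product([False, True], repeat=3):
    assert ((a == b) != c) == (a == (b != c))
```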
This Issue was closed automatically when the unsafe grammar PR was merged, but we still have the left-recursive situation to deal with, so am re-opening.
Issues #259 & #260 raise questions which are dependent on this issue and investigations into the grammar it has triggered. Those questions should be considered in any resolution of this one.
fragment
Previously, it seemed that applying fragment to a lexer rule might be useful, but not necessary. However, as Nigel discovered, it really should be used on all lexer rules that exist only to help other rules. That is, any lexer rule that does not itself result in a complete token should be marked fragment. For example:
Decimal_Integer_Literal // forms a complete token; is not a helper rule
: Decimal_Digit+ Integer_Type_Suffix?
;
fragment
Decimal_Digit // helper rule; should have fragment prefix
: '0'..'9'
;
fragment
Integer_Type_Suffix // helper rule; should have fragment prefix
: 'U' | 'u' | 'L' | 'l' | 'UL' | 'Ul' | 'uL' | 'ul' | 'LU' | 'Lu' | 'lU' | 'lu'
;
As such, Nigel will add this prefix, as appropriate, as part of his efforts.
Nigel is handling this as part of his grammar overhaul PRs.
@Nigel-Ecma I'm inclined to close this Issue, unless you want to keep it open as a hook for things you are still working on.
I think closing it is fine; you could reference the various PRs now merged (interpolated strings mentioned in #37 are still in progress, but a PR will come).
Almost all of the outstanding tasks were handled by Nigel in various PRs, including #339, #342, and #351. Closing this Issue with Nigel's consent.
On 2020-10-03, I circulated the following paper: ANTLR and the C# Spec.docx. In that, I discussed using the ANTLR grammar notation and the non-trivial amount of work I thought it would take to get the grammar to validate using the ANTLR toolkit.
Neal has reviewed that paper and his feedback follows in a comment below, with a proposal on how we might go forward. Once we discuss (and probably agree with) his suggestion, I'll update this comment by adding the remaining questions and keeping track of our decisions on them.
Here are the ANTLR-related questions and their status:
fragment
When you add a comment to this issue, please limit it to a single Proposal # and say which Proposal # that is.
Any decisions we make that require changes to the grammar notation will not only have to be made to the v5 grammar we're starting with in v6, but also possibly to the v6 (and v7) PRs in waiting.