dotnet / csharpstandard

Working space for ECMA-TC49-TG2, the C# standard committee.
Creative Commons Attribution 4.0 International
717 stars 86 forks source link

ANTLR: Deciding on how far to go with this #37

Closed RexJaeschke closed 3 years ago

RexJaeschke commented 3 years ago

On 2020-10-03, I circulated the following paper: ANTLR and the C# Spec.docx. In that, I discussed using the ANTLR grammar notation and the non-trivial amount of work I thought it would take to get the grammar to validate using the ANTLR toolkit.

Neal has reviewed that paper and his feedback follows in a comment below, with a proposal on how we might go forward. Once we discuss (and probably agree with) his suggestion, I'll update this comment by adding the remaining questions and to keep track of our decisions on those questions .

Here are the ANTLR-related questions and their status:

Proposal # Status Description
1 Yes it should Decide whether our grammar should be complete enough to validate (3 Rex's paper). Note that validation here means "get through the ANTLR tool without error," which doesn't necessarily mean that any parser generated will be complete/correct. This is especially true for semantic predicates.
2 Resolved, PRs #208, #209 Change case of lexer rule names (5.1 Rex's paper)
3 Resolved, PR #261 Expanding lexer rule alternatives of the form <...> (3 Rex's paper)
4 Resolved, PR #225 Limited use of token names (5.3 Rex's paper)
5 In-progress Resolve Mutual Left-Recursion issue (15 Rex's paper)
6 Resolved, PR #233 How to handle the unsafe extensions
7 Resolved Marking lexer helper rules with fragment
8 Resolved Ultimately, revise 7.2, "Grammars" in lexical-structure.md to say what we do/don't do w.r.t full grammar validation

When you add a comment to this issue, please limit it to a single Proposal # and say which Proposal # that is.

Any decisions we make that require changes to the grammar notation will not only have to be made to the v5 grammar we're starting with in v6, but also possibly to the v6 (and v7) PRs in waiting.

RexJaeschke commented 3 years ago

[From Neal Gafter in private mail to Rex on 2020-11-18, after having reviewed Rex's paper.]

We should not attempt to produce a grammar that will validate using the ANTLR toolkit. There are some parts of the syntax that cannot be neatly explained in the grammar, and would result in a grammar that definitely does not validate (i.e. it is ambiguous). https://github.com/ECMA-TC49-TG2/conversion-to-markdown/blob/v6-draft/lexical-structure.md#725-grammar-ambiguities is a good example. Interpolated strings are another (separating the syntactic and lexical parts might make things less clear). See also https://github.com/ECMA-TC49-TG2/conversion-to-markdown/blob/v6-draft/expressions.md#1288-cast-expressions. I think we should use the formalism of ANTLR to document our syntax with as much rigor as we have historically used.

If we wanted to do this [validate, that is], we probably would use parser modes for async (different set of keywords inside an async method), LINQ (different set of keywords inside the query), and possibly interpolated strings.

As for unsafe, it can be written

class_modifier
    : 'unsafe'
    ;

because (if I'm not mistaken) you are allowed to have multiple rules for the same nonterminal, and they are merged. But the surrounding test should make it clear that this is in addition to other alternatives specified elsewhere.

We specify associativity and precedence by spelling out the structure of the expression grammar explicitly. We don't need to specify precedence or associativity using ANTLR features.

[Rex drafted the following text.]

Proposal 1: We should use ANTLR notation with common conventions. However, we should not spend any effort on getting the grammar to validate with the ANTLR toolkit.

Assuming we go this route, we'll need to add some appropriate text to 7.2, Grammars.

RexJaeschke commented 3 years ago

If we buy into Proposal 1 (https://github.com/ECMA-TC49-TG2/csharpstandard/issues/37#issuecomment-731419531), we avoid the following (the numbers in parens refer to sections in Rex's paper):

Not_Slash_Or_Asterisk
    : '<Any Unicode character except / or *>'
    ;

Letter_Character
    : '<A Unicode Character of classes Lu, Ll, Lt, Lm, Lo, or Nl>'
    | '<A Unicode_Escape_Sequence representing a Char of classes Lu, Ll, …>'
    ;
RexJaeschke commented 3 years ago

Proposal 2: Change the spelling of lexer rule names to use a leading uppercase letter on each word and an underscore as a word separator, as in Not_Slash_Or_Asterisk and Letter_Character.

The ANTLR convention is that a lexer rule name begins with an uppercase letter, and may be followed by uppercase and/or lowercase letters, digits, and underscores.

We do not currently begin lexer rule names with an uppercase letter, and we should. However, I think that using all-caps is distracting. And as we're used to having a dash separator in v5 and prior (and now underscore), it seems useful to continue having separators. The remaining question then is how to use case. Whatever the decisions made, it's a matter of running a set of mechanical conversions on the existing spec grammar fragments and rule names in the narrative to make these changes.

If we adopt this change, it will involve a lot of edits to the Lexical structure clause and a few in other clauses that refer to lexer rules. We can't simply use a find-and-replace-text tool as we have some lexer rule names that are commonly used English words in the spec (such as whitespace, comment, identifier, and keyword). So, we'll want to convert all lexer rule names inside a grammar block and between italic fences, as in rule_name, but not elsewhere.

Recommendation: We change to starting lexer rule names with an uppercase letter. As to whether we up-case the first letter in subsequent words is debatable. Here are the obvious styles: • Integer_literal // first word leading capInteger_Literal // all words leading cap [Rex's first choice]IntegerLiteral // all words leading cap, no separators

Before locking in a decision, we should look at parser rule names, and be consistent or deliberately different. Perhaps the simplest rule to follow here is “all lowercase except the initial letter of each word.” That is the style I use throughout this paper.

Resolved: During the telcon on 2021-01-13, we chose Recommendation 2, Integer_Literal, with all words having a leading uppercase letter, and words separated by underscore.

Nigel-Ecma commented 3 years ago

I would prefer if the ANTLR grammar was written so it would validate, I would have thought this was a key reason to switch to ANTLR notation. I haven’t looked into whether an ANTLR grammar that validates is possible, hence the “prefer” rather than “should”.

That said, if the grammar does not/cannot validate than I think the Standard must state this. Expecting an ANTLR grammar to validate is, at minimum, reasonable given the purpose of ANTLR; so the Standard should make it clear that ANTLR is used as a notation only – and then leave it at that as it does beg the question why the Standard uses a notation that cannot describe the language's syntax, and we probably shouldn’t raise that in the Standard text!

Note that if the limitations of ANTLR are restricted to such things as specifying “any character in Unicode class X” then that looks like a reasonable extension to ANTLR that could be defined (think of the various extended BNFs that exist, I’ve defined one myself when the need arose – though I’ve long since forgotten what that need was just that I had to write a document on the grammar notation! ;-)), but the Standard may not want to go there.

On 21/11/2020, at 10:36 am, Rex Jaeschke notifications@github.com wrote:

[From Neal Gafter in private mail to Rex on 2020-11-18, after having reviewed Rex's paper.]

We should not attempt to produce a grammar that will validate using the ANTLR toolkit. There are some parts of the syntax that cannot be neatly explained in the grammar, and would result in a grammar that definitely does not validate (i.e. it is ambiguous). https://github.com/ECMA-TC49-TG2/conversion-to-markdown/blob/v6-draft/lexical-structure.md#725-grammar-ambiguities https://github.com/ECMA-TC49-TG2/conversion-to-markdown/blob/v6-draft/lexical-structure.md#725-grammar-ambiguities is a good example. Interpolated strings are another (separating the syntactic and lexical parts might make things less clear). See also https://github.com/ECMA-TC49-TG2/conversion-to-markdown/blob/v6-draft/expressions.md#1288-cast-expressions https://github.com/ECMA-TC49-TG2/conversion-to-markdown/blob/v6-draft/expressions.md#1288-cast-expressions. I think we should use the formalism of ANTLR to document our syntax with as much rigor as we have historically used.

If we wanted to do this [validate, that is], we probably would use parser modes for async (different set of keywords inside an async method), LINQ (different set of keywords inside the query), and possibly interpolated strings.

As for unsafe, it can be written

class_modifier : 'unsafe' ; because (if I'm not mistaken) you are allowed to have multiple rules for the same nonterminal, and they are merged. But the surrounding test should make it clear that this is in addition to other alternatives specified elsewhere.

We specify associativity and precedence by spelling out the structure of the expression grammar explicitly. We don't need to specify precedence or associativity using ANTLR features.

[Rex drafted the following text.]

Proposal 1: We should use ANTLR notation with common conventions. However, we should not spend any effort on getting the grammar to validate with the ANTLR toolkit.

Assuming we go this route, we'll need to add some appropriate text to 7.2, Grammars.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/ECMA-TC49-TG2/csharpstandard/issues/37#issuecomment-731419531, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABSYVW2O2VXDE3TLU62P2BDSQ3ONTANCNFSM4T5ICDNA.

Nigel-Ecma commented 3 years ago

After the previous comment I glanced quickly at the ANTLR docs thinking it surely must handle Unicode character classes and it does, see https://github.com/antlr/antlr4/blob/master/doc/lexer-rules.md.

Nigel-Ecma commented 3 years ago

I've taken a brief look at ANTLR and I beleive the rules Rex quotes above can be written:

Not_Slash_Or_Asterisk
    : ~[*/] // Any Unicode character except / or *
    ;

Letter_Character
    : [\p{Lu}\p{Ll}\p{Lt}\p{Lm}\p{Lo}\p{Nl}] // A Unicode Character of classes Lu, Ll, Lt, Lm, Lo, or Nl
    | Unicode_Escape_Sequence { IsLetterCharacter() }? // as previous but using an escape sequence
    ;

The last rule includes a semantic predicate, { IsLetterCharacter() }? this a way context-sensitive or other semantic checks can be included in the syntactical grammar. If the function IsLetterCharacter() returns false the alternative is "unviable". Semantic predicates can be included in both parser and lexer rules. The predicates themselves can be defined in the same file as the grammar; are written in Java, C++, Javascript or any other language ANTLR supports; and have access to the source being parsed/lexed.

The above doesn't tell us that all the content-sensitive/semantic needs of the C# grammar can be supported, but I think we should find out – I have a certain unease in publishing a language Standard with a known invalid/incomplete grammar.

jskeet commented 3 years ago

I'm definitely fine with making it clear that the grammar doesn't validate. If there's no way we can make it validate, for reasons expressed before (e.g. interpolated string literals and the grammar ambiguities in 7.2.5) I would be interested to know how feasible it would be to make it "nearly validate" - is there anything we can do to ensure that we don't create invalid rules by mistake? If ANTLR can list the ways in which it's invalid, and if that's a small-but-sensible list, we could perform validation within a GitHub action and compare the result against the expected set of failures.

jskeet commented 3 years ago

Resolutions from Dec 16, 2020 meeting:

RexJaeschke commented 3 years ago

During the Dec 2020 teleconference, I took an action item to identify the ANTLR convention changes we could/should use. As a result, I have updated the first entry in this issue (https://github.com/ECMA-TC49-TG2/csharpstandard/issues/37#issue-747792336). My recommendations are shown with TBD in the table's Status column.

jskeet commented 3 years ago

@RexJaeschke: I've just removed the stale meeting labels... do we want to come back to this in the January meeting, in which case I'll reapply them?

RexJaeschke commented 3 years ago

@jskeet: I'd very much like us to decide on Proposal 2 (https://github.com/ECMA-TC49-TG2/csharpstandard/issues/37#issuecomment-731433543) on the next call. Assuming we agree to make this change, I can then get started on implementing it in the base spec as well as in the V6 PRs that are impacted..

jskeet commented 3 years ago

@RexJaeschke: Thanks - I've added the discuss label. Hopefully if we limit the scope to just that topic, it won't take much time.

jskeet commented 3 years ago

We have decided to accept proposal 2, going with the Integer_Literal style (so PascalCasing but with underscores as well).

RexJaeschke commented 3 years ago

Getting Started with the ANTLR Toolkit

For anyone who wants to play with the ANTLR toolkit, I have created the following document: ANTLR Grammar Validation.docx.

RexJaeschke commented 3 years ago

Proposal 4: Make limited use of token names

In 5.3 of my paper (mentioned at the very top of this issue), I discussed the possibility of using using some extra lexer rules, but as synonyms for individual tokens. I now argue in favor of us going that route. Specifically, by adding the following:

DEFAULT  : 'default' ;
NULL     : 'null' ;
TRUE     : 'true' ;
FALSE    : 'false' ;
ASTERISK : '*' ;
SLASH    : '/' ;

[Note the naming convention: Such lexer rules are all-caps to distinguish them from "ordinary" lexer rules, which start with a single cap on each word.]

the multiple places where the expanded literal is used are replaced with the name,. For example:

Boolean_Literal
    : TRUE
    | FALSE
    ;
Keyword
    ...
    | TRUE
    ...
    ;
Pp_Primary_Expression
    : TRUE
    | FALSE
    ...
    ;
RexJaeschke commented 3 years ago

Proposal 5: Resolving Mutual Left-Recursion issue

As I described in Item 15 of my paper, this issue is a major blocking factor both for the lexer and syntactic grammar. In the next month or so, it would be good to have someone assigned to looking at this.

RexJaeschke commented 3 years ago

Proposal 6: Handling the unsafe extensions

Currently, we define most of these extensions in unsafe-code.md using the following approach:

class_modifier
    : …
    | 'unsafe'
    ;

where that rule is defined in classes.md, as

class_modifier
    : 'new'
    | 'public'
    | 'protected'
    | 'internal'
    | 'private'
    | 'abstract'
    | 'sealed'
    | 'static'
    ;

As such, we cannot simply include in the grammar we input to the ANTLR toolkit the grammar from unsafe-code.md in a verbatim manner.

BTW, while most extensions involve the addition of an alternative line to an existing rule, several require changes to existing lines. And the extension grammar modifies rules in a number of md files, not just classes.md.

The challenge here is how to deal with this, long term, so we can extract two separate grammars, one core, the other, extended.

I propose that we merge the unsafe grammar extensions directly into the corresponding places in the grammar blocks of the early chapters

Some thoughts to consider:

Nigel-Ecma commented 3 years ago

Proposal 4: Probably ambivalent on this one myself, for punctuation it often makes sense, don't know how often true/false occur in the grammar offhand but there is no objection really to any key/reserved word getting its own lexical rule.

Proposal 5: No comment ;-) That said I'll probably look into this sometime...

Proposal 6: My first thought was "I wonder how hard hacking ANTLR is to support continuation productions..." :-)

At the moment the language is in terms of "unsafe extensions", making the proposed change might change it more to "unsafe support is optional" and if so maybe the proposed grammar presentation suggest wording changes should be considered as part of it.

C# effectively has two grammars, one with unsafe productions and one without, and that is how it is currently presented. When checking the grammar do not both grammars need to be checked independently?

At the moment we have to combine the "continuation" productions to get the grammar with optional parts, easily enough done by searching for ellipses, if we present a single grammar complete with optional parts ripping the latter out to check the grammar without optional parts might end up being more work?

On balance at the moment I'd keep the continuation productions, but its not an unchangeable view!

jskeet commented 3 years ago

Meeting 2021-02-10:

sharwell commented 3 years ago

Regarding proposal 5: my fork of ANTLR supports indirect left recursion elimination.

I need to review the other proposals above to see if I have other feedback on them.

sharwell commented 3 years ago

Notes:

Proposal 1: Seems we decided to try and validate the result. Proposal 2: Standard convention for lexer rules is PascalCase, but anything starting with uppercase will work. Proposal 3: I'd need to see more specific examples here. Proposal 4: This may have a negative impact on readability in some edge cases. If this is something we want to pursue, we should consider having a naming convention specifically for these keywords. Proposal 5: I recommend using my fork of ANTLR for validation. This is the approach we used for some private validation of the Java language a few years back. The grammar might need work before it can be used in some targets (manual refactorings, etc.), but it will make it much easier to validate the grammar against the wording in the language specification. Put another way: we don't want a limitation in a parser generator to have a negative impact on overall clarify in the language specification, so if we have a way to avoid such a limitation we should leverage it to the degree it helps with readability. Proposal 6: I agree with merging the rules back into the main chapters. The main chapters can call out portions as necessary using a statement that support for the form is optional. The grammar will inevitably be more relaxed than the language semantics, and this is often a desirable characteristic (e.g. for unsafe, the grammar would always parse the keyword and expect a validation pass after parsing to report an error if the keyword is not supported by that compiler).

RexJaeschke commented 3 years ago

Proposal 6: Handling the unsafe extensions

@MadsTorgersen , @sharwell , @Nigel-Ecma please review this.

Based on our decision in https://github.com/ECMA-TC49-TG2/csharpstandard/issues/37#issuecomment-777008557, here's how I understand we want the unsafe support to be spec'd.

Stand-Alone Unsafe Modifiers

§15.2.2 Class modifiers|15.2.2.1 General

Add to the end of the following grammar rule a forward reference (with comment) to the unsafe modifier rule defined in Clause 23:

class_modifier
    : 'new'
    | 'public'
    | 'protected'
    | 'internal'
    | 'private'
    | 'abstract'
    | 'sealed'
    | 'static'
    | unsafe_modifier   // unsafe support
    ;

After this rule, add the following:

The rule unsafe_modifier is defined in §23.2.

Do likewise for struct_modifier, interface_modifier, delegate_modifier, field_modifier, method_modifier, property_modifier, event_modifier, indexer_modifier, operator_modifier, and constructor_modifier, in their respective clauses.

In each of those clauses, review any forward mention of the unsafe clause in narrative to see if it belongs there or in Clause 23 instead. (§15 Classes has no such references, which is good.)

§23.2 Unsafe contexts

Replace the following:

The associated grammar extensions are shown below. For brevity, ellipses (...) are used to represent productions that appear in preceding clauses.

class_modifier
    : ...
    | 'unsafe'
    ;
...
constructor_modifier
    : ...
    | 'unsafe'
    ;

with

The associated grammar extensions are shown below and in subsequent subclauses.

unsafe_modifier
    : 'unsafe'
    ;

Embedded Unsafe Modifiers

§15.13 Finalizers

Change

finalizer_declaration
    : attributes? 'extern'? '~' Identifier '(' ')' finalizer_body
    ;

to

finalizer_declaration
    : attributes? 'extern'? unsafe_modifier? '~' identifier '(' ')' finalizer_body
    | attributes? unsafe_modifier? 'extern'? '~' identifier '(' ')' finalizer_body
    ;

After this rule, add the following:

The rule unsafe_modifier is defined in §23.2.

Do likewise for static_constructor_modifiers.

§23.2 Unsafe contexts

Delete the extended rules finalizer_declaration and static_constructor_modifiers, as these have been moved to 15.13.

Other Grammar Changes

Merge the following rules from Clause 23 into their corresponding chapters, adding a forward pointer to Clause 23, and remove those extended rules from Clause 23:

  1. Extend the rule embedded_statement to include unsafe_statement and fixed_statement.
  2. Add pointer_type to type and non-array_type.
  3. Add pointer_member_access and pointer_element_access to primary_no_array_creation_expression.
  4. Add pointer_indirection_expression and addressof_expression to unary_expression.
  5. Add fixed_size_buffer_declaration to struct_member_declaration.
  6. Add stackalloc_initializer to local_variable_initializer.

Rule unmanaged_type

When I created Issue https://github.com/ECMA-TC49-TG2/csharpstandard/issues/211, I was confused by the need to refer to the unsafe clause to understand sizeof's operand, unmanaged_type. That need seemed out-of-place to me then, as it still does now. After having studied §23.3, "Pointer types," at length, I am no longer confused!

I propose that the definition of unmanaged_type, its description, and its contraints on which types are unmanaged_types, be moved to the Types clause, and that all references to that rule be changed to point there.

Here's my rationale. The concept of an unmanaged type does not depend on the existence of unsafe code; it's part of the core language, so it shouldn't be introduced in Clause 23. Rather, in an unsafe context, the ability to have pointers simply extends the set of unmanaged types already allowed in the core language.

Here is the restructuring needed to achieve this:

Move the (following) grammar for unmanaged_type from unsafe-code.md to types.md:

unmanaged_type
    : type
    ;

Move the (following) description and constraints from unsafe-code.md to types.md, making the inclusion of the "Any pointer_type bullet conditional, as noted (and moving it to the end of the list):

An unmanaged_type is any type that isn't a reference_type, a type_parameter, or a constructed type, and contains no fields whose type is not an unmanaged_type. In other words, an unmanaged_type is one of the following:

  • sbyte, byte, short, ushort, int, uint, long, ulong, char, float, double, decimal, or bool.
  • Any enum_type.
  • Any user-defined struct_type that is not a constructed type and contains fields of unmanaged_types only.
  • In unsafe code (§23.2), any pointer_type (§23.3).

I see two potential homes in types.md for the destination of both these moves: at the end of §9 Types|9.1 General, or in a new subclause, §9.x, "Unmanaged types," at the very end of the clause. I propose the latter, as its contents would be entirely about that topic and a better destination for all redirected references to that term rather than a reference to 9.1, which has other (and unrelated) stuff as well.

Stating Implementation Requirements re Support for Unsafe Code

In §6, "Conformance", we state,

Conditionally normative text specifies a feature and its requirements where the feature is optional. However, if that feature is provided, its syntax and semantics shall be exactly as specified.

However, the only conditionally normative feature in the spec is the unsafe code clause. Of course, we might add other conditionally normative features in the future, so it's probably best we state implementer expectations re unsafe in the unsafe code clause.

In §23 Unsafe code|23.1 General, I propose we replace

An implementation that does not support unsafe code is required to diagnose any usage of the keyword unsafe.

with

An implementation that does not support unsafe code is required to diagnose any usage of the syntactic rules defined in this clause.

After all, it's not just the unsafe keyword, it's the extensions to types, expressions, statements, and such, as covered later in Clause 23.

Conclusion

The edits proposed above have minimal impact on the spec: they are small and localized, and except for references to unmanaged_type, all existing references to Clause 23 stay the same. The edits also take care of my concerns in Issue 211.

One final tweak needed is that §12.7.11.7, "Anonymous object creation expressions," states,

The expression used in a member-declarator shall have a type. Thus, it is a compile-time error for an expression in a member-declarator to be null or an anonymous function. It is also a compile-time error for the expression to have an [[[unsafe type]]].

As we have no definition for "unsafe type," that will need to be changed. Perhaps it means it can't have a pointer type. Any thoughts?

RexJaeschke commented 3 years ago

Proposal 4: What was NOT changed

PR #225 defines the following named lexer tokens, as proposed, and then references those names elsewhere in the lexer grammar, as needed:

DEFAULT  : 'default' ;
NULL     : 'null' ;
TRUE     : 'true' ;
FALSE    : 'false' ;
ASTERISK : '*' ;
SLASH    : '/' ;

As it happens, 5 of these 6 (underlying) tokens are also used as terminals in the syntactic grammar, as follows:

As the ANTLR validator does not complain/warn about such duplicate terminals (like it does with those in the lexical grammar), and to avoid "polluting" the narrative of the document with the all-uppercase names, I did not replace any of these terminals with the newly defined uppercase names. Holler if you disagree with that.

RexJaeschke commented 3 years ago

@sharwell, you wrote, "Regarding proposal 5: my fork of ANTLR supports indirect left recursion elimination." From where can I get this version? I've found various places that talk about this fork, but none that shows me how to download/install it.

BillWagner commented 3 years ago

@sharwell

Regarding proposal 5: my fork of ANTLR supports indirect left recursion elimination.

Where can I easily get the jar file for your optimized fork? I'm happy to plug that in. Here's the code we use to run the validator now

RexJaeschke commented 3 years ago

Proposal 3: Dealing with <…>-style Lexer Rule Alternatives

@sharwell , @Nigel-Ecma please review this.

(See Nigel’s comment re this at https://github.com/ECMA-TC49-TG2/csharpstandard/issues/37#issuecomment-731862227.) Also see Issue https://github.com/ECMA-TC49-TG2/csharpstandard/issues/230.

Item 1: Easy Fixes

We have about a dozen occurrences of alternatives consisting of one, two, or “any but one” character. The following recommendations show the current syntax followed by the proposed syntax:

Recommendation 1A: For printable characters, use that character, possibly with a comment if multiple glyphs might look similar, as follows:

'<_ the underscore character (U+005F)>'
'_'    // the underscore character (U+005F)

Recommendation 1B: For non-printable characters having an ANTLR escape, optionally with comment, use that escape, as follows:

'<Line feed character (U+000A)>'
'\n'    // Line feed

Recommendation 1C: For all other non-printable characters, use a Unicode sequence with comment, as follows:

'<Paragraph separator character (U+2029)>'
'\u2029'    // Paragraph separator

BTW, ANTLR has extended its support for Unicode escapes; \uXXXX has been augmented by \u{XXXXXX}.

Recommendation 1D: For “any but” characters, use an ANTLR set, possibly with comment, as follows:

'<any character except ">'
~["]    // any character except "

Item 2: Simple but Inelegant Fixes

A number of alternatives reference other rules, including New_Line_Character, SLASH, or ASTERISK. For example, in the case of

'<Any Unicode character except a New_Line_Character>'

the ideal replacement would seem to be

~New_Line_Character

Unfortunately, the operand of the not-operator (~) is considered to be a sub-rule, and sub rules are not permitted to contain rule names (such as New_Line_Character), so our only recourse is to use the expanded rule.

Recommendation 2: Use one of the following approaches instead:

~[\r\n\u0085\u2028\u2029]   // anything but New_Line_Character
~('\r' | '\n' | '\u0085' | '\u2028' | '\u2029')   // anything but New_Line_Character

Item 3: Unicode Characters in Identifiers

A number of alternatives refer to Unicode class rules. (As reported by Nigel) ANTLR now has direct support for these. The following recommendations show the current syntax followed by the proposed syntax:

'<Any character with Unicode class Zs>'
[\p{Zs}]   // Any character with Unicode class Zs

Zs is the short name for the long name Space_Separator (which may be used instead).

Note: A \p{…} sequence can only appear in a set, even though this instance involves a set of 1!

FYI, uppercase \P negates the selection.

Similarly,

'<A Unicode character of classes Mn or Mc>'
[\p{Mn}\p{Mc}]   // category Mark, subcategories non-spacing and spacing combining

The following introduces general categories, which are groupings of related subcategories; the final alternative is the shortest:

'<A Unicode character of classes Lu, Ll, Lt, Lm, Lo, or Nl>'
[\p{Lu}\p{Ll}\p{Lt}\p{Lm}\p{Lo}\p{Nl}]   // …
[\p{L}\p{Nl}]   // category Letter, all subcategories; Category Number, subcategory letter

Recommendation 3: Use the shortest possible set of \p{…} sequences with Unicode short names, possibly with a comment, as shown above.

Item 4: Identifiers Containing Unicode_Escape_Sequence

Here, we’re dealing with the subtle difference between things like '\\u005F' and '\u005F', the first being a 6-character string, the second being a single character.

Let’s begin with a simple example of the general problem; how can we rewrite the following?:

Underscore_Character
    …
    | '<A Unicode_Escape_Sequence representing the character U+005F>'
;

A rewrite is

Underscore_Character
    …
    | '\\u005F' // underscore string
    | '\\u005f' // underscore string
    ;

We need to allow for either case for the hex digits a-f. Of course, if an allowed letter, decimal digit, or separator contains multiple hex digits a-f, we’d need four or more alternatives, which gets unwieldy. Fortunately, the Underscore_Character rule is only used in one place, by Identifier_Start_Character. (Underscores in other positions are handled as part of Connecting_Character.)

Recommendation 4A: Show both spellings of the hex sequence for this case.

Here is the more general case: How to deal with the following?

'<A Unicode_Escape_Sequence representing a character of classes Lu, Ll, Lt, Lm, Lo, or Nl>'

To address this, Nigel proposed using the following

Letter_Character
    …
    | Unicode_Escape_Sequence { IsLetterCharacter() }? // …
    ;

The last rule includes a semantic predicate, { IsLetterCharacter() }? this a way context-sensitive or other semantic checks can be included in the syntactical grammar. If the function IsLetterCharacter() returns false the alternative is "unviable". Semantic predicates can be included in both parser and lexer rules. The predicates themselves can be defined in the same file as the grammar; are written in Java, C++, Javascript or any other language ANTLR supports; and have access to the source being parsed/lexed.

After a brief read about semantic predicates, and some testing, I found that Nigel’s approach validates.

This approach would also need corresponding semantic predicate alternatives, like the following:

Combining_Character
    : …
    | Unicode_Escape_Sequence { IsCombiningCharacter() }? 
    ;

Decimal_Digit_Character
    : …
    | Unicode_Escape_Sequence { IsDecimalDigitCharacter() }? 
    ;

Connecting_Character
    : …
    | Unicode_Escape_Sequence { IsConnectingCharacter() }? 
    ;

Formatting_Character
    : …
    | Unicode_Escape_Sequence { IsFormattingCharacter() }? 
    ;

While this approach validates, we’d probably have to add a general note up front re grammars as well as one in the identifier lexer rule section, saying what these things are. Although we don’t have a goal of actually using ANTLR to create a lexer or parser, for testing purposes, I do find it useful to do so, at least on a subset of the language. However, to do that, I’d have to add (at least skeletal) definitions for these methods.

I see two ways to go here:

  1. Leave it as is (and not explain how one might deal with this in an ANTLR-generated lexer, but up-front do explain the use of <...>, in general)
  2. Add semantic predicates and accompanying notes as discussed above

Recommendation 4B: What approach do we want?

Item 5: Other Challenges

We have two alternatives that cannot (easily) be expressed in a grammar:

'<An Identifier_Or_Keyword that is not a Keyword>'
'<Any Identifier_Or_Keyword except true or false>'

My first exposure to grammars was with the C Standard, in which each section having grammar puts that in a Syntax subsection, which is optionally followed by a Constraints subsection. So, I think of each exception in the examples above as being a constraint.

I see three ways to go here:

  1. Leave it as is
  2. Simplify the alternative and state the constraint in the accompanying comment
  3. Simplify the alternative and state the constraint in text below the grammar

These are illustrated below, in order:

'<Any Identifier_Or_Keyword except true or false>'
Identifier_Or_Keyword   // But NOT true or false
Identifier_Or_Keyword   // see constraint below

Constraint: The Identifier_Or_Keyword in Conditional_Symbol shall not be true or false.

Recommendation 5: What approach do we want?

Nigel-Ecma commented 3 years ago

Following comments on #225 & #230:

Recommendation 1B: skip using ANTLR escapes and go straight to Unicode

Recommendation 1D: allow sets or alternates as in Recommendation 2, the subjective choice based on readability

Recommendation 2: drop the ANTLR escapes but otherwise allow either

Recommendation 3: Yes use the short names, I think they are in more common usage

Recommendation 4A: Either both spellings or follow 4B and use semantic predicates – choose subjectively

Recommendation 4B: Semantic predicates, so we have a grammar which works. We can provide C# code for the predicates as I understand ANTLR can output C# (I haven't confirmed that and was wondering whether adding Java code for them would be a good look or not ;-))

Recommendation 5: I'd say semantic predicates again. Wherever we use a semantic predicate we can name it and/or add a comment to indicate its purpose - { ButNotTrueOrFalse() }? or { IsIdentifierOrKeywordOtherThanTrueAndFalse() }? maybe

jskeet commented 3 years ago

(Note to myself as much as anyone else.) Presumably in 4A we could use [fF] to avoid show two entirely separate options?

Other than that note, I agree with Nigel. Presumably for recommendation 5, we really need to use something more than textual constraints if we want the grammar to validate correctly?

sharwell commented 3 years ago

Where can I easily get the jar file for your optimized fork? I'm happy to plug that in.

It's on Maven Central here: https://search.maven.org/artifact/com.tunnelvisionlabs/antlr4/4.9.0/jar

It looks like the direct download link is: https://repo1.maven.org/maven2/com/tunnelvisionlabs/antlr4/4.9.0/antlr4-4.9.0-complete.jar

sharwell commented 3 years ago

Item 1-3 seem fine for me.

4A: '\\u005F' is easy, but there is a combinatorics problem for each additional letter. I recommend using '\\u005' [fF] instead.

4B: A semantic predicate is the least ugly option here.

5: At this point we're talking more about parser rules than lexer rules.

An Identifier_Or_Keyword that is not a Keyword

I would recommend rewriting this as:

availableIdentifier
  : Identifier
  | contextualKeyword
  ;

contextualKeyword
  : // itemized list of all keyword-like tokens that are not keywords
  ;

Any Identifier_Or_Keyword except true or false

Handling this depends on whether we treat pre-processing as part of the normal lexer pass or as a separate pass before the lexer. If it's all one pass, we could use this:

conditionalKeyword
  : 'true' | 'false'
  ;

keyword
  : conditionalKeyword
  | // all the keywords except true and false
  ;
RexJaeschke commented 3 years ago

Re Proposal 6 (unsafe grammar), on this week's call we agreed that "unsafe type" should be "pointer" type. With that change, Rex's proposal was accepted, and he'll turn that into a PR.

RexJaeschke commented 3 years ago

Regarding Proposal 3, "<...>" (https://github.com/dotnet/csharpstandard/issues/37#issuecomment-789755202), I've taken into account input from Nigel (https://github.com/dotnet/csharpstandard/issues/37#issuecomment-793154408) and Sam (https://github.com/dotnet/csharpstandard/issues/37#issuecomment-796165854).

The handling of the Unicode sequences in identifiers results in the following grammar, with current alternatives to be replaced commented out and followed by their replacement(s):

Letter_Character
//    : '<A Unicode character of classes Lu, Ll, Lt, Lm, Lo, or Nl>'
//    | '<A Unicode_Escape_Sequence representing a character of classes Lu, Ll, Lt, Lm, Lo, or Nl>'

    : [\p{L}\p{Nl}]     // category Letter, all subcategories; category Number, subcategory letter
    | Unicode_Escape_Sequence { IsLetterCharacter() }?
    ;

Combining_Character
//    : '<A Unicode character of classes Mn or Mc>'
//    | '<A Unicode_Escape_Sequence representing a character of classes Mn or Mc>'

    : [\p{Mn}\p{Mc}]    // category Mark, subcategories non-spacing and spacing combining
    | Unicode_Escape_Sequence { IsCombiningCharacter() }?
    ;

Decimal_Digit_Character
//    : '<A Unicode character of the class Nd>'
//    | '<A Unicode_Escape_Sequence representing a character of the class Nd>'

    : [\p{Nd}]      // category Number, subcategory decimal digit
    | Unicode_Escape_Sequence { IsDecimalDigitCharacter() }?
    ;

Connecting_Character
//    : '<A Unicode character of the class Pc>'
//    | '<A Unicode_Escape_Sequence representing a character of the class Pc>'

    : [\p{Pc}]      // category Punctuation, subcategory connector
    | Unicode_Escape_Sequence { IsConnectingCharacter() }?
    ;

Formatting_Character
//    : '<A Unicode character of the class Cf>'
//    | '<A Unicode_Escape_Sequence representing a character of the class Cf>'

    : [\p{Cf}]      // category Other, subcategory format
    | Unicode_Escape_Sequence { IsFormattingCharacter() }?
    ;

While the definitions of the Is* predicate methods need to be in the grammar file input to ANTLR when validation occurs (as part of Bill's actions), I see a way to smuggle them in without having to include them in the spec.

This leaves the final two uses of the `'<...>' notation

Available_Identifier
//    : '<An Identifier_Or_Keyword that is not a Keyword>'

    : Identifier_Or_Keyword { IsNotAKeyword() }?
    ;

Conditional_Symbol
//    : '<Any Identifier_Or_Keyword except true or false>'

    : Identifier_Or_Keyword { IsNotTrueOrFalse() }?
    ;

Nigel and I support the use of semantic predicates here as well, which I have shown above. Sam proposed some rearrangement of the grammar instead. @sharwell, is the semantic predicate approach OK with you?

How does this all look to you, @Nigel-Ecma?

RexJaeschke commented 3 years ago

Proposal 5: Resolving Mutual Left-Recursion issue

@MadsTorgersen, @sharwell, @Nigel-Ecma

There are three lexer rules that have this problem, all in the preprocessor grammar, w.r.t Pp_Expressions involving ||, &&, ==, and !=.

On a whim, I rewrote those rules, as follows, with each commented-out line being replaced by the one following, as follows:

Pp_Or_Expression
    : Pp_And_Expression

//  | Pp_Or_Expression Whitespace? '||' Whitespace? Pp_And_Expression
    | Whitespace? Pp_And_Expression '||' Pp_Or_Expression Whitespace?
    ;

Pp_And_Expression
    : Pp_Equality_Expression

//  | Pp_And_Expression Whitespace? '&&' Whitespace? Pp_Equality_Expression
    | Whitespace? Pp_Equality_Expression '&&' Pp_And_Expression Whitespace?
    ;

Pp_Equality_Expression
    : Pp_Unary_Expression

//  | Pp_Equality_Expression Whitespace? '==' Whitespace? Pp_Unary_Expression
    | Whitespace? Pp_Unary_Expression '==' Pp_Equality_Expression Whitespace? 

//  | Pp_Equality_Expression Whitespace? '!=' Whitespace? Pp_Unary_Expression
    | Whitespace? Pp_Unary_Expression '!=' Pp_Equality_Expression Whitespace?  
    ;

All I did was simply swap the two operands over, and it validated! But are the new forms equivalent/correct?

Nigel-Ecma commented 3 years ago

All I did was simply swap the two operands over, and it validated! But are the new forms equivalent/correct?

It changes the grouping/associativity, e.g. given a || b || c one version parses it is (a || b) || c the other as a || (b || c)

Does it matter in this case (obviously it can matter in some cases)?

(Sorry no answer right now, I’m distracting myself clearing email when I should be doing something else which I must get back to…)

On 16/04/2021, at 3:05 am, Rex Jaeschke @.***> wrote:

Proposal 5: Resolving Mutual Left-Recursion issue

@MadsTorgersen https://github.com/MadsTorgersen, @sharwell https://github.com/sharwell, @Nigel-Ecma https://github.com/Nigel-Ecma There are three lexer rules that have this problem, all in the preprocessor grammar, w.r.t Pp_Expressions involving ||, &&, ==, and !=.

On a whim, I rewrote those rules, as follows, with each commented-out line being replaced by the one following, as follows:

Pp_Or_Expression : Pp_And_Expression

// | Pp_Or_Expression Whitespace? '||' Whitespace? Pp_And_Expression | Whitespace? Pp_And_Expression '||' Pp_Or_Expression Whitespace? ;

Pp_And_Expression : Pp_Equality_Expression

// | Pp_And_Expression Whitespace? '&&' Whitespace? Pp_Equality_Expression | Whitespace? Pp_Equality_Expression '&&' Pp_And_Expression Whitespace? ;

Pp_Equality_Expression : Pp_Unary_Expression

// | Pp_Equality_Expression Whitespace? '==' Whitespace? Pp_Unary_Expression | Whitespace? Pp_Unary_Expression '==' Pp_Equality_Expression Whitespace?

// | Pp_Equality_Expression Whitespace? '!=' Whitespace? Pp_Unary_Expression | Whitespace? Pp_Unary_Expression '!=' Pp_Equality_Expression Whitespace?
; All I did was simply swap the two operands over, and it validated! But are the new forms equivalent/correct?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/dotnet/csharpstandard/issues/37#issuecomment-820502636, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABSYVWYGM4PWKLW2OA4YHG3TI36DPANCNFSM4T5ICDNA.

RexJaeschke commented 3 years ago

This Issue was closed automatically when the unsafe grammar PR was merged, but we still have the left-recursive situation to deal with, so am re-opening.

Nigel-Ecma commented 3 years ago

Issues #259 & #260 raise questions which are dependent on this issue and investigations into the grammar it has triggered. Those questions should be considered in any resolution of this one.

RexJaeschke commented 3 years ago

Proposal 7: Marking lexer helper rules with fragment

Previously, it seemed that applying fragment to a lexer rule might be useful, but not necessary. However, as Nigel discovered, it really should be used on all lexer rules that help other rules. That is, any lexer rule that does not result in a complete token should be marked fragment. For example:

Decimal_Integer_Literal     // forms a complete token; is not a helper rule
    : Decimal_Digit+ Integer_Type_Suffix?
    ;

fragment
Decimal_Digit         // helper rule; should have fragment prefix
    : '0'..'9'
    ;

fragment
Integer_Type_Suffix         // helper rule; should have fragment prefix
    : 'U' | 'u' | 'L' | 'l' | 'UL' | 'Ul' | 'uL' | 'ul' | 'LU' | 'Lu' | 'lU' | 'lu'
    ; 

As such, Nigel will add this prefix, as appropriate, as part of his efforts.

RexJaeschke commented 3 years ago

Proposal 8: Updating text in lexical-structure.md to state our intent w.r.t using ANTLR notation vs. having an ANTLR-ready grammar.

Nigel is handling this as part of his grammar overhaul PRs.

RexJaeschke commented 3 years ago

@Nigel-Ecma I'm inclined to close this Issue, unless you want to keep it open as a hook for things you are still working on.

Nigel-Ecma commented 3 years ago

I think closing it is fine, you could reference the various PR’s now merged (interpolated strings mentioned in #37 are still in progress but a PR will come)

RexJaeschke commented 3 years ago

Almost all of the outstanding tasks were handled by Nigel in various PRs, including #339, #342, and #351. Closing this Issue with Nigel's consent.