Have ANTLR4 prevent conflict with user rule names by behind-the-scenes renaming of its own variables

SimonSntPeter commented 8 years ago

I have a grammar with rules named xif and xcontinue and others, which were renamed from if/continue/etc due to antlr creating variables with these names and clashing with java. It would be more polished if antlr distinguished its identifiers from any possible back-end reserved words somehow. A consistent prefixing scheme might suffice, preventing clashes but keep a human-comprehensible correspondence between the parser and the generated back-end code (perhaps have the prefix 'ANTLR4_' as a reserved prefix for antlr itself?).

sharwell commented 8 years ago

I like the idea of renaming identifiers which conflict in the target language rather than reporting an error.

Each target would be responsible for renaming identifiers according to the rules/limitations of that language
The only identifiers which could not be used for rule names are reserved words in the grammar language itself

The C# target actually already does this in the code generation template by changing identifiers like if to @if.

KvanTTT commented 8 years ago

Maybe it's better to restrict rule names from common "bad-word" set? (Union of sets: Java, C#, JavaScript etc.). Because of @if identifier anyway bad style of coding. P.S. Also there is a bug: default keyword is not being escaped after code generation for C# runtime (I guess this is a special word for StringTemplate library).

sharwell commented 8 years ago

Maybe it's better to restrict rule names from common "bad-word" set?

I would prefer to not do this, since we end up in a bad position in the future:

If a new language target is added, we are forced to either modify the ANTLR language to accommodate it (which would break existing grammars) or implement the original suggestion of renaming identifiers anyway.
If a supported language changes, we end up in the same position.
Some languages are not case sensitive. Supporting these languages would cause even more confusion since some reserved words in ANTLR would be case sensitive while others would not.

In addition, users targeting a specific language should not need to be aware of the syntax of other languages unrelated to their work.

parrt commented 7 years ago

Current targets already check for "bad words". e.g.,

    protected static final String[] javaKeywords = {
        "abstract", "assert", "boolean", "break", "byte", "case", "catch",
        "char", "class", "const", "continue", "default", "do", "double", "else",
        "enum", "extends", "false", "final", "finally", "float", "for", "goto",
        "if", "implements", "import", "instanceof", "int", "interface",
        "long", "native", "new", "null", "package", "private", "protected",
        "public", "return", "short", "static", "strictfp", "super", "switch",
        "synchronized", "this", "throw", "throws", "transient", "true", "try",
        "void", "volatile", "while"
    };

    /** Avoid grammar symbols in this set to prevent conflicts in gen'd code. */
    protected final Set<String> badWords = new HashSet<String>();

You should get a warning. E.g.,

error(134): T.g4:3:0: symbol if conflicts with generated code in target language or runtime

KvanTTT commented 6 years ago

Linked: https://github.com/antlr/antlr4/issues/1851

KvanTTT commented 2 years ago

@parrt I agree with @sharwell message. Moreover, I've approved a lot of problems with symbol conflicts in grammars-v4, you can find them by symbol-conflic label (already more than 50 and some of the issues are not marked). Grammar developers are forced to make a lot of useless work on identifiers renaming (dynamic -> dynamic_). @kaby76, @studentmain can you approve my words?

I suggest reopening the issue and the following fixes:

Add getEscapePrefix() and getEscapeSuffix() methods to Target that is used for keywords escaping. By default, getEscapePrefix() returns _, getEscapeSuffix() returns empty string. But they can be overriden in concrete classes. For instace, for C# getEscapePrefix() returns @, getEscapeSuffix() returns empty string (C# supports keywords escaping using @: @for). For Swift getEscapePrefix() and getEscapeSuffix() returns ` (see Swift doc).
Use these methods for identifiers normalization before passing them to StringTemplate processor. Use normalization only for words from badWords set.
Deprecate USE_OF_BAD_WORD error since it won't be longer reported.
Clean up the grammars repository: restore conflicting keywords to the previous names, e.g. dynamic_ -> dynamic.

It doesn't look like a big change to the core but it simplifies the development of universal grammars a lot. Also, it improves the clarity of grammars and breaks the dependency on target runtime. Also, I think grammars-v4 is very important for the ANTLR community, and its grammars are widely used.

kaby76 commented 2 years ago

@KvanTTT Yes, that looks like a good solution. Finding and fixing all these symbol conflicts was ridiculous busy work. It was alleviated to some extent after I wrote trrename and applied it to create one gigantic symbol renaming in 183 files. When/if this is fixed, we can rename everything back to a time before all this terrible renaming started, again using trrename (it works across imported grammars and between the lexer and parser, but it does not yet take regular expressions like sed). I know people were not happy with the renaming that I did. I assume getSuffix() would always be appended to a generated name regardless of conflict or not.

ericvergnaud commented 2 years ago

Re 1 I'd suggest a much simpler approach, where we simply add _rule to every parser rule name and _TOKEN to every lexer token name. Please note that this change will have significant impact on users, since they will have to refactor their code accordingly.

KvanTTT commented 2 years ago

I assume getSuffix() would always be appended to a generated name regardless of conflict or not.

I suggest adding prefix or suffix to only words from existing badWords. Otherwise, it requires big refactoring on the user side and looks not such clear.

Re 1 I'd suggest a much simpler approach, where we simply add _rule to every parser rule name and _TOKEN to every lexer token name. Please note that this change will have significant impact on users, since they will have to refactor their code accordingly.

Yes, this change has a significant impact on user code, also suffixes everywhere look a bit redundant. Also, I think it's not much simpler than my suggestion and not so natural (if runtime supports reserved word escaping it should be used).

KvanTTT commented 2 years ago

I renamed getSuffix -> getEscapeSuffix, getPrefix -> getEscapePrefix since it looks like more reasonable.

kaby76 commented 2 years ago

I assume getSuffix() would always be appended to a generated name regardless of conflict or not.

I suggest adding prefix or suffix to only words from existing badWords. Otherwise, it requires big refactoring on the user side and looks not such clear.

Are you planning to parameterize the "bad word list"? Unfortunately, we're still finding them. A few weeks ago I made a PR to rename "emptyStatement" for the Go target of java/java8, javascript/ecmascript, .... If you only correct what we know, then we're back where we started, i.e., we're going to still have to manually rename some symbol in grammars-v4 once we discover it. We still haven't ported quite a few grammars in grammars-v4 to Go and even less for the Cpp target. I only created CI for Cpp in grammars-v4 in the last week, and most of the grammars are skipped. I have absolutely no clue what will happen when I get around to Swift.

It would be good if I can adjust the renaming (per symbol, or across a symbol class--TOKEN_REF vs RULE_REF) when I make the port rather than wait for a new version of Antlr. I'm now wondering if this renaming by the Antlr tool is the right place for it. @studentmain and I have been bantering around for many months the idea of a "preprocessor" for the Antlr grammar in grammars-v4. It could rename symbol based on some parameterized rule.

KvanTTT commented 2 years ago

That assumes of course that we've found all the bad words for a target. Unfortunately, we're still finding them.

Yes, but I am still afraid of huge changes in user code if use your strategy. Also, I hope one moment in the future we'll find almost all of such bad words, and maybe ANTLR will be updated more frequently. For now, I suggest using escaping for only bad words by default and maybe an option that turns on escaping for all words (boolean --escaping or string --escapingSuffix?). The final decision is up to @parrt (It looks like I'm able to implement everything).

ericvergnaud commented 2 years ago

Hi,

not sure I agree with the proposed change.

I think rules should have the same name across all targets and all grammar names, such that it’s easier to locate rules within and across targets without having to remember per target implementation details. As an example, rule ’this’ should be named ’this_rule’ regardless of the target, otherwise it will be named ’this_rule’ in most targets, but @.***' in C# and ’this’ in Python… simply horrible. Every rule context is named xxxContext and we should follow a similar strategy for naming rules.

I agree that making this an option could reduce the immediate noise, but it would defeat the purpose i.e. help newbies avoid name collision issues.

End of the day, this might not be the right approach.

Eric

Le 30 déc. 2021 à 17:57, Ivan Kochurkin @.***> a écrit :

That assumes of course that we've found all the bad words for a target. Unfortunately, we're still finding them.

Yes, but I am still afraid of huge changes in user code if use your strategy. Also, I hope one moment in the future we'll find almost all of such bad words, and maybe ANTLR will be updated more frequently. For now, I suggest using escaping for only bad words by default and maybe an option that turns on escaping for all words (boolean --escaping or string --escapingSuffix?). The final decision is up to @parrt https://github.com/parrt (It looks like I'm able to implement everything).

— Reply to this email directly, view it on GitHub https://github.com/antlr/antlr4/issues/1070#issuecomment-1003108196, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAZNQJGOX5UNYP4O4NO4HKTUTSFQXANCNFSM4BW3ON4A. You are receiving this because you commented.

kaby76 commented 2 years ago

Yes, but I am still afraid of huge changes in user code if use your strategy.

Actually, what I'm arguing for is to allow the user to input the bad word list (e.g., "emptyStatement" => "emptyStatement_") or global renaming scheme ("TOKEN_REF" => "TOKENREF + ''") to the Antlr tool. People who use a grammar X for target Y will discover the symbol conflicts, and they can adjust the renaming to what they want, rather than modify grammar X and check that into grammars-v4. People who use grammar X for target W won't be affected since grammar X is constant and they presumably had their own discovery of symbol conflicts for target W. You can then release Antlr with a default list for target Y, W, ..., but be able to overwrite that list when a grammar is ported and not have to change the grammar, and not wait for another release of Antlr.

Otherwise, I can just run another tool, like trrename or @studentmain 's translator, prior to running CI for grammars-v4 and adjust that list rather than keep making PRs to change grammars for symbol conflict.

KvanTTT commented 2 years ago

I think rules should have the same name across all targets and all grammar names, such that it’s easier to locate rules within and across targets without having to remember per target implementation details.

Users ordinary use single target as I understand. But grammar developers write grammars for all runtimes. Thus per target implementation details is mostly their responsibility.

As an example, rule ’this’ should be named ’this_rule’ regardless of the target, otherwise it will be named ’this_rule’ in most targets, but @.***' in C# and ’this’ in Python… simply horrible.

I don't think it's a bit problem because code completion works well. In C# it will suggest @this as well it will suggest this_ in Python. Moreover _rule looks natural for Python runtime where snake_case is used, but it's ugly for C# and Java because they use CamelCase. Anyway, generated code differs across runtimes because of different conventions.

For your case, maybe it makes sense to add something like appendIdPart method to Target and override it in concrete classes (it appends _rule in Python, Rule in C#, etc.).

I agree that making this an option could reduce the immediate noise, but it would defeat the purpose i.e. help newbies avoid name collision issues.

It's a very big immediate noise covering all ANTLR runtime users that ideally should be prevented.

KvanTTT commented 2 years ago

If @ericvergnaud solution is accepted, I suggest fixing the following issue #1615 together since all runtime identifiers will be broken anyway. The mentioned issue has 13 upvotes thus it's important.

SimonSntPeter commented 2 years ago

Looking at #1615, my tuppence-worth: does it matter? I embed actions into the grammar. Very minimal ones, to build the AST, crash and burn on syntax errors, and pretty well nothing else eg.

fkjoins_branch returns [FKJoinsBranch fkjb] : SR fkjac = fkjoins_arrow_chain fkjbq = opt_fkjoins_branches ER { var fkjac = $fkjac.fkjac; var fkjbq = $fkjbq.fkjbq; var res = new FKJoinsBranch(fkjac, fkjbq);

$fkjb = res; } ;

I don't see the generated code unless something goes wrong, and the case style is irrelevant IMO.

Is this substantially a different situation when doing listeners/visitors, such that case naming starts to be of concern?

cheers

jan

On 30/12/2021 17:58, Ivan Kochurkin wrote:

If @ericvergnaud https://github.com/ericvergnaud solution is accepted, I suggest fixing the following issue #1615 https://github.com/antlr/antlr4/issues/1615 together since all runtime identifiers will be broken anyway. The mentioned issue has 13 upvotes thus it's important.

— Reply to this email directly, view it on GitHub https://github.com/antlr/antlr4/issues/1070#issuecomment-1003128603, or unsubscribe https://github.com/notifications/unsubscribe-auth/AD4SBWLQ5ZDT62VP4XNQVLTUTSMVBANCNFSM4BW3ON4A. Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you authored the thread.Message ID: @.***>

KvanTTT commented 2 years ago

I'd suggest a opt-in feature, output LanguageParser.ParseRulename() method for rulename rule for every target, regardless if there's any symbol conflict.

It also looks good. By the way, when I was a newbie in ANTLR I was confused about how to call the parse method. In this case Parse suffix looks reasonable.

KvanTTT commented 2 years ago

I don't see the generated code unless something goes wrong, and the case style is irrelevant IMO.

You have to call parsing methods and access to parse tree nodes in visitor/listener (or without them). I don't think it's rare code.

Also, your message is a bit irrelevant to the current topic.

ericvergnaud commented 2 years ago

‘very big noise’ vs small problem… not sure anyone will like this

Le 30 déc. 2021 à 18:43, Ivan Kochurkin @.***> a écrit :

I think rules should have the same name across all targets and all grammar names, such that it’s easier to locate rules within and across targets without having to remember per target implementation details.

Users ordinary use single target as I understand. But grammar developers write grammars for all runtimes. Thus per target implementation details is mostly their responsibility.

As an example, rule ’this’ should be named ’this_rule’ regardless of the target, otherwise it will be named ’this_rule’ in most targets, but @.***' in C# and ’this’ in Python… simply horrible.

I don't think it's a bit problem because code completion works well. In C# it will suggest @this as well it will suggest this_ in Python. Moreover _rule looks natural for Python runtime where snake_case is used, but it's ugly for C# and Java because they use CamelCase. Anyway, generated code differs across runtimes because of different conventions.

I agree that making this an option could reduce the immediate noise, but it would defeat the purpose i.e. help newbies avoid name collision issues.

It's a very big immediate noise covering all ANTLR runtime users that ideally should be prevented.

— Reply to this email directly, view it on GitHub https://github.com/antlr/antlr4/issues/1070#issuecomment-1003123678, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAZNQJHQT3IC7ZIZATTYBULUTSK33ANCNFSM4BW3ON4A. You are receiving this because you commented.

parrt commented 2 years ago

Good morning! well this is a difficult decision. I personally ran into this naming issue the other day when I tried to run the Java grammar using Python or something and I got a bad word collision. I had to change the grammar and that annoyed the crap out of me. haha. welcome to the party I guess.

Let me see if I understand the situation. Users access parser rule x by calling x() in their support code and they also access XContext as well as the field y associated with rule argument y and so on. The essential problem is that sometimes x and y are reserved words in the target language, which will of course cause a syntax error for the generated code.

Thinking out loud...

I take it as axiomatic that we don't want to force everyone to manually go edit all of their support code (visitors etc...) to change every single x and y reference for existing working projects. Existing grammar builds + support code should continue to work, even if someone had to alter the grammar rule names to make it work with that target.
I think it's reasonable to suggest that we remove the bad word error message and simply ask the target to tweak the symbol for generation purposes. E.g., x -> _x or whatever. This can only affect new grammars because previously it was an error, hence, no backward compatibility issues. It does mean that programmers will have to be aware that rule class becomes _class in the generated code, but we are used to this kind of thing as programmers.
It seems that we should do the minimal change, at the cost of a bit of inconsistency. Rule x -> _x but also to XContext. I'm not sure I like _xContext or _XContext.
I'm not concerned that it might be x in Python but _x in DART support code; everything will be self consistent within a specific target world.

I would propose a single default target method, such as escageTargetSymbol() or something, that added _ as a prefix or something; individual targets could do as they like. Moreover, this method would only be called for the list of known bad words. I value the relationship between a grammar rule name and argument name and the same name in the generated code more than I dislike the inconsistency of few symbols that must be escaped.

KvanTTT commented 2 years ago

Good morning!

Good evening! (actually almost night for me) :)

If I understand correctly, you suggest almost the same algorithm that I suggested, but with only a single method escapeTargetSymbol (maybe just escapeSymbol since it's located in Target) with a default implementation that adds _ prefix but can be overriden. It sounds good.

But why do you prefer prefix over suffix? I guess suffix is better for code completion and clarity (private fields start with _ suffix in some languages).

parrt commented 2 years ago

:) I don't prefer any translation... escapeTargetSymbol(String symbol) would return a new string. why limited to a prefix or suffix? I can imagine an adoring fanboy wanting x -> xPARRTx :)

ericvergnaud commented 2 years ago

Ter,

this will not only affect new grammars but also new targets for existing grammars (which is what Ken Domino bumped into when checking grammars in our repo).

Eric

Le 30 déc. 2021 à 20:22, Terence Parr @.***> a écrit :

Good morning! well this is a difficult decision. I personally ran into this naming issue the other day when I tried to run the Java grammar using Python or something and I got a bad word collision. I had to change the grammar and that annoyed the crap out of me. haha. welcome to the party I guess.

Let me see if I understand the situation. Users access parser rule x by calling x() in their support code and they also access XContext as well as the field y associated with rule argument y and so on. The essential problem is that sometimes x and y are reserved words in the target language, which will of course cause a syntax error for the generated code.

Thinking out loud...

I take it as axiomatic that we don't want to force everyone to manually go edit all of their support code (visitors etc...) to change every single x and y reference for existing working projects. Existing grammar builds + support code should continue to work, even if someone had to alter the grammar rule names to make it work with that target. I think it's reasonable to suggest that we remove the bad word error message and simply ask the target to tweak the symbol for generation purposes. E.g., x -> _x or whatever. This can only affect new grammars because previously it was an error, hence, no backward compatibility issues. It does mean that programmers will have to be aware that rule class becomes _class in the generated code, but we are used to this kind of thing as programmers. It seems that we should do the minimal change, at the cost of a bit of inconsistency. Rule x -> _x but also to XContext. I'm not sure I like _xContext or _XContext. I'm not concerned that it might be x in Python but x in DART support code; everything will be self consistent within a specific target world. I would propose a single default target method, such as escageTargetSymbol() or something, that added as a prefix or something; individual targets could do as they like. Moreover, this method would only be called for the list of known bad words. I value the relationship between a grammar rule name and argument name and the same name in the generated code more than I dislike the inconsistency of few symbols that must be escaped.

— Reply to this email directly, view it on GitHub https://github.com/antlr/antlr4/issues/1070#issuecomment-1003153220, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAZNQJB7YRON52JZUJB4LQLUTSWORANCNFSM4BW3ON4A. You are receiving this because you were mentioned.

kaby76 commented 2 years ago

Whether "newbie" or "power user", people just want to get something from grammars-v4 to work. And, believe it or not, people like Antlr because of the huge grammar library in grammars-v4. So, yes, as an advocate of Antlr, this matters. Either:

1) Implement renaming in the Antlr tool so that it generates code that outputs code that works. 2) Implement this in a preprocessor instead of Antlr proper. We are already toying with the idea of adjusting the grammar for other reasons such as rewriting action code containing "this." to "p." for the Go target. I already have Trash which can rename a large list of symbols quickly, but I don't rewrite action code yet. 3) Do the status quo, i.e., rename symbols in a grammar in grammars-v4 periodically when I do a port. Then, I can enjoy being on the receiving end of comments from irritating the living hell out of existing developers who grab updates from grammars-v4 and have listeners and visitors source code that requires updating.

"Newbies" don't understand what a "symbol conflict" means (see this or this).

kaby76 commented 2 years ago

(P.S., this is a great discussion! Let's come up with a good solution.)

parrt commented 2 years ago

hi @studentmain, I see the argument for consistency so that a tool can process output, but how often do people want to translate their generated code? Yeah, languages do drift over time and we definitely don't have a complete list of keywords... we mainly added as we ran into issues. if we don't have a complete list of bad words then it argues for a general translation of input grammar symbols to output symbols.

if we assume we do need such a translation for the moment, that would argue for a single way to do it, not an option. If we had an option, then it would be very hard for me to work with or even read multiple projects associated with the same target language; one person might convert x to _x and another might do something insane like x_poopoo. Basically, this would argue for a method that translates all rule, arg, attribute names.

Ugh. then this brings up what happens with rule x context objects? I think this cannot be an option because the templates all refer to Context etc... Only the symbol itself should be translated I think.

KvanTTT commented 2 years ago

I think it's hard to say we have such list, they are just endless, languages are keeping adding new features (new keywords) and we are keeping adding runtimes.

Each runtime contains such a list, see JavaTarget for example, that's not a problem. The main problem is the frequency of ANTLR updates.

KvanTTT commented 2 years ago

Implement renaming in the Antlr tool so that it generates code that outputs code that works.

I like this item because it's the simplest one for users. Generally, a user doesn't care why ANTLR can't generate parser code, and it's hard to blame him. Any preprocessor complicates the entire process.

I already have Trash which can rename a large list of symbols quickly, but I don't rewrite action code yet.

I've spent some time on actions translations and have figured out that it's not possible to implement it (or dirty) correctly without StringTemplate modification (mapping actions from generated code to grammar to show correct error position). But it's a topic for another discussion.

parrt commented 2 years ago

Each runtime contains such a list, see JavaTarget for example, that's not a problem. The main problem is the frequency of ANTLR updates.

That argues then for only translating those bad symbols.

parrt commented 2 years ago

It's not about translate generated code.

Take trgen as an example, it create a parser exectuable (similar to grun) project for many target languages from a template. It will lookup directory's Maven project config, find the entry rule, generate code to call corresponding method.

I see. Hmm...well, this is a fairly specialized issue and maybe shouldn't drive common case.

kaby76 commented 2 years ago

I see. Hmm...well, this is a fairly specialized issue and maybe shouldn't drive common case.

Yes, you may be right. For sure, fixing symbol conflicts in grammars-v4 is a PITA because we have to go around and fix the old grammars. And, while we might get some flak over the rename, it's really the least of problems for an end-user of the grammar when the grammar is charged for an important bug fix.

kaby76 commented 2 years ago

Geesus! :) https://groups.google.com/g/antlr-discussion/c/wsvqSk-K8kM

KvanTTT commented 2 years ago

Re 1 I'd suggest a much simpler approach, where we simply add _rule to every parser rule name and _TOKEN to every lexer token name.

Just little addition. Even if using different suffixes/prefixes, the final identifiers may conflict with runtime keywords in theory (but it's a much rare case).

parrt commented 2 years ago

From the PR, I am seeing both Rule object fields .rule and .runtimeName. We have things like RULE_<currentRule.name> that maybe should be RULE_x not RULE_x_. But then I realized, if we ever translate x to x_ it should be always x_ regardless so it's consistent. Hmm...yeah, so it argues for having <currentRule.name> return the runtime version and never the original since we are in the code gen templates. Anybody disagree?

parrt commented 2 years ago

Ivan points out rule @x in C# would then cause RULE_@x which won't compile; and we def want original names in strings like so we need to have both values available. Hmm...i just kinda don't like the massive tweak to all targets required by this: .rule -> .runtimeName. Also it's weird to me if I'm working in the generated code if I call parser.foo_() but use constant RULE_foo w/o the underscore. I don't like the inconsistency there.

KvanTTT commented 2 years ago

Ok, it can be resolved by checking the method supportsWordEscaping. If it returns true (C#, Swift) then use original names for words used in concatenation. But anyway it requires changes in templates where concatenation expression is used.

parrt commented 2 years ago

Probably we should just choose an escape that always works for that target.

KvanTTT commented 2 years ago

Hmm...yeah, so it argues for having return the runtime version and never the original since we are in the code gen templates. Anybody disagree?

Unfortunately, it's anyway not so easy because grammar semantics analyzer and grammar generator use different names during processing. Analyzer always uses original names, which can ref other rules in original grammar. The generator uses mostly runtimeName (but not always as I understand). It's important not to mix them. Current code is written without the conception of different grammar/generated rule names.

KvanTTT commented 2 years ago

Probably we should just choose an escape that always works for that target.

The point is about native words escaping in C# and Swift. @if is translated to real if during compilation, that's why it's better to use such a feature if it's available. Without consideration of such a feature, we can use just word + "_" for all targets, but it's not very clear.

KvanTTT commented 2 years ago

Also it's weird to me if I'm working in the generated code if I call parser.foo_() but use constant RULE_foo w/o the underscore. I don't like the inconsistency there.

BTW, targets at least use case transformation for some entities, i.e. for -> For. It's also a bit inconsistent.

parrt commented 2 years ago

BTW, targets at least use case transformation for some entities, i.e. for -> For. It's also a bit inconsistent.

But usually as part of another word, I think right?

KvanTTT commented 2 years ago

Yes, for all targets except Go. It has For as an identifier. Also For_Context looks worse than ForContext. If accordance with grammar is more important then my approach is more preferable, if accordance with runtime is more preferable then your approach wins. And everything can be described in the documentation.

KvanTTT commented 2 years ago

The point is about native words escaping in C# and Swift. @if is translated to real if during compilation, that's why it's better to use such a feature if it's available. Without consideration of such a feature, we can use just word + "_" for all targets, but it's not very clear.

Now I think maybe it's more compilated for C# and Swift. @x is related only to keywords but not for potential conflicting symbols in runtime (for instance, parser methods of fields). But using overloads I think it's possible to implement this difference in concrete targets.

parrt commented 2 years ago

Yep, every target can implement the simple rewriting as they want. current thoughts:

You've convinced me that we need both symbols, the original and the escaped version (if any) at runtime and so the change to all targets is justified.
Ignore situation where someone names their grammar for... let it become for.java and class for extends Parser {...} in the generated code and fail. not a big deal to change this in the grammar for new people and we don't risk changing anything for existing grammars.
I think the only remaining issue is whether x always becomes x_ if it is a reserved work or whether we only do it in isolation. Choice 1: rule for still becomes Rule_for but the function associated with it becomes for_(). Choice 2: for becomes Rule_for_ and for_(). I'd like to hear other people's opinions after vacation, but I think now I am leaning towards a simple rule which is: due the minimal translation necessary to get the generated code to compile. This is choice number one, and I think what @KvanTTT is advocating.

KvanTTT commented 2 years ago

This is choice number one, and I think what @KvanTTT is advocating.

I like the first option more, and I'm trying to implement both of them for comparison: 1, 2. It looks like they require a comparable amount of changes, but the first one requires more template changes, the second one requires more tool changes. But the first choice looks preferable in my opinion because escaping is related only to conflicting keywords and symbols, and original rule names without escaping can be obtained. But I also accept the second option, if it's more preferable for all.

parrt commented 2 years ago

it looks like you are saying you like the second option more and then you say the first option. :) Oh, i see. you like the second option but you think we should implement the first.

KvanTTT commented 2 years ago

Sorry, I messed up. I meant I like the first choice when a rule for becomes Rule_for (not Rule_for_), but for_.

parrt commented 2 years ago

Yeah makes most sense I think.

parrt commented 2 years ago

Just to connect things, @KvanTTT is working on https://github.com/antlr/antlr4/pull/3451

antlr / antlr4

Have ANTLR4 prevent conflict with user rule names by behind-the-scenes renaming of its own variables #1070