Reports strange error when combining grammars named X.g4 and XLexer.g4

antlr / antlr4

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.

http://antlr.org

BSD 3-Clause "New" or "Revised" License


Reports strange error when combining grammars named X.g4 and XLexer.g4 #966

Open WalkerCodeRanger opened 9 years ago

WalkerCodeRanger commented 9 years ago

With ANTLR 4.5.1, I followed the example on page 36 of the book, where the lexer and parser grammars are split up, resulting in "LibExpr.g4", and split my grammar into lexical and parsing rules. I named them "X.g4" and "XLexer.g4", following the CSharp example grammar (https://github.com/antlr/grammars-v4/tree/master/csharp). However, I got an error like:

error(113): X.g4:2:7: combined grammar X and imported lexer grammar XLexer both generate XLexer

Firstly, that error meant nothing to me and gave absolutely no indication of what the problem was. It turns out that if I name my lexer grammar anything except "XLexer", the error goes away. I presume this is the result of some check for duplicate named grammars (to prevent cycles?) combined with some rule that drops "Lexer" off the name of lexer grammars, so that if you generate a lexer directly from them the result won't be "XLexerLexer.java". However, I think the error is a bug here, since the name of the imported grammar shouldn't matter. Indeed, these seem like eminently reasonable names for grammar files.

jimidle commented 9 years ago

Does your lexer grammar start with:

lexer grammar ....

and your parser grammar start with

parser grammar ...

Otherwise the tool will assume you have a combined grammar that will also generate the lexer of the same name.

Jim


WalkerCodeRanger commented 9 years ago

Following the book example, "XLexer.g4" contains a lexer grammar, but "X.g4" contains a combined grammar, because it imports "XLexer.g4". I expect it to generate the lexer and parser together when I run

java org.antlr.v4.Tool X.g4

The simplest grammars I could come up with to reproduce this are

XLexer.g4

lexer grammar XLexer;
Char : .;

X.g4

grammar X;
import XLexer;
file : Char* EOF;

Running the tool on "X.g4" only produces the error. However, if I change the "XLexer.g4" to some other name such as "XTest.g4" and change the file contents accordingly, then it will produce the correct set of files. That includes "XParser.java" and "XLexer.java".
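As an illustration of the workaround described above, the renamed pair might look like the following. This is only a sketch of the change the comment describes: "XTest" is the arbitrary name used in the comment, and according to the report any name other than "XLexer" behaves the same way.

```antlr
// XTest.g4 (formerly XLexer.g4): only the grammar name changes
lexer grammar XTest;
Char : . ;
```

```antlr
// X.g4: the import now refers to XTest instead of XLexer
grammar X;
import XTest;
file : Char* EOF ;
```

With these two files, running `java org.antlr.v4.Tool X.g4` is reported to generate "XParser.java" and "XLexer.java".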

If I may make an observation, the book uses combined grammars for most of its examples (though they may import lexer grammars). However, combined grammars seem to have more issues: this one, issues with importing grammars that use channels and modes, and the fact that channels aren't supported in combined grammars. The combined grammars and importing mechanism was more intuitive to me, and I'd like to see support for it improved.

jimidle commented 9 years ago

You cannot import a lexer with the same name as the lexer generated by your combined grammar - so the tool is trying to tell you that you would have two lexers with the same name, which would generate two vocab files with the same name, etc. That's why it rejects it.

If you use the same name as the parser for the grammar, then you just import the vocab file that the lexer generates into the parser. This is just keeping the lexer and parser in separate source code files.
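A minimal sketch of that arrangement, assuming the tokenVocab option (mentioned later in this thread) is the mechanism being referred to. The parser grammar name "XParser" is an assumption made for illustration; the thread does not name it.

```antlr
// XLexer.g4: a standalone lexer grammar
lexer grammar XLexer;
Char : . ;
```

```antlr
// XParser.g4: a standalone parser grammar (the name is hypothetical)
parser grammar XParser;
options { tokenVocab = XLexer; } // take token types from the vocab file XLexer produces
file : Char* EOF ;
```

Assuming both files sit in the same directory, running the tool on XLexer.g4 before XParser.g4 should produce XLexer.java and XParser.java, among other files, with no import statement involved.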

I have not seen any issues with combined grammars that are not caused by misunderstandings or mistakes in the grammar. By all means report any issues that you find. ANTLR does take a little experience before it all fits together in an obvious manner.

Jim


WalkerCodeRanger commented 9 years ago

I still think this is a bug. If the behaviour isn't changed, then the error message at least needs to be improved.

Why it is a bug: In this situation you are not generating the lexer file. You are importing it. An imported grammar is merged into the grammar in question. It shouldn't matter what that grammar is named. Consider that if I import lexer grammar Foo into my X.g4 grammar, none of the files output (including the token vocab file) will have Foo anywhere in the name. Also, if you introduce an intermediate grammar (i.e. X imports Foo and Foo imports XLexer) then the error is not reported. If you were generating all imported grammars then that should be an error too, but you are not. They are just being merged together.

The reason the error message is confusing: Two reasons. First, I didn't ask it to generate a lexer for XLexer.g4, so I have no idea what it is talking about when it says XLexer generates something. Second, when it says "both generate XLexer", it has switched the meaning of "XLexer" from a grammar (earlier in the sentence) to an output lexer. The second use needs some qualifiers to indicate what it is talking about.

About combined grammars, here is what I am referring to:

"error(164): custom channels are not supported in combined grammars"
importing a lexer grammar with channels into a combined grammar produces invalid code #965
"error(120): lexical modes are only allowed in lexer grammars"
importing a grammar with modes into a combined grammar produces invalid code #970
imported grammars are not validated for file name matching the grammar name #892
issues with tokens section in combined grammar #338

That seems like a lot of issues and limitations. If these were fixed and removed, then combined grammars with importing would actually be a really intuitive, easy way of working.

P.S. I do now understand the use of the tokenVocab option, but I didn't when I started this because it is buried at the end of the book and not listed on the website #969.

sharwell commented 9 years ago

> That seems like a lot of issues and limitations. If these were fixed and removed, then combined grammars with importing would actually be a really intuitive, easy way of working.

I recommend only ever using a lexer grammar paired with a separate parser grammar, and never using import. In addition to being the most straightforward grammars to reason about, this strategy will block the use of certain ANTLR features (specifically the use of string literals to define tokens in a parser rule) which are easy to use incorrectly, introducing difficult to detect bugs into your application.

:memo: This is my personal opinion, but it reflects the manner in which I created a number of reasonably large projects that successfully used ANTLR in some key capacity.
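To make the point about string literals concrete, here is a small sketch. The grammar names (Stmt, StmtLexer, StmtParser) and the rules are hypothetical and not taken from this thread. In a combined grammar, a literal in a parser rule implicitly defines a token:

```antlr
// Stmt.g4 (combined): 'if' in the parser rule silently creates a token type
grammar Stmt;
stmt : 'if' ID ;
ID   : [a-z]+ ;
WS   : [ \t\r\n]+ -> skip ;
```

With a separate lexer and parser, every token has to be declared in the lexer grammar and is referenced by name:

```antlr
// StmtLexer.g4: every token is declared explicitly
lexer grammar StmtLexer;
IF : 'if' ;
ID : [a-z]+ ;
WS : [ \t\r\n]+ -> skip ;
```

```antlr
// StmtParser.g4: refers to tokens by name
parser grammar StmtParser;
options { tokenVocab = StmtLexer; }
stmt : IF ID ;
```

In the split form, a literal in a parser rule that has no matching token in StmtLexer should be rejected by the tool instead of quietly creating a new token, which is the safeguard the recommendation relies on.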

WalkerCodeRanger commented 9 years ago

@sharwell thanks, I am coming to understand that. However, that is very much at odds with the way The Definitive ANTLR 4 Reference presents it. New users who learn using the reference (which is really the only way, since there is no documentation to speak of outside of it) are naturally going to start with combined grammars and importing. I feel like there is a disconnect there.

parrt commented 9 years ago

i usually use combined grammars but never import ;)


jimidle commented 9 years ago

> I still think this is a bug. If the behaviour isn't changed, then the error message at least needs to be improved.

Perhaps the error message isn't clear enough if you are just starting out and could be improved.

> Why it is a bug: In this situation you are not generating the lexer file. You are importing it. An imported grammar is merged into the grammar in question. It shouldn't matter what that grammar is named. Consider that if I import lexer grammar Foo into my X.g4 grammar, none of the files output (including the token vocab file) will have Foo anywhere in the name. Also, if you introduce an intermediate grammar (i.e. X imports Foo and Foo imports XLexer) then the error is not reported. If you were generating all imported grammars then that should be an error too, but you are not. They are just being merged together.

Well, you are basically trying to redesign the idea. It wasn't intended to do what you want it to do. It has to track the imports somehow, ignoring what it actually outputs afterwards.

> The reason the error message is confusing: Two reasons. First, I didn't ask it to generate a lexer for XLexer.g4

By using grammar X you did in fact ask it to generate a lexer named XLexer. Hence you cannot import another lexer grammar also called XLexer.

> That seems like a lot of issues and limitations. If these were fixed and removed, then combined grammars with importing would actually be a really intuitive, easy way of working.

I don't think import is that intuitive myself. I always use a separate lexer and parser grammar - then you could import lexers in your lexer grammar, but if you try to import a lexer grammar of the same name as the lexer grammar you are importing it into, then which grammar should it import - itself?

Also, why use a combined grammar, then import the lexer? I think that the intent (and I may be putting words into people's mouths here) was that you might have some common lexer stuff such as SqlKeywordsLexer or FloatingPoint etc. and then import them into other grammars. In practice, I have not found that to be so practical.

michaelthoward commented 8 years ago

For what it's worth ... As a relative newbie to Antlr4, I agree completely with WalkerCodeRanger's comments. The error message is terribly confusing.

phreed commented 7 years ago

So the recommendation is to not use import? The book says "It's a good idea to break up very large grammars into logical chunks, just like we do with software". I would think a good size for a 'chunk' would be about 20 rules.

RichardLake commented 1 year ago

Encountered the same issue and found this while searching for a solution. The one I found is to set the tokenVocab option like:

parser grammar ExampleParser;
options { tokenVocab = ExampleLexer; } // use tokens from ExampleLexer.g4 in the same directory

instead of trying to import the lexer into the parser.

If this is the correct method, would there be an objection to adding a note to the two errors "parser grammar Example cannot import lexer grammar" and "combined grammar Example and imported lexer grammar ExampleLexer both generate ExampleLexer"? Something like "If you are trying to use the tokens from the lexer in the parser, refer to the tokenVocab option.".