antlr / antlr4

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.
http://antlr.org
BSD 3-Clause "New" or "Revised" License

C# runtime: TokenFactory on parser is read only #2726

Open binarycow opened 4 years ago

binarycow commented 4 years ago

C# runtime, NuGet package version 4.7.2, ANTLR version antlr-4.8-complete.jar

Hello! I have two grammar files. One is a lexer grammar (so I can use modes) and the other is a parser grammar.

I have written a custom token, inheriting from CommonToken. I created a token factory, implementing ITokenFactory.

I can set the token factory of the lexer just fine, using the below code:

lexer.TokenFactory = tokenFactory;

But I cannot set the parser's TokenFactory property, since it is read-only.

I would expect to be able to use this code:

parser.TokenFactory = tokenFactory;

Is there something I am missing? I did search for information, and what I found about the Java runtime implies this is possible (in general), but I cannot see how to do it with the C# runtime.

Thanks in advance!

ericvergnaud commented 4 years ago

Hi,

The lexer is where tokens are created; the parser's accessors are just shortcuts to the underlying lexer's token factory.

Eric


binarycow commented 4 years ago

Oh. Okay. So as long as the ID numbers match for each token it's not an issue?

Speaking of, is there an easy way of ensuring the token ids match in each grammar? If I add tokens to the lexer grammar, I have to make sure I add them to the parser grammar in the exact same order.


ericvergnaud commented 4 years ago

You cannot add tokens to a parser grammar, so I'm not sure how they would not match (unless you are missing the 'parser grammar' declaration in your g4?).


binarycow commented 4 years ago

Okay, here's my issue regarding the token ordering...

I have a lexer grammar that defines eight tokens. They are defined in the order below, and my Lexer.cs file assigns them these integer values:

  • STATEMENT_START = 1
  • OUTPUT = 2
  • STATEMENT_END = 3
  • KEYWORD_FOR = 4
  • KEYWORD_ENDFOR = 5
  • KEYWORD_IN = 6
  • WHITESPACE = (No number, has a -> skip)
  • IDENTIFIER = 7

I have a parser grammar, which uses the tokens in the order below, and my Parser.cs file defines them with these integer values:

  • OUTPUT = 1
  • STATEMENT_START = 2
  • KEYWORD_FOR = 3
  • IDENTIFIER = 4
  • KEYWORD_IN = 5
  • STATEMENT_END = 6
  • KEYWORD_ENDFOR = 7

Additionally, I get warnings of implicit token creation when I execute antlr on the parser.g4 file.

When I run my test program, I check the token types the lexer produces, and they are all matched correctly. But the parser is not able to parse the input correctly. If I take the token type integers that the lexer reports and compare them to the token type integers listed in the Parser.cs file, I can see that the parser, using the integer values, is parsing it "correctly" - from its perspective.

If I add this to the top of the parser grammar, parsing works just fine. I also notice that the tokens are then defined in Parser.cs with the correct integer values.

tokens { STATEMENT_START, OUTPUT, STATEMENT_END, KEYWORD_FOR, KEYWORD_ENDFOR, KEYWORD_IN, WHITESPACE, IDENTIFIER }

It's clear to me that:

  • While I do not need to specify tokens in a parser file, if I don't, it will define them for me.
  • The 'token type' that is passed from the lexer to the parser is a pure integer.
  • If the integer value of the token type doesn't match what the parser expects, it will not parse correctly.
  • The tokens { } section allows me to specify a specific order of tokens in the parser, so the lexer and parser use the same token IDs.

Currently, I am manually keeping the two token lists in sync: I take the token names from the Lexer.cs file and put them, in that order, in the tokens { } section of the parser.g4 file. This fixes my issue, but it's a pain.

Attached are sample files.

Thanks for any help you can provide.

binarycow commented 4 years ago

Aha! I think I found the answer:

options { tokenVocab = Lexer; }
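
For anyone landing here later, this is a minimal sketch of how the `tokenVocab` option from the last comment could look with a split lexer/parser grammar. The grammar names (`TemplateLexer`, `TemplateParser`), the token patterns, and the `forBlock` rule are hypothetical, since the attached sample files are not shown in the thread; only the token names come from the lists above, and `OUTPUT` is left out because its definition is unknown.

```antlr
// File: TemplateLexer.g4 (hypothetical name and token patterns)
lexer grammar TemplateLexer;

STATEMENT_START : '{%' ;
STATEMENT_END   : '%}' ;
KEYWORD_FOR     : 'for' ;      // keywords are listed before IDENTIFIER so they win ties
KEYWORD_IN      : 'in' ;
KEYWORD_ENDFOR  : 'endfor' ;
IDENTIFIER      : [a-zA-Z_] [a-zA-Z_0-9]* ;
WHITESPACE      : [ \t\r\n]+ -> skip ;

// File: TemplateParser.g4 (hypothetical name and rule)
parser grammar TemplateParser;

// Read token types from TemplateLexer.tokens instead of letting the
// parser grammar assign its own numbers to implicitly defined tokens.
options { tokenVocab = TemplateLexer; }

forBlock
    : STATEMENT_START KEYWORD_FOR IDENTIFIER KEYWORD_IN IDENTIFIER STATEMENT_END
      STATEMENT_START KEYWORD_ENDFOR STATEMENT_END
    ;
```

With this option the parser takes its token numbering from the `.tokens` file that ANTLR writes when it processes the lexer grammar, so the lexer grammar has to be generated before the parser grammar. The manual `tokens { }` list in the parser grammar should then no longer be needed.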