antlr / antlr4

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.
http://antlr.org
BSD 3-Clause "New" or "Revised" License

Generated C++ code is ill-formed in C++ 20 standard. #2991

Open g199209 opened 3 years ago

g199209 commented 3 years ago

A very simple hello grammar file:

grammar Hello ;
firstRule : 'hello' ID ;
ID: [a-z]+ ;
WS : [ \t\r\n]+ -> skip ;

The generated HelloLexer.cpp contains the following snippet:

std::vector<std::string> HelloLexer::_ruleNames = {
  u8"T__0", u8"ID", u8"WS"
};

std::vector<std::string> HelloLexer::_channelNames = {
  "DEFAULT_TOKEN_CHANNEL", "HIDDEN"
};

std::vector<std::string> HelloLexer::_modeNames = {
  u8"DEFAULT_MODE"
};

std::vector<std::string> HelloLexer::_literalNames = {
  "", u8"'hello'"
};

std::vector<std::string> HelloLexer::_symbolicNames = {
  "", "", u8"ID", u8"WS"
};

This code fails to compile with --std=c++2a. In the current C++20 draft, constructing a std::string from a u8"xxx" literal is forbidden, because u8 literals now have type const char8_t[N] rather than const char[N].

This Stack Overflow question discusses the issue: C++20 with u8, char8_t and std::string

The proposal P1423R2 gives more details.

The proposal also describes several ways to handle this case. For now I simply added -fno-char8_t to my g++ flags, which is only a short-term workaround.
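For illustration, here is a minimal sketch of an explicit-conversion workaround in the spirit of the remedies discussed in P1423R2. The helper names toNarrow, makeRuleNames and checkRuleNames are hypothetical and are not part of the ANTLR runtime or of the generated code. The idea is to let a template deduce the literal's character type, so the same source should build both before C++20 (where u8 literals are char arrays) and in C++20 (where they are char8_t arrays).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not an ANTLR API): copies the bytes of a string
// literal into a std::string. CharT deduces to char before C++20 (or with
// -fno-char8_t) and to char8_t in C++20, so the call compiles either way.
template <typename CharT, std::size_t N>
std::string toNarrow(const CharT (&literal)[N]) {
  static_assert(sizeof(CharT) == 1, "expects a narrow or UTF-8 literal");
  // N includes the terminating null character, so copy N - 1 code units.
  return std::string(reinterpret_cast<const char *>(literal), N - 1);
}

// Stand-in for the generated HelloLexer::_ruleNames table.
std::vector<std::string> makeRuleNames() {
  return {toNarrow(u8"T__0"), toNarrow(u8"ID"), toNarrow(u8"WS")};
}

// Usage check: the converted names compare equal to plain narrow strings.
void checkRuleNames() {
  const std::vector<std::string> names = makeRuleNames();
  assert(names.size() == 3);
  assert(names[1] == "ID");
}
```

Since all the literals in the generated tables above are plain ASCII, simply emitting them without the u8 prefix might be an even simpler fix; that is a guess on my side, not something confirmed in this thread.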

I think we should modify the code generation template and the C++ runtime to support this C++20 change.

g199209 commented 3 years ago

@mike-lischke

mike-lischke commented 3 years ago

OK, if someone could file a PR here we can take a look. Note: I cannot myself test C++20 code currently, so we also need associated test settings for that.

pjonsson commented 2 years ago

Is this issue fixed by commit 09eb905332c3abe?

wh1t3lord commented 2 years ago

I can confirm that the generated code does not compile.

The first issue is with u8 string literals: they can't be converted to std::string because u8 string literals and char strings have different types in C++20.

Another problem is the antlr4::atn::SerializedATNView class: it accepts only an int32_t vector or an int32_t* array, so it doesn't accept the type of the _serializedATN variable (it's a std::vector).
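To make the type mismatch concrete, here is a rough sketch of one way to bridge it. Int32View is a hypothetical stand-in for a view type that only accepts int32_t data (it is not the real SerializedATNView), and widenToInt32 is an illustrative helper. The element type of _serializedATN is not stated in this thread, so the helper is templated on it rather than assuming one.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for a view that only accepts int32_t data.
// It does not own the data; the source buffer must outlive the view.
class Int32View {
public:
  Int32View(const std::int32_t *data, std::size_t size)
      : _data(data), _size(size) {
    assert(data != nullptr || size == 0);
  }
  explicit Int32View(const std::vector<std::int32_t> &values)
      : Int32View(values.data(), values.size()) {}

  const std::int32_t *data() const { return _data; }
  std::size_t size() const { return _size; }

private:
  const std::int32_t *_data;
  std::size_t _size;
};

// Illustrative helper: copies a table with some other integral element
// type into a std::vector<int32_t> that the view can accept.
template <typename T>
std::vector<std::int32_t> widenToInt32(const std::vector<T> &source) {
  static_assert(std::is_integral<T>::value, "expects an integral table");
  return std::vector<std::int32_t>(source.begin(), source.end());
}
```

The widened vector has to stay alive for as long as the view is in use, and the copy costs extra memory, so regenerating the parser with a tool version that matches the runtime is probably the cleaner route if that turns out to be the cause.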

It would probably be better to support C++20, given its breaking changes relative to earlier standards, or simply to support both standards (C++17/C++20).

My test case was a generated parser for CSS3, using the grammar from https://github.com/antlr/grammars-v4

iagaponenko commented 1 year ago

As of now, nearly 3 years after the problem was discovered, it's still impossible to compile the generated code.