antlr / antlr4

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.
http://antlr.org
BSD 3-Clause "New" or "Revised" License
17.11k stars 3.28k forks source link

[Cpp] Can't use antlr and cstdio together because of undef EOF #2914

Closed cangfengzhs closed 2 years ago

cangfengzhs commented 4 years ago

I use another lib(brpc) which use macro EOF, but anltr(4.8) undef it in antlr-common.h. In order to use them at the same time, I need to strictly control the order of the include, otherwise the compilation will fail. How to solve this problem?

Marti2203 commented 4 years ago

Hi, Copy the definition before including antlr-common.h and put it back afterwards?

cangfengzhs commented 3 years ago

I think EOF should be replaced by _EOF and delete undef? After all, EOF is defined by the standard library. Although this problem can be avoided to some extent by modifying the include order of header file, it will be adjusted when using cpplent to format code automatically.

shahms commented 2 years ago

EOF is a reserved identifier (by virtue of being defined by the C standard library). User programs cannot redefine it and Antlr should not generate code which does. And it certainly shouldn't be undefining it in a common header.

parrt commented 2 years ago

I'm fine with such a change if EOF reserved is C++ (not just C). @mike-lischke @jcking have an opinion?

parrt commented 2 years ago

How much will it break if we added EOF to the keywords list so that is generated EOF_?

parrt commented 2 years ago

@jcking points out:

Basically anywhere EOF is mentioned outside comments, you have to add this before:

#pragma push_macro("EOF")
#undef EOF

And this after:

#pragma pop_macro("EOF")

And then remove the #undef EOF in the header.

shahms commented 2 years ago

It's reserved in C++ as well:

https://eel.is/c++draft/macro.names https://eel.is/c++draft/cstdio.syn

mike-lischke commented 2 years ago

Tbh. I don't see a need to change that. It has been this way since 2016 and the vast majority of the C++ ANTLR4 user have not had any problem with it. We cannot please everyone.

shahms commented 2 years ago

The issue here, as is often the case with undefined behavior, is fragility. UB can happen-to-work for quite a while until it doesn't and then you're hit with a variety of problems. At least here, it's a compilation failure rather than silent corruption at runtime, but it can happen for a number of reasons including that a transitive dependency changed the order of #include files. When this happens, there isn't much recourse for users short of omitting mention of EOF from their grammar (against the "best practices" documentation), not updating a dependency (potentially leaving them open to security vulnerabilities) or hackily adjusting the generated code with sed or equivalent.

While it's true that you "cannot please everyone" it's also the case that the current situation is a ticking time bomb for your currently-pleased users.

parrt commented 2 years ago

At least with one example we have found that removing a reference to EOF from the grammar (start rule) allows the generated code to build even if you include cstdio. To get the same functionality, you can add a quick check after the parse to ensure that the token stream index is the length of the token stream. This is a hack but I'm not sure how to overcome the collision with EOF macro. Just leaving this note here for other travelers.

For the next major release I can see adding EOF to the reserved keywords list and having it convert to EOF_ Or, we can try the #undef trick.

Thanks everyone!

mike-lischke commented 2 years ago

@shahms I don't think this issue is a potential time bomb. It's not like it just happened to work. Quite the opposite: only very few user used the EOF C macro and knew how to work around the #undef in the ANTLR4 C++ runtime. Most C++ users didn't even see a problem, because it is related to a very narrow area (using C stuff in a C++ application).

@parrt A simpler solution is to change the include order of header files. That usually works well already. But @shahms is right here, it can happen that you cannot change that order. The "correct" solution would be that ANTLR4 doesn't use EOF as it is a special word in C/C++ (not reserved, but still).

parrt commented 2 years ago

Ok, I've played around with this problem and thought about it. I just can't see breaking all previous ANTLR code that references EOF by renaming it to EOF_. Yes, we should have avoided EOF at the beginning but I can't change reality; I have to balance correctness with backward compatibility.

The "solution" is to avoid referencing EOF in your actual grammar or to add #undef EOF in a section to get it into the generate a code in the right spot. Something like this:

@parser::members {
#undef EOF
}

I also can't change the fact that C++ uses include files from the 1960s and that order matters. Oh and that it still has a preprocessor with macros that can completely change the syntax. C++ is over 40 years old now and still accreting features aggressively. Let's see if we can make it to 50! [snark]

parrt commented 2 years ago

Happy to reopen this if we get a better solution!

parrt commented 2 years ago

Removing EOF refs from the grammar seems to work but we still need to verify that the parser is at the end of input when it finishes. I had a thought: Rather than manually checking to see that we are at the end of input after the parser finishes, I wonder if we can simply make a call to parser->match(Token::EOF) or whatever right after calling the starting parser rule. Assuming of course that we have done a manual #undef EOF before that ref.

JoD commented 1 year ago

Hey all

Just bumped into this issue when trying to use ANTLR to parse to Boost multiprecision integers:

In file included from /usr/include/boost/multiprecision/cpp_int.hpp:12,
                 from /home/jod/workspace/exact-dev/src/manyworlds/ManyWorlds.hpp:23,
                 from /home/jod/workspace/exact-dev/src/manyworlds/Expression.hpp:71,
                 from /home/jod/workspace/exact-dev/src/manyworlds/parser/Parser.cpp:22:
/usr/include/boost/multiprecision/number.hpp: In function ‘std::istream& boost::operator>>(std::istream&, boost::rational<boost::multiprecision::number<Backend, ExpressionTemplates> >&)’:
/usr/include/boost/multiprecision/number.hpp:2137:12: error: ‘EOF’ was not declared in this scope
 2137 |    while ((EOF != (c = static_cast<char>(is.peek()))) && (c == 'x' || c == 'X' || c == '-' || c == '+' || (c >= '0' && c <= '9') || (have_hex && (c >= 'a' && c <= 'f')) || (have_hex && (c >= 'A' && c <= 'F'))))
      |            ^~~
In file included from /usr/include/boost/multiprecision/cpp_int.hpp:12,
                 from /home/jod/workspace/exact-dev/src/manyworlds/ManyWorlds.hpp:23,
                 from /home/jod/workspace/exact-dev/src/manyworlds/Expression.hpp:71,
                 from /home/jod/workspace/exact-dev/src/manyworlds/parser/Parser.cpp:22:
/usr/include/boost/multiprecision/number.hpp:33:1: note: ‘EOF’ is defined in header ‘<cstdio>’; did you forget to ‘#include <cstdio>’?
   32 | #include <cctype>  // isspace
  +++ |+#include <cstdio>
   33 | #ifndef BOOST_NO_CXX17_HDR_STRING_VIEW

The grammar did not mention any EOF, so to me the proposed workaround did not work.

For now, I try to ensure that the Antlr includes are processed after the multiprecision includes, which is quite a pain, as the autoformatter likes to reorder includes, and as I do not have a good idea how exactly the compiler orders the includes in C++.

So to me, this is actually an issue that makes me think about not using Antlr. Are there any other workarounds?

cangfengzhs commented 1 year ago

This is also a key reason why I don't consider using Antlr

parrt commented 1 year ago

@JoD interesting that EOF was automatically added even if it's not in the grammar... Can you post the grammar? Often people get around things by the order of includes.

@cangfengzhs I totally understand. We are in a difficult situation because they see plus plus model was created long before we started thinking carefully about the cross platform reserved words collision problem. There is so much C++ code out there that I hate to break all of that to tweak the antlr runtime for EOF.

raffopazzo commented 1 year ago

I just hit this problem just by including boost::iostreams on a brand new project. Whilst I can see that you don't want to break existing code, this is also preventing writing new one. What about guarding your #undef with some #ifdef ANTLR4CPP_LEGACY_EOF and old users can just -DANTLR4CPP_LEGACY_EOF in their build scripts? or maybe some options { EOF=EOF_; } in the grammar so that new code can "rename away from EOF"?

wangtao9 commented 11 months ago

I also encountered the same problem. In this modification on my forked version, I tried to remove #undef EOF from the Cpp-runtime, but I don't know how to modify the demo's g4 file https://github.com/antlr/antlr4/blob/dev/runtime/Cpp/demo/TParser.g4 to generate a correct parser. The error message is as follows:

Consolidate compiler generated dependencies of target antlr4_shared
[ 47%] Built target antlr4_shared
Consolidate compiler generated dependencies of target antlr4_static
[ 93%] Built target antlr4_static
[ 94%] Built target gtest
[ 95%] Built target gtest_main
[ 96%] Built target antlr4_tests
[ 96%] Built target gmock
[ 97%] Built target gmock_main
[ 97%] Built target GenerateParser
Consolidate compiler generated dependencies of target antlr4-demo
[ 98%] Building CXX object demo/CMakeFiles/antlr4-demo.dir/Linux/main.cpp.o
In file included from /usr/include/c++/11/cstdio:42,
                 from /usr/include/c++/11/ext/string_conversions.h:43,
                 from /usr/include/c++/11/bits/basic_string.h:6608,
                 from /usr/include/c++/11/string:55,
                 from /usr/include/c++/11/bits/locale_classes.h:40,
                 from /usr/include/c++/11/bits/ios_base.h:41,
                 from /usr/include/c++/11/ios:42,
                 from /usr/include/c++/11/ostream:38,
                 from /usr/include/c++/11/iostream:39,
                 from /home/wt/antlr4/runtime/Cpp/demo/Linux/main.cpp:13:
/home/wt/antlr4/runtime/Cpp/demo/generated/TParser.h:82:33: error: expected unqualified-id before ‘-’ token
   82 |     antlr4::tree::TerminalNode *EOF();
      |                                 ^~~
/home/wt/antlr4/runtime/Cpp/demo/generated/TParser.h:82:33: error: expected ‘)’ before ‘-’ token
   82 |     antlr4::tree::TerminalNode *EOF();
      |                                 ^~~

I need some guidance.

JoD commented 9 months ago

@JoD interesting that EOF was automatically added even if it's not in the grammar... Can you post the grammar? Often people get around things by the order of includes.

The grammar is here: https://gitlab.com/nonfiction-software/manyworlds/-/blob/main/src/antlr/manyworlds.g4?ref_type=heads Having a look, it does use EOF, so I might be mistaken there.

Indeed, manually fixing the order of includes was a feasible workaround. Feels brittle and had to figure out how to forbid the code formatter from touching these includes, but works for now.