antlr / grammars-v4

Grammars written for ANTLR v4; expectation that the grammars are free of actions.

!(bool)true is not acceptable on cpp #2328

Open HeeMyung opened 2 years ago

HeeMyung commented 2 years ago

Hello, there is an error when parsing !(bool) with the cpp parser. I'm pretty new here, so I can't figure out why it happens...

kaby76 commented 2 years ago

In the future, please indicate the grammar you are using, and the input, so we can reproduce what you are seeing.

Grammar: cpp.

Input:

#include <iostream>
main()
{
    std::cout << "Hello World!";
    bool x = !(bool)true;
    return 0;
}

Output (from a C# target parser generated using trgen):

$ ./bin/Debug/net5.0/Test.exe -file ../examples/helloworld.cpp
line 6:16 no viable alternative at input '!(bool)'
line 6:16 no viable alternative at input '!(bool)'
Time: 00:00:00.1567071
Parse failed.

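Notes:

  • The lexer grammar contains a non-standard predicate. It should be placed in a base class, and implemented just as a semantic predicate rather than an action-block/throw.
  • The input compiles and runs fine with g++. (g++ helloworld.cpp; ./a.exe)
  • bool x = ! true; parses. bool x = (bool) 1; parses.
  • The tokens for input are: [@0,0:19='#include <iostream>\r',<9>,channel=1,1:0] [@1,23:26='main',<132>,3:0] [@2,27:27='(',<85>,3:4] [@3,28:28=')',<86>,3:5] [@4,31:31='{',<89>,4:0] [@5,35:37='std',<132>,5:1] [@6,38:39='::',<127>,5:4] [@7,40:43='cout',<132>,5:6] [@8,45:45='<',<102>,5:11] [@9,46:46='<',<102>,5:12] [@10,48:61='"Hello World!"',<4>,5:14] [@11,62:62=';',<128>,5:28] [@12,66:69='bool',<14>,6:1] [@13,71:71='x',<132>,6:6] [@14,73:73='=',<101>,6:8] [@15,75:75='!',<100>,6:10] [@16,76:76='(',<85>,6:11] [@17,77:80='bool',<14>,6:12] [@18,81:81=')',<86>,6:16] [@19,82:85='true',<5>,6:17] [@20,86:86=';',<128>,6:21] [@21,90:95='return',<59>,7:1] [@22,97:97='0',<1>,7:8] [@23,98:98=';',<128>,7:9] [@24,101:101='}',<90>,8:0] [@25,106:105='',<-1>,10:0]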

kaby76 commented 2 years ago

It is not clear which version of the spec the grammar in the repo (cpp) is supposed to implement. The files in the directory say "14", as in ISO/IEC 14882:2014, a working draft here. But it is missing whole rules from the spec. For example, just look at the rule for primary-expression, "adobe pdf page" 1229. Come on people.

Also, the directory name for the grammar in the repo is bogus. It should not be "cpp". "cpp" stands for the C preprocessor. Use "c++2014" or "c++" or "whatever!". Just don't name it "cpp". On Msys2 or Linux, "cpp" is the C preprocessor and "g++" is the compiler. On Windows, Visual Studio calls the compiler "cl".

HeeMyung commented 2 years ago

In the future, please indicate the grammar you are using, and the input, so we can reproduce what you are seeing.

Grammar: cpp.

Input:

#include <iostream>
main()
{
  std::cout << "Hello World!";
  bool x = !(bool)true;
  return 0;
}

Output (from a C# target parser generated using trgen):

$ ./bin/Debug/net5.0/Test.exe -file ../examples/helloworld.cpp
line 6:16 no viable alternative at input '!(bool)'
line 6:16 no viable alternative at input '!(bool)'
Time: 00:00:00.1567071
Parse failed.

Notes:

  • The lexer grammar contains a non-standard predicate. It should be placed in a base class, and implemented just as a semantic predicate rather than an action-block/throw.
  • The input compiles and runs fine with g++. (g++ helloworld.cpp; ./a.exe)
  • bool x = ! true; parses. bool x = (bool) 1; parses.
  • The tokens for input are:
      [@0,0:19='#include <iostream>\r',<9>,channel=1,1:0]
      [@1,23:26='main',<132>,3:0]
      [@2,27:27='(',<85>,3:4]
      [@3,28:28=')',<86>,3:5]
      [@4,31:31='{',<89>,4:0]
      [@5,35:37='std',<132>,5:1]
      [@6,38:39='::',<127>,5:4]
      [@7,40:43='cout',<132>,5:6]
      [@8,45:45='<',<102>,5:11]
      [@9,46:46='<',<102>,5:12]
      [@10,48:61='"Hello World!"',<4>,5:14]
      [@11,62:62=';',<128>,5:28]
      [@12,66:69='bool',<14>,6:1]
      [@13,71:71='x',<132>,6:6]
      [@14,73:73='=',<101>,6:8]
      [@15,75:75='!',<100>,6:10]
      [@16,76:76='(',<85>,6:11]
      [@17,77:80='bool',<14>,6:12]
      [@18,81:81=')',<86>,6:16]
      [@19,82:85='true',<5>,6:17]
      [@20,86:86=';',<128>,6:21]
      [@21,90:95='return',<59>,7:1]
      [@22,97:97='0',<1>,7:8]
      [@23,98:98=';',<128>,7:9]
      [@24,101:101='}',<90>,8:0]
      [@25,106:105='',<-1>,10:0]

Thank you! I'll do that next time. Btw, this is exactly the problem I have.

kaby76 commented 2 years ago

I'm scraping the grammar from scratch from the ISO spec, minus all the semantic predicates that will need to be added. I'll let folks know how it goes. I have a feeling it will fix this problem.

HeeMyung commented 2 years ago

I'm scraping the grammar from scratch from the ISO spec, minus all the semantic predicates that will need to be added. I'll let folks know how it goes. I have a feeling it will fix this problem.

Oh! Could you share that one with me?

kaby76 commented 2 years ago

It will take several days, possibly a week. The scrape of the text will be done via a program that I have to write. I have to re-OCR the entire spec because, in the text I can already read, the "opt" subscripts are not being associated with the symbol that occurs just prior to the "opt". Nobody nowadays should type grammars in from scratch, or leave undocumented the transformations they apply to a scraped grammar.

kaby76 commented 2 years ago

An initial version of the scraper, which is in C#, is here. It does not yet output a correct grammar. So far, it outputs a grammar in pseudo Antlr4 syntax with at least the correct number of rules, with comments for sections and page footers. A "select-all/copy" of text using a PDF reader such as Adobe Acrobat, or Google Chrome, does not retain spaces that well, hence the reason for the scraper. I will continue to refine the program until I have a working C++ grammar. I will be using Trash to refactor issues beyond the syntax, including the insertion of semantic predicates required by the spec. Finally, I'll scrape the other versions (2018, 2022) to get separate grammars for each of those versions.
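
To illustrate the "opt" problem, here is a minimal Python sketch of turning one spec-style alternative into pseudo-Antlr4 text. It is not the C# scraper mentioned above. It assumes the extracted text has already been split into tokens, that each "opt" subscript arrives as a separate `opt` token right after the symbol it modifies, and that the set of nonterminal names is known. The function name `to_antlr_alternative` and its parameters are hypothetical.

```python
def to_antlr_alternative(spec_tokens, nonterminals):
    """Convert one spec-style alternative into pseudo-Antlr4 text.

    Assumptions (illustrative only): `spec_tokens` is a list of strings,
    an optional symbol is followed by a separate "opt" token, and
    `nonterminals` is the set of rule names defined by the spec.
    """
    out = []
    for tok in spec_tokens:
        if tok == "opt":
            # Attach the marker to the symbol just before it.
            if not out:
                raise ValueError("'opt' marker has no preceding symbol")
            out[-1] += "?"
        elif tok in nonterminals:
            # Antlr rule names cannot contain hyphens.
            out.append(tok.replace("-", "_"))
        else:
            # Anything else is treated as a terminal and quoted.
            escaped = tok.replace("\\", "\\\\").replace("'", "\\'")
            out.append("'" + escaped + "'")
    return " ".join(out)


if __name__ == "__main__":
    names = {"decl-specifier-seq", "init-declarator-list"}
    tokens = ["decl-specifier-seq", "opt", "init-declarator-list", "opt", ";"]
    print(to_antlr_alternative(tokens, names))
```

If the "opt" marker is lost or attached to the wrong neighbour during text extraction, the resulting rule silently changes meaning, which is why the association has to be right before any further refactoring.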

kaby76 commented 2 years ago

This problem has to do specifically with these rules in the lexer:

BooleanLiteral: False_ | True_;
False_: 'false';
True_: 'true';

You cannot define more than one lexer rule that matches the literals 'true' or 'false'; each literal must be matched by exactly one rule. When several rules match the same text, the lexer picks the one that appears first, so it will always return BooleanLiteral, never True_ or False_, because those rules occur after BooleanLiteral. @Marti2203 made the change here but I don't know why. I will see what else Martin changed in the original CPP14.g4 grammar. I don't know if there is an accompanying Issue in Git (with a detailed explanation of why the changes were made) but it doesn't appear there is one.
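
The effect can be seen with a small Python model of the lexer's choice: longest match wins, and on a tie the rule listed first wins. This is a simplified sketch that only handles rules made of fixed literals; it is not how the Antlr runtime is implemented, and the names `match_token`, `BROKEN`, and `FIXED` are made up for the illustration.

```python
def match_token(rules, text, pos=0):
    """Pick a token at `pos` the way an Antlr-style lexer would.

    `rules` is a list of (rule_name, [literals]) in grammar order.
    The longest match wins; on a tie, the earlier rule wins.
    Returns (rule_name, lexeme) or None if nothing matches.
    """
    best = None
    for name, literals in rules:
        for lit in literals:
            if text.startswith(lit, pos):
                # Strictly longer only, so earlier rules win ties.
                if best is None or len(lit) > len(best[1]):
                    best = (name, lit)
    return best


# Lexer rules as they appear in the grammar discussed here.
BROKEN = [
    ("BooleanLiteral", ["false", "true"]),
    ("False_", ["false"]),
    ("True_", ["true"]),
]

# Same rules with BooleanLiteral no longer a lexer rule.
FIXED = [
    ("False_", ["false"]),
    ("True_", ["true"]),
]

if __name__ == "__main__":
    print(match_token(BROKEN, "true"))
    print(match_token(FIXED, "true"))
```

With the first rule list, 'true' can never come out as True_, so any parser rule that refers to True_ or False_ directly can never match.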

There are 16 other issues with the C++ grammar. I will try to fix the open bugs for C++.

My scraper/refactoring script is getting further, but there is still a bit to do. I found that there are 31 drafts of the C++ spec here.

Marti2203 commented 2 years ago

Hi @kaby76, I wanted to fix up the grammar, make it more conformant to the ISO spec and, of course, finally get it working, since there were things it could not parse... I am currently not involved with the grammars and will probably not be working on this in the next couple of months.

kaby76 commented 2 years ago

The grammar here did not implement preprocessor directives. I have an implementation for this now, but I'm not sure how I want to package the code here for testing yet. I am trying to use open-source C++ code for tests.

HeeMyung commented 2 years ago

The grammar here did not implement preprocessor directives. I have an implementation for this now, but I'm not sure how I want to package the code here for testing yet. I am trying to use open-source C++ code for tests.

Maybe I could test that with my code. How can I do that?

kaby76 commented 2 years ago

@HeeMyung I haven't yet checked the code in. Still much work to do. I'm not sure if I need to partition the preprocessor grammar and the C++ grammar. The preprocessor and C++ grammar are combined in Annex A in the C++ Spec doc, but they really are different grammars, with separate start rules, and more importantly, different lexer rules. It is probably better to separate them completely.

There are a number of problems in the current cpp grammar here in this repo. Probably the most outlandish thing is primaryExpression: literal+ | ..., which is an attempt to allow string concatenation. But it also allows if (1 2 3 4 5) .... The spec is clear though: string concatenation is handled in the preprocessor.
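
As a rough sketch of the alternative, adjacent string literals can be merged in the token stream before the parser ever sees them, so the parser rule needs only a single literal. The Python below is illustrative only: the token shape (kind, text) and the name `concatenate_adjacent_strings` are assumptions, and it handles only plain "..." literals, with no encoding prefixes or raw strings.

```python
def concatenate_adjacent_strings(tokens):
    """Merge runs of adjacent STRING tokens into a single STRING token.

    `tokens` is a list of (kind, text) pairs; STRING text includes its
    surrounding double quotes. Other token kinds pass through unchanged.
    """
    merged = []
    for kind, text in tokens:
        if kind == "STRING" and merged and merged[-1][0] == "STRING":
            previous = merged[-1][1]
            # Drop the closing quote of the previous literal and the
            # opening quote of this one, then join the contents.
            merged[-1] = ("STRING", previous[:-1] + text[1:])
        else:
            merged.append((kind, text))
    return merged


if __name__ == "__main__":
    stream = [
        ("ID", "s"), ("OP", "="),
        ("STRING", '"Hello, "'), ("STRING", '"World!"'),
        ("OP", ";"),
    ]
    print(concatenate_adjacent_strings(stream))
```

Because only STRING tokens are merged, a sequence such as 1 2 3 4 5 stays as five separate tokens and is rejected by a grammar whose primary expression accepts a single literal.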

Early next week it should be ready.

HeeMyung commented 2 years ago

primaryExpression: literal+ | ... also produces errors on my code. Actually, these bugs don't affect me much, because I use my parser only to extract some function declarations. But it's quite annoying every time I run it. And I have little knowledge of grammars, so I can't help you guys, sorry.

kaby76 commented 2 years ago

An update: the preprocessor and C++ rules in the grammar are intermingled and can't be easily separated (I tried). The preprocessor rules use expression in the C++ side of the grammar. That is kind of surprising because one can create some very complicated expressions in C++ (e.g., an assignment expression), which I don't think one can do with preprocessor symbols.

I don't have a tool in Trash that can yank off the rules used from the preprocessing_file rule in the grammar in order to partition the grammar into just C++ and another grammar for just the preprocessor. (I did have this functionality in an earlier version of Trash, and the code to do so does still exist deep in Trash's base layer for manipulating grammars.)

For now, the lexer mode will need to be set in order to tokenize input as preprocessor tokens. That means the grammar has to be split into lexer and parser grammars because lexer modes are not available in combined grammars. The parser grammar requires a predicate in order to implement the rule "A text line shall not begin with a # preprocessing token." as stated in the Spec.

The preprocessor itself is an Antlr Visitor. It computes a dictionary of preprocessor symbols and interprets the '#'-directives, accumulating the state of '#'-directives along the way. The output for the preprocessor is a text buffer that the C++ grammar parses.
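
The general shape of that pass, a symbol dictionary plus conditional state producing a text buffer, can be sketched in a few lines of Python. This is not the Antlr Visitor described above; it is a line-based toy that assumes object-like #define only, supports just #define, #undef, #ifdef, #ifndef, #else and #endif, performs no macro expansion, and simply drops other directives. The name `preprocess` is hypothetical.

```python
def preprocess(source):
    """Interpret a few '#'-directives; return (output_text, symbols)."""
    symbols = {}       # macro name -> replacement text
    active = [True]    # stack: is the current region being emitted?
    output = []
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped.startswith("#"):
            if active[-1]:
                output.append(line)
            continue
        parts = stripped[1:].split(None, 2)
        directive = parts[0] if parts else ""
        if directive in ("ifdef", "ifndef"):
            defined = parts[1] in symbols
            wanted = defined if directive == "ifdef" else not defined
            active.append(active[-1] and wanted)
        elif directive == "else":
            if len(active) == 1:
                raise ValueError("#else without matching #ifdef/#ifndef")
            current = active.pop()
            # Emit the else-branch only if the enclosing region is
            # active and the if-branch was not taken.
            active.append(active[-1] and not current)
        elif directive == "endif":
            if len(active) == 1:
                raise ValueError("#endif without matching #ifdef/#ifndef")
            active.pop()
        elif directive == "define" and active[-1]:
            symbols[parts[1]] = parts[2] if len(parts) > 2 else ""
        elif directive == "undef" and active[-1]:
            symbols.pop(parts[1], None)
        # Any other directive (e.g. #include) is dropped in this sketch.
    return "\n".join(output), symbols


if __name__ == "__main__":
    text, table = preprocess(
        "#define DEBUG\n#ifdef DEBUG\nint a;\n#else\nint b;\n#endif\nint c;"
    )
    print(text)
    print(table)
```

The returned text plays the role of the buffer that is handed to the C++ parser once all directives have been interpreted.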

I'm not sure when I'll have this complete, but likely not till next week, and even then, just a prototype, albeit a pretty good one. A prototype of just preprocessing is here. It is extremely preliminary.

kaby76 commented 2 years ago

Status update: I have an implementation of the grammar (here) that wraps the preprocessor into a special method called "start", which is implemented in the base class for the parser. "start" looks like a rule, and you call it on the parser just the same ("parser.start()"), but it's not actually in any grammar. The method does quite a bit of juggling: it creates a new parser and lexer for preprocessing, evaluates the preprocessing rules to compute the output, then replaces the input stream for the parser with this new input and calls the regular grammar entry point ("translation_unit"). Expressions are the biggest issue since they require a lot of code to be implemented in the preprocessor visitor class. I have implemented only a minimal set for now. A discussion of the issues in scraping the C++ grammar is here in my blog.

DonMathi commented 2 years ago

I think the problem is in CPP14Parser.g4 at:

    assignmentExpression:
    conditionalExpression
    | **logicalOrExpression** assignmentOperator initializerClause
    | throwExpression;

This should maybe be:

    assignmentExpression:
    conditionalExpression
    | **logicalNotExpression** assignmentOperator initializerClause
    | throwExpression;

The logicalNotExpression is missing completely from the grammar

kaby76 commented 2 years ago

The logicalNotExpression is missing completely from the grammar

No, that is not the problem. I pointed out here that the lexer grammar has lexer rule BooleanLiteral that hides the tokens True_ and False_. The solution is to rename the rule as a parser rule, i.e., boolean_literal. You can give it a try and see that it fixes the specific problem. But, I haven't submitted a PR because the entire grammar should be replaced.

The grammar in this repo is derived from the C++14 Spec draft n4296, from the C++ Standards Committee. The draft does not have a logical-not-expression, but it does have a unary-expression with a unary-operator, which can derive !. The drafts for C++17 (n4660) and C++20 (n4878) likewise do not contain a logical-not-expression. We should not be second-guessing the C++ Standards Committee. They have been studying problems with the language and grammar for decades.
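
To show that the spec's own shape is enough for !(bool)true, here is a minimal recursive-descent sketch in Python of two heavily simplified rules: unary-expression as a unary-operator followed by a cast-expression, and cast-expression as either a unary-expression or a parenthesized type followed by a cast-expression. Everything else is an assumption made for the illustration: the input is a pre-split token list, `TYPE_NAMES` is a tiny stand-in for the spec's type-id, and primary expressions are reduced to single alphanumeric tokens or parenthesized expressions.

```python
UNARY_OPERATORS = {"*", "&", "+", "-", "!", "~"}
TYPE_NAMES = {"bool", "int", "char"}   # stand-in for type-id


def parse_cast_expression(tokens, pos=0):
    """cast-expression: '(' type-id ')' cast-expression | unary-expression"""
    if (pos + 2 < len(tokens) and tokens[pos] == "("
            and tokens[pos + 1] in TYPE_NAMES and tokens[pos + 2] == ")"):
        type_name = tokens[pos + 1]
        operand, end = parse_cast_expression(tokens, pos + 3)
        return ("cast", type_name, operand), end
    return parse_unary_expression(tokens, pos)


def parse_unary_expression(tokens, pos):
    """unary-expression: unary-operator cast-expression | primary"""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    tok = tokens[pos]
    if tok in UNARY_OPERATORS:
        operand, end = parse_cast_expression(tokens, pos + 1)
        return ("unary", tok, operand), end
    if tok == "(":
        inner, end = parse_cast_expression(tokens, pos + 1)
        if end >= len(tokens) or tokens[end] != ")":
            raise SyntaxError("expected ')'")
        return inner, end + 1
    if tok.isalnum():
        return ("primary", tok), pos + 1
    raise SyntaxError("unexpected token " + repr(tok))


if __name__ == "__main__":
    tree, end = parse_cast_expression(["!", "(", "bool", ")", "true"])
    print(tree, end)
```

The ! is consumed as a unary-operator and its operand is the cast, so no separate logical-not rule is needed.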

The cpp (C++14) grammar in this repo contains numerous and incorrect changes over the official grammar from the Spec. I describe some of those changes and why some are really bad in a table at the end of this blog entry.

I have been rewriting the grammar for C++14 and I will also have grammars for C++17/20/23. The work involves:

This has taken a couple of months of work so far. Unfortunately, I don't know when I'll be finished.

--Ken