antlr / grammars-v4

Grammars written for ANTLR v4; expectation that the grammars are free of actions.
MIT License

Does antlr have a length limit for parsing c/cpp code? #3599

Open yaosheng-zhang opened 1 year ago

yaosheng-zhang commented 1 year ago

Does ANTLR have a length limit for parsing C/C++ code? I'm using ANTLR to parse a 2000-line C file, but the parser only gets through about the first 500 lines. When I delete those first 500 lines, it parses a few hundred more. How can I get around this length limitation?

kaby76 commented 1 year ago

Please provide a few more details.

yaosheng-zhang commented 1 year ago

I've attached two files. code.txt is the Java code that parses the C source, and cfile.txt is the C file, which has 378 lines of code. I use ANTLR to parse cfile.txt, but the parse result covers only 34 lines. Please help! cfile.txt code.txt

kaby76 commented 1 year ago

... I use antlr ...

What version of Antlr are you using?

OK, you are using the "Java" target.

For the c grammar, using Antlr 4.13.0, the CSharp (dotnet 7.0.305) target, on Ubuntu 20.04.6, on an AMD Ryzen 7 2700 Eight-Core Processor, 16GB DDR4, code.txt and cfile.txt each take about 0.13 s to parse. cfile.txt is 377 lines long and code.txt is 50 lines (wc cfile.txt code.txt). Neither of these is over 500 lines long.

I tried it on a 1k line file from the GCC testsuite (Wmisleading-indentation.c). Took about the same amount of time.

NB: pre-processor directives should be ignored, but it looks like the c grammar parses only two types of directives. That's wrong. https://github.com/antlr/grammars-v4/issues/3601

kaby76 commented 1 year ago

Updated the grammar for parsing preprocessor directives. https://github.com/antlr/grammars-v4/pull/3602

For the Java target, using "grouped parsing" (aka "warm up parsing"), these are the runtimes for each of the test files.

07/11-12:05:44 ~/issues/g4-3601/c/Generated-Java
$ bash run.sh ../examples/*.c
Java 0 ../examples/add.c success 0.038
Java 1 ../examples/BinaryDigit.c success 0.001
Java 2 ../examples/bt.c success 0.04
Java 3 ../examples/dialog.c success 0.002
Java 4 ../examples/FuncCallAsFuncArgument.c success 0.01
Java 5 ../examples/FuncCallwithVarArgs.c success 0.009
Java 6 ../examples/FuncForwardDeclaration.c success 0.002
Java 7 ../examples/FunctionCall.c success 0.003
Java 8 ../examples/FunctionPointer.c success 0.009
Java 9 ../examples/FunctionReturningPointer.c success 0.004
Java 10 ../examples/helloworld.c success 0.0
Java 11 ../examples/integrate.c success 0.013
Java 12 ../examples/ll.c success 0.002
Java 13 ../examples/ParameterOfPointerType.c success 0.001
Java 14 ../examples/pr403.c success 0.0
Java 15 ../examples/TypeCast.c success 0.007
Java 16 ../examples/Wmisleading-indentation.pp.c success 0.073
Total Time: 0.405
07/11-12:06:00 ~/issues/g4-3601/c/Generated-Java
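
As a rough illustration of what "grouped parsing" means here: every file is parsed inside one JVM process, so class loading and JIT warm-up are paid for once rather than per file. The sketch below is not the repository's run.sh or its driver. `GroupedTiming`, `Result`, and the `parseAction` callback are hypothetical names, and the callback stands in for whatever lexes and parses one file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Times several inputs inside one JVM process (hypothetical helper, not the repo's driver). */
final class GroupedTiming {

    /** One row of output: input name, whether the action succeeded, elapsed seconds. */
    static final class Result {
        final String name;
        final boolean success;
        final double seconds;

        Result(String name, boolean success, double seconds) {
            this.name = name;
            this.success = success;
            this.seconds = seconds;
        }
    }

    /**
     * Runs parseAction once per input, in order, and records wall-clock time for each.
     * parseAction is a stand-in for "lex and parse this file" and signals failure by throwing.
     */
    static List<Result> timeAll(List<String> inputs, Consumer<String> parseAction) {
        List<Result> results = new ArrayList<>();
        for (String input : inputs) {
            long start = System.nanoTime();
            boolean success = true;
            try {
                parseAction.accept(input);
            } catch (RuntimeException e) {
                success = false;
            }
            double seconds = (System.nanoTime() - start) / 1e9;
            results.add(new Result(input, success, seconds));
        }
        return results;
    }

    public static void main(String[] args) {
        // Dummy action so the sketch runs without a generated parser.
        Consumer<String> fakeParse = name -> {
            if (name.isEmpty()) {
                throw new IllegalArgumentException("no input");
            }
        };
        for (Result r : timeAll(List.of(args), fakeParse)) {
            System.out.printf("Java %s %s %.3f%n", r.name, r.success ? "success" : "fail", r.seconds);
        }
    }
}
```

In a real driver the callback would build the lexer and parser for the named file. Starting a fresh JVM per file instead would add start-up cost to every measurement.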

kaby76 commented 1 year ago

OK. "code.txt" is your driver code for the Java target.

"cfile.txt" is NOT a C-language file. It's a C++ source file. For example, it contains a class declaration "class ImageServer". Classes do not exist in the C language. So, you are using the wrong grammar.

This cannot be parsed with the c grammar; it needs the cpp grammar. Starting over...
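
One way to catch this kind of mismatch before picking a grammar is a quick scan for constructs that C does not have, such as the class declaration mentioned above. The following is a crude heuristic, not a language detector. `SourceKindGuess` and `looksLikeCpp` are made-up names, and the pattern will misfire on matches inside comments or string literals, or on C code that happens to use `class` as an identifier.

```java
import java.util.regex.Pattern;

/** Guesses whether a source text is C++ rather than C (hypothetical helper, heuristic only). */
final class SourceKindGuess {

    // Keywords and the scope operator that appear in typical C++ but not in typical C.
    private static final Pattern CPP_ONLY = Pattern.compile(
            "\\b(class|namespace|template|typename)\\b"   // C++ keywords
            + "|\\w+::\\w+");                             // qualified names such as std::string

    /** Returns true if any C++-only construct is found anywhere in the text. */
    static boolean looksLikeCpp(String source) {
        return CPP_ONLY.matcher(source).find();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeCpp("class ImageServer {};"));
        System.out.println(looksLikeCpp("int main(void) { return 0; }"));
    }
}
```

A positive result suggests trying the cpp grammar first. A negative result proves nothing, since plenty of C++ files use none of these constructs.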

kaby76 commented 1 year ago

$ tail -n +15 /c/Users/Kenne/Downloads/cfile.txt | head
STRICT_MODE_OFF
#include "json.hpp"
STRICT_MODE_ON
#include <iostream>
using namespace mavlink_utils;
using namespace mavlinkcom;
extern std::string replaceAll(std::string s, char toFind, char toReplace);
void UnitTests::RunAll(std::string comPort, int boardRate)
{
    com_port_ = comPort;
07/11-12:27:04 ~/issues/g4-3601/cpp/Generated-Java

This input is C++ source code that has not yet been preprocessed. It cannot be parsed cleanly with the cpp grammar because the macro call STRICT_MODE_OFF is not a C++ statement. The input should be the source code after preprocessing.
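
A minimal sketch of getting preprocessed input from Java, assuming GCC's `g++` is installed and on the PATH. `-E` stops after preprocessing, `-P` suppresses linemarkers, and `-x c++` tells the driver to treat a file such as cfile.txt as C++ despite its extension. `PreprocessSketch` and its methods are hypothetical names. A real project would also need its include paths and macro definitions (for example via `-I` and `-D`) so that headers such as json.hpp resolve.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Runs an external C/C++ preprocessor and captures its output (hypothetical helper). */
final class PreprocessSketch {

    /** Builds the command line. The -x option must come before the source file it applies to. */
    static List<String> command(String compiler, String language, String sourcePath) {
        return List.of(compiler, "-E", "-P", "-x", language, sourcePath);
    }

    /** Runs the preprocessor and returns its standard output as text. */
    static String preprocess(String compiler, String language, String sourcePath)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(compiler, language, sourcePath))
                .redirectError(ProcessBuilder.Redirect.INHERIT) // show preprocessor errors on stderr
                .start();
        String output = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException(compiler + " exited with status " + exitCode);
        }
        return output;
    }

    public static void main(String[] args) throws Exception {
        // Assumption: args[0] is the path of a C++ source file.
        System.out.print(preprocess("g++", "c++", args[0]));
    }
}
```

The returned text, not the original file, is what would then be handed to the lexer. Note that the output also contains the expanded contents of every included header, so it is usually much longer than the original file.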

Even so, the cpp grammar does get through the input, though with an error and rather slowly (about 1.5 s).

$ bash run.sh /c/Users/Kenne/Downloads/cfile.txt
line 19:0 no viable alternative at input 'STRICT_MODE_OFF#include "json.hpp"\rSTRICT_MODE_ON#include <iostream>\rusing'
Java 0 C:/Users/Kenne/Downloads/cfile.txt fail 1.498
Total Time: 1.662
07/11-12:37:18 ~/issues/g4-3601/cpp/Generated-Java
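
The error line in the output above follows the format printed by ANTLR's default console error listener: `line <line>:<column> <message>`. To find out where a parse first went wrong, for instance when a parse result seems to stop partway through a file, these lines can be picked apart mechanically. This is a sketch only; `AntlrErrorLine` is a hypothetical helper, not part of the ANTLR runtime.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Extracts line, column, and message from a console error line (hypothetical helper). */
final class AntlrErrorLine {

    // Matches text of the form "line 19:0 no viable alternative at input ...".
    private static final Pattern FORMAT = Pattern.compile("line (\\d+):(\\d+) (.*)");

    final int line;
    final int column;
    final String message;

    private AntlrErrorLine(int line, int column, String message) {
        this.line = line;
        this.column = column;
        this.message = message;
    }

    /** Returns the parsed fields, or empty if the text is not in the expected format. */
    static Optional<AntlrErrorLine> parse(String text) {
        Matcher m = FORMAT.matcher(text);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(new AntlrErrorLine(
                Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)), m.group(3)));
    }

    public static void main(String[] args) {
        parse("line 19:0 no viable alternative at input 'STRICT_MODE_OFF'")
                .ifPresent(e -> System.out.println("first error at line " + e.line + ", column " + e.column));
    }
}
```

For the run above, the reported position is line 19, column 0, which is where the unexpanded STRICT_MODE_OFF macro appears.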

yaosheng-zhang commented 1 year ago

I'm using ANTLR 4.9 via Maven. Does that mean a .c file longer than 500 lines can't be parsed? I downloaded c.g4 from the official ANTLR repository. Do I need to preprocess the input myself? Is there a .g4 file that can parse both C++ and C?