antlr / antlr4

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.
http://antlr.org
BSD 3-Clause "New" or "Revised" License
17.18k stars 3.29k forks source link

Rule name changed, extra time costs. #3890

Open hagen666 opened 2 years ago

hagen666 commented 2 years ago

Details: I downloaded cypher grammer from openCypher website, parsing 1000+ queries costs 300ms in c++ target. However when I removed the suffix 'oC_' in the grammer, the parsing time turned to be 7 seconds.

Does different rule name have different parsing time in C++? To verify this, I cannot reproduce the issue in Java target, but only in C++ target.

kaby76 commented 2 years ago

We have the original grammar, but not the input. Please provide the input string or file so we can investigate the parser.

hagen666 commented 2 years ago

It's a subset can reproduce the issue:

match (n),(m) where id(n)= '46' and id(m)= 'idsx' create (n)-[r:label3]->(m) return r
match (n)-->(m) with n, count(*) as p where p > 1 return n,p
match (n)--(m) with n, m limit 100 with n, collect(m) as c, count(*) as p where p > 1 with n,c,p match (n)-->(m1) where m1 in c return m1, p, c limit 10
match (n) return case n.propName8='enumVal1' when true then 'yes' when false then 'no' else (case n:label2 when true then 'unknown' when false then labels(n) end) end
match (n) return case n.propName8 when 'enumVal1' then 'Female' else 'Male' end
match (n) return case n.propName8 when 'enumVal1' then 'Female' when 'idsx4' then 'Male' end
match (n) return count(*)
match (n) return count(distinct n.propName9)
match (n) return degree(n), inDegree(n), outDegree(n) limit 1
match (n) return distinct n.propName9
match (n) return distinct toUpper(n.propName3), toLower(n.propName8)
match(n) return labels(n), count(*)
match (n) return max(n.propName7), max(n.propName2),min(n.propName7), min(n.propName2)
match (n) return n;
match (n) return n['propName9']
match (n) return n.propName9, collect((n:label2))
match(n) return n.propName3, count(*)
match(n) return n.propName3, count(n.propName7)
match (n) return n order by n.propName1 limit 1000
match (n) return [(n)-[r:label3]->(m)|m]
match (n)-[r:labelX*1..2]->(m) return count(*)
match (n)-[r:labelX*1..3]->(m) return n, m
match (n)-[r]->(m:label1) return count(*)
match (n)-[r]-(m)   return avg(toFloat(r.propName11)),avg(toDouble(r.propName11)),avg(toInteger(r.propName11))
match (n)-[r]-(m)   return sum(toFloat(r.propName11)),sum(toDouble(r.propName11)),sum(toInteger(r.propName11))
match (n:label2{propName4:'18-24'}) return n limit 10
MATCH (n:label2 { propName4: 'keywords2', `propName5`: 'propVal3'}) DELETE n
MATCH (n:label2 { propName4: 'keywords2', `propName5`: 'episode2'}) detach delete n return n
match (n:label2),(m:label1)with {a:n.propName7,b:m.propName1} as p,n,m where p.a=['S'+p.b] return n,m
match (n:label2{propName3:'idsx3'}) return n skip 1
match (n:label2{propName3:'idsx2'}) create (m:label2{`propName5`:'99999'}) return n,m
match (n:label2) return n, n.propName4 limit 1000
match (n:label2) return n skip 10
match (n:label2)-[r:label3]->(m) return count(*)
match (n:label2)  where id(n)='username' set  n.propName12='longlongstringlonglongstringlonglongstring' return n.propName12 union match (n:label2)  where id(n)='Mercedes' set  n.propName12='apsoutheast1b-POD04-C6-4-Common002-CNA3X4' return n.propName12 union match (n:label2)  where id(n)='Katherine' set  n.propName12='apsoutheast1b-POD04-C6-4-Common002-CNA3X5' return n.propName12 union match (n:label2)  where id(n)='Stuart' set  n.propName12='longlongstringlonglongstringlonglongstring2' return n.propName12 union match (n:label2)  where id(n)='Jacob' set  n.propName12='apsoutheast1b-POD04-C6-4-Common002-CNA3XX' return n.propName12
match (n:label2) where  n.propName12> 'longlongstringlonglongstringlonglongstring' and n.propName12 < 'longlongstringlonglongstringlonglongstring2' return n
match (n:label2) where  n.propName12='longlongstringlonglongstringlonglongstring' and n.propName3='idsx3' return n
match (n:label2) where  n.propName12='longlongstringlonglongstringlonglongstring' and n.`propName5`='episode2' return n
match (n:label2) where n.propName7 > 10 and n.propName7 < 20  return n
match (n:label2) where n.propName7 >= 5 and n.propName7 <= 10 return n
match (n:label2) where n.propName7 > 5 and n.propName7 < 10 return n
match (n:label2) where n.propName7=5 return n
match (n:label2) where n.propName7 > 5 return n
match (n:label2) with collect(n) as person return [x in person where x.propName8='enumVal1']
match (n:label2) with collect(n) as person return [x in person where x.propName8='enumVal1'|x.propName7]
match (n:label2) with collect(n) as person return [x in person|x.propName7]
MATCH (n:label2{`propName5`:'p22'})-[r]->(m:label1{propName2:7}) SET r = {rating: 99}
MATCH (n:label2 {`propName5`: 'p22'}) SET n.`high`= 'propVal3' return n
MATCH (n:label2 {`propName5`: 'p22'}) SET n = {`propName5`:'propVal3'} return n
MATCH (n:label2 {`propName5`: 'p22'}) SET n.`propName5`= 'propVal3' return n
match (n)  where {a:n}={a:n} return n
match (n) where id(n)='1' return n;
match (n) where id(n) in ['1','2'] return n
match (n) where id(n) in ['46', '38'] return size((n)-->()),size((n)<--()), size((n)-[]-()),size((n)-[:label3]->()),size((n)<-[:label3]-()), size((n)-[:label3|:label31]-())
match (n)  where n in [n,1] return n
match (n) where n.propName2 is null return n
match (n) where n.propName3 <> null return n
match (n) where n.propName3 = null return n
match (n) where n.propName3 starts with null return n
match (n) where n.propName7 > 1 and n.propName7 < 10 return n limit 10
match (n) where n.propName7 in null return n
match (n) where n.propName7 is not null return n
match (n) where n.propName7 is null return n
match (n) where n.propName7 < null return n
match (n) where n.propName7 <> ';' return n
match (n) with n.propName9 as c, collect(n) as p return c, p[1] order by c
match (n) with [(n)-[r:label3]->(m)|m][2] as n1, [(n)-[r:label3]->(m)|m][0..1] as li return li, [(n1)-[r]-(m)|m][0..1] as n2
MATCH (p:label1 { propName9:'episode' }) MATCH (q:label1 {propName9: p.propName9 }) RETURN p, q
MATCH (p:label1 { propName9:'episode' })  where not (p)<--(:label1) return count(p)
MATCH (p:label1 { propName9:'episode' })  where (p)<--(:label2{propName8:'idsx4'}) return p
match p=(n)-[*1..3]->(m) return any (x in nodes(p) where x.propName8='enumVal1'), single (x in nodes(p) where x.propName8='enumVal1'), all (x in nodes(p) where x.propName8='enumVal1'), none (x in nodes(p) where x.propName8='enumVal1')  limit 3
match p=(n)-[*1..3]->(m) where any (x in nodes(p) where x.propName8='enumVal1') return p limit 3
match p=(n)-[*3..3]->(m) where id(n)='username' and all(x in nodes(p) where labels(x)='label2') return p limit 10
match p=(n)-[:labelX*3..3]->(m) where id(n)='username' and all(x in relationships(p) where type(x)='friends') return p limit 10
match p=(n)-[r*1..2]->(m) return  size(r), length(p), head(nodes(p)), last(nodes(p)), head(relationships(p)), last(relationships(p)) limit 10
match p = (n)-[r1:labelX*1..2{propName13:8}]->(m)-[r2:labelX*1..2{propName13:8}]->(c) where id(n)='William' return p
match p = (n)-[r1:labelX*1..2]->(m)-[r2]->(c) where id(n)='username' return p limit 100
match p = (n)-[r1:labelX*3..4{propName13:8}]->(m) return p
match p = (n)-[r1:labelX{propName13:8}]->(m)-[r2:labelX{propName13:8}]->(c) where id(n)='William' return p
match p=(n)-[r:labelX*1..2]->(m) return p
match ()-[r]->() return distinct r['propName14']
return {key:123,mon:[1,1,{so:'1235'}],key2:12} = {key2:12,key:123, mon:[1,2,{so:null}]}
return null in []
unwind [[1,2],[10,11,12], 9] as p return p[1..],p[-1..], p[1..2], p[..-2], p[-1..2], p[..2], p[0..-1], p[-2..-1]
Unwind [1,2,3] as p return p
unwind [[1,2,3], [], null] as p return head(p), last(p), size(p)
unwind [1,3.5,'abc',[1,2,3], {a:1, b:2},null] as p with [size(p)] as c return collect(c)
unwind [{a:1,b:2}] as p return p['a']
unwind [[null], 1,[1,2],'a',['bc'],3,null] as x unwind x as p return p
unwind [toLong('9223372036854775807'), toDouble('1.79769e+308'),toDouble('1.79769e+308'),'1.79769e+308'] as p  return sum(p), sum(toDouble(p))
with [1,2,3,4,5,6,7] as p return p[..-1],p[1..]
with [1,2] as a return a
with [1,2] in [[1,2],3] as p return p
with {A:1} as p return p
with [{a:1,b:2}] as p return p[0]
with [{a:1,b:2}] as p return p[-1]
with {A:1} in [1,2,{A:1}] as p return p
with null in [1,2,null] as p return p
hagen666 commented 2 years ago

When the prefix exists, the c++ target is as fast as the java.

kaby76 commented 2 years ago
$ ./bin/Debug/net6.0/Test.exe -file ../in.txt
line 2:0 no viable alternative at input '\r\nmatch'
Time: 00:00:00.1159912
Parse failed.

The parse for that input and exact grammar (https://s3.amazonaws.com/artifacts.opencypher.org/M19/Cypher.g4) fails. Is this input correct? Are you expecting the parse error?

hagen666 commented 2 years ago

Parse each line not the whole files. here is the C++ code:

bool ParseCypher(const std::string& statement) {
    antlr4::ANTLRInputStream input(statement);
    CCypherLexer lexer(&input);
    antlr4::CommonTokenStream tokens(&lexer);
    tokens.fill();
    CCypherParser parser(&tokens);
    auto stmt = parser.oC_Cypher();
    // output parse error here.

}
int main()
{
    std::vector<std::string> stmts;
    std::ifstream ifs("input.txt", std::ios::in);
    char line[1024];
    while (ifs.getline(line, 1024)) {
        stmts.emplace_back(line);
    }
    auto startTime = std::chrono::system_clock::now();
    size_t limit = 1000;
    size_t n = 0;
    while (n < limit) {
        for (const auto &ln : stmts) {
            ParseCypher(ln);
            n++;
        }
    }
    auto endTime = std::chrono::system_clock::now();
    std::cout << "time:" << std::chrono::duration_cast<std::chrono::microseconds>(endTime - startTime).count() << std::endl;
    return 0;
}

The java version is:

public class Main {

    public static class CypherErrorListener extends ConsoleErrorListener {

    }
    public static void ParseCypher(String v) {
        ANTLRInputStream inputStream = new ANTLRInputStream(v);
        CypherLexer lexer = new CypherLexer(inputStream);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        tokens.fill();
        CypherParser parser = new CypherParser(tokens);
        parser.addErrorListener(new ConsoleErrorListener());
        parser.oC_Cypher();

    }
    public static void main(String[] args) {
        List<String> arrayList = new ArrayList<>();
        File f = new File("input.list");
        try (BufferedReader br = new BufferedReader(new FileReader(f))) {
            String ln;
            while ((ln = br.readLine())!=null) {
                arrayList.add(ln);
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        System.out.println(arrayList.size());
        long start = System.currentTimeMillis();
        int limit = 1000;
        int count = 0;
        while (count < limit) {
            for (String str : arrayList) {
                ParseCypher(str);
                count++;
            }
        }
        long end = System.currentTimeMillis();
        System.out.println((end - start));
    }
}
kaby76 commented 2 years ago

I'm not sure why you want to incur all that overhead of creating and throwing away the parser and lexer object with each line.

For now, I'll first try to do what people normally do--parse the entire input--and investigate whether there are issues in the grammar itself as there is apparently a rename refactoring involved in the poor performance. Besides, the templates I wrote, which span all targets for Antlr, don't test this way, and I want to test this across every target/OS.

hagen666 commented 2 years ago

May be you need to edit the grammer to support multilie Query in cypher. A runnable example is:

oC_Cypher
      :  (SP? oC_Statement ( SP? ';' )? SP? CR? LF?)+ EOF ;

the origin version is:

oC_Cypher
      :  SP? oC_Statement ( SP? ';' )? SP? EOF ;

But i'm not sure the parse time.

kaby76 commented 2 years ago

How exactly did you rename each symbol? You didn't just literally delete "oC_" from "oC_AddOrSubtractExpression" to get "AddOrSubtractExpression"? Or did you rename it to "addOrSubtractExpression"?

hagen666 commented 2 years ago

I only remove 'oC_', and rename 'oC_Cypher' to 'cypher' but i have an old one, which use 4.7.2, the rule name has the first letter in low case, and still has the low performance.

kaby76 commented 2 years ago

Ok, just checking, because I followed your instructions for the edits, and came up with the wrong grammar.

So that we are on the same page, attached here is the modified grammar with start rule "start" and the refactored parser symbols. Cypher.txt

I couldn't see editing the grammar by hand with a text editor, so I wrote this script to do the refactoring. It uses Trash, my toolkit for Antlr grammars. rename.txt

kaby76 commented 2 years ago

Note, in addition to renaming parser symbols s/oC_\([A-Z]\)/\L\1/, NULL must be renamed (NULL_), and oC_Match renamed (match_), otherwise the refactored grammar does not compile for C++ due to "symbol conflict".

Attached is the refactored grammar I generated from script. Over the whole input file (attached), for a simple driver that uses one parser/lexer per entire input file, I don't see any performance issue. Cypher.txt rename.txt in.txt

ken@DESKTOP-DL44R7B:/mnt/c/msys64/home/Kenne/ques/r3$ trgen -s start -t Cpp
Rendering template file from Cpp/CaseChangingCharStream.cpp to Generated/CaseChangingCharStream.cpp
Rendering template file from Cpp/CaseChangingCharStream.h to Generated/CaseChangingCharStream.h
Rendering template file from Cpp/cmake/antlr4-generator.cmake.in to Generated/cmake/antlr4-generator.cmake.in
Rendering template file from Cpp/cmake/antlr4-runtime.cmake.in to Generated/cmake/antlr4-runtime.cmake.in
Rendering template file from Cpp/cmake/Antlr4Package.md to Generated/cmake/Antlr4Package.md
Rendering template file from Cpp/cmake/ExternalAntlr4Cpp.cmake to Generated/cmake/ExternalAntlr4Cpp.cmake
Rendering template file from Cpp/cmake/FindANTLR.cmake to Generated/cmake/FindANTLR.cmake
Rendering template file from Cpp/cmake/README.md to Generated/cmake/README.md
Rendering template file from Cpp/CMakeLists.txt to Generated/CMakeLists.txt
Rendering template file from Cpp/ErrorListener.cpp to Generated/ErrorListener.cpp
Rendering template file from Cpp/ErrorListener.h to Generated/ErrorListener.h
Rendering template file from Cpp/makefile to Generated/makefile
Rendering template file from Cpp/Program.cpp to Generated/Program.cpp
Rendering template file from Cpp/readme.md to Generated/readme.md
Rendering template file from Cpp/test.sh to Generated/test.sh
Copying source file from /mnt/c/msys64/home/Kenne/ques/r3/rename.txt to Generated/rename.txt
Copying source file from /mnt/c/msys64/home/Kenne/ques/r3/in.txt to Generated/in.txt
Copying source file from /mnt/c/msys64/home/Kenne/ques/r3/Cypher.txt to Generated/Cypher.txt
Copying source file from /mnt/c/msys64/home/Kenne/ques/r3/Cypher.g4 to Generated/Cypher.g4
ken@DESKTOP-DL44R7B:/mnt/c/msys64/home/Kenne/ques/r3$ cd Generated/
ken@DESKTOP-DL44R7B:/mnt/c/msys64/home/Kenne/ques/r3/Generated$ make
if [ -f transformGrammar.py ]; then python3 transformGrammar.py ; fi
rm -rf build; mkdir build; cd build; cmake .. ; make
CMake Warning (dev) in CMakeLists.txt:
  No project() command is present.  The top-level CMakeLists.txt file must
  contain a literal, direct call to the project() command.  Add a line of
  code such as

    project(ProjectName)

  near the top of the file, but after cmake_minimum_required().

  CMake is pretending there is a "project(Project)" command on the first
  line.
This warning is for project developers.  Use -Wno-dev to suppress it.

-- The C compiler identification is GNU 9.4.0
-- The CXX compiler identification is GNU 9.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Check if compiler accepts -pthread
-- Check if compiler accepts -pthread - yes
-- Found Threads: TRUE
-- Found ANTLR: /tmp/antlr4-4.11.1-complete.jar (found version "4.1")
-- Configuring done
-- Generating done
-- Build files have been written to: /mnt/c/msys64/home/Kenne/ques/r3/Generated/build
make[1]: Entering directory '/mnt/c/msys64/home/Kenne/ques/r3/Generated/build'
make[1]: Warning: File 'Makefile' has modification time 3.5 s in the future
make[2]: Entering directory '/mnt/c/msys64/home/Kenne/ques/r3/Generated/build'
make[2]: Warning: File 'CMakeFiles/Makefile2' has modification time 3.6 s in the future
make[3]: Entering directory '/mnt/c/msys64/home/Kenne/ques/r3/Generated/build'
make[3]: Warning: File 'CMakeFiles/antlr4_runtime.dir/progress.make' has modification time 3.5 s in the future
Scanning dependencies of target antlr4_runtime
make[3]: warning:  Clock skew detected.  Your build may be incomplete.
make[3]: Leaving directory '/mnt/c/msys64/home/Kenne/ques/r3/Generated/build'
make[3]: Entering directory '/mnt/c/msys64/home/Kenne/ques/r3/Generated/build'
make[3]: Warning: File 'CMakeFiles/antlr4_runtime.dir/progress.make' has modification time 3.4 s in the future
[  6%] Performing update step for 'antlr4_runtime'
[ 12%] Performing configure step for 'antlr4_runtime'
loading initial cache file /tmp/antlr4_runtime/tmp/antlr4_runtime-cache-.cmake
CMake Warning at CMakeLists.txt:10 (message):
  Build type not set, falling back to Release mode.

   To specify build type use:
   -DCMAKE_BUILD_TYPE=<mode> where <mode> is Debug or Release.

-- Building without demo. To enable demo build use: -DWITH_DEMO=True
CMake Deprecation Warning at CMakeLists.txt:42 (CMAKE_POLICY):
  The OLD behavior for policy CMP0054 will be removed from a future version
  of CMake.

  The cmake-policies(7) manual explains that the OLD behaviors of all
  policies are deprecated and that a policy should be set to OLD only under
  specific short-term circumstances.  Projects should be ported to the NEW
  behavior and not rely on setting a policy to OLD.

CMake Deprecation Warning at CMakeLists.txt:43 (CMAKE_POLICY):
  The OLD behavior for policy CMP0045 will be removed from a future version
  of CMake.

  The cmake-policies(7) manual explains that the OLD behaviors of all
  policies are deprecated and that a policy should be set to OLD only under
  specific short-term circumstances.  Projects should be ported to the NEW
  behavior and not rely on setting a policy to OLD.

CMake Deprecation Warning at CMakeLists.txt:44 (CMAKE_POLICY):
  The OLD behavior for policy CMP0042 will be removed from a future version
  of CMake.

  The cmake-policies(7) manual explains that the OLD behaviors of all
  policies are deprecated and that a policy should be set to OLD only under
  specific short-term circumstances.  Projects should be ported to the NEW
  behavior and not rely on setting a policy to OLD.

CMake Deprecation Warning at CMakeLists.txt:49 (CMAKE_POLICY):
  The OLD behavior for policy CMP0059 will be removed from a future version
  of CMake.

  The cmake-policies(7) manual explains that the OLD behaviors of all
  policies are deprecated and that a policy should be set to OLD only under
  specific short-term circumstances.  Projects should be ported to the NEW
  behavior and not rely on setting a policy to OLD.

CMake Deprecation Warning at CMakeLists.txt:50 (CMAKE_POLICY):
  The OLD behavior for policy CMP0054 will be removed from a future version
  of CMake.

  The cmake-policies(7) manual explains that the OLD behaviors of all
  policies are deprecated and that a policy should be set to OLD only under
  specific short-term circumstances.  Projects should be ported to the NEW
  behavior and not rely on setting a policy to OLD.

-- Configuring done
-- Generating done
-- Build files have been written to: /tmp/antlr4_runtime/src/antlr4_runtime/runtime/Cpp
[ 18%] No build step for 'antlr4_runtime'
[ 25%] No install step for 'antlr4_runtime'
[ 31%] Completed 'antlr4_runtime'
make[3]: warning:  Clock skew detected.  Your build may be incomplete.
make[3]: Leaving directory '/mnt/c/msys64/home/Kenne/ques/r3/Generated/build'
[ 50%] Built target antlr4_runtime
make[3]: Entering directory '/mnt/c/msys64/home/Kenne/ques/r3/Generated/build'
make[3]: Warning: File 'CMakeFiles/antlr4_runtime-build_static.dir/progress.make' has modification time 2.9 s in the future
Scanning dependencies of target antlr4_runtime-build_static
make[3]: warning:  Clock skew detected.  Your build may be incomplete.
make[3]: Leaving directory '/mnt/c/msys64/home/Kenne/ques/r3/Generated/build'
make[3]: Entering directory '/mnt/c/msys64/home/Kenne/ques/r3/Generated/build'
make[3]: Warning: File 'CMakeFiles/antlr4_runtime-build_static.dir/progress.make' has modification time 2.8 s in the future
make[3]: warning:  Clock skew detected.  Your build may be incomplete.
make[3]: Leaving directory '/mnt/c/msys64/home/Kenne/ques/r3/Generated/build'
[ 56%] Built target antlr4_runtime-build_static
make[3]: Entering directory '/mnt/c/msys64/home/Kenne/ques/r3/Generated/build'
make[3]: Warning: File 'CMakeFiles/Test.dir/flags.make' has modification time 2.5 s in the future
[ 62%] Building Cypher with ANTLR 4.1
Scanning dependencies of target Test
make[3]: warning:  Clock skew detected.  Your build may be incomplete.
make[3]: Leaving directory '/mnt/c/msys64/home/Kenne/ques/r3/Generated/build'
make[3]: Entering directory '/mnt/c/msys64/home/Kenne/ques/r3/Generated/build'
make[3]: Warning: File 'CMakeFiles/Test.dir/depend.make' has modification time 3.3 s in the future
[ 68%] Building CXX object CMakeFiles/Test.dir/Program.cpp.o
[ 75%] Building CXX object CMakeFiles/Test.dir/ErrorListener.cpp.o
[ 81%] Building CXX object CMakeFiles/Test.dir/CaseChangingCharStream.cpp.o
[ 87%] Building CXX object CMakeFiles/Test.dir/antlr4cpp_generated_src/Cypher/CypherLexer.cpp.o
[ 93%] Building CXX object CMakeFiles/Test.dir/antlr4cpp_generated_src/Cypher/CypherParser.cpp.o
[100%] Linking CXX executable Test
make[3]: warning:  Clock skew detected.  Your build may be incomplete.
make[3]: Leaving directory '/mnt/c/msys64/home/Kenne/ques/r3/Generated/build'
[100%] Built target Test
make[2]: warning:  Clock skew detected.  Your build may be incomplete.
make[2]: Leaving directory '/mnt/c/msys64/home/Kenne/ques/r3/Generated/build'
make[1]: warning:  Clock skew detected.  Your build may be incomplete.
make[1]: Leaving directory '/mnt/c/msys64/home/Kenne/ques/r3/Generated/build'
ken@DESKTOP-DL44R7B:/mnt/c/msys64/home/Kenne/ques/r3/Generated$ trgen --version
trgen 0.17.0

ken@DESKTOP-DL44R7B:/mnt/c/msys64/home/Kenne/ques/r3/Generated$ ./build/Test -file in.txt
Parse succeeded.
Time: 0:00.100298
ken@DESKTOP-DL44R7B:/mnt/c/msys64/home/Kenne/ques/r3/Generated$ ./build/Test -file in.txt
Parse succeeded.
Time: 0:00.095476
ken@DESKTOP-DL44R7B:/mnt/c/msys64/home/Kenne/ques/r3/Generated$
kaby76 commented 2 years ago

I ran your driver (refactored for start symbol and grammar names), and the parser has no performance problem.

ken@DESKTOP-DL44R7B:/mnt/c/msys64/home/Kenne/ques/r3/Generated$ make
if [ -f transformGrammar.py ]; then python3 transformGrammar.py ; fi
rm -rf build; mkdir build; cd build; cmake .. ; make
CMake Warning (dev) in CMakeLists.txt:
  No project() command is present.  The top-level CMakeLists.txt file must
  contain a literal, direct call to the project() command.  Add a line of
  code such as

    project(ProjectName)

  near the top of the file, but after cmake_minimum_required().

  CMake is pretending there is a "project(Project)" command on the first
  line.
This warning is for project developers.  Use -Wno-dev to suppress it.

-- The C compiler identification is GNU 9.4.0
-- The CXX compiler identification is GNU 9.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Check if compiler accepts -pthread
-- Check if compiler accepts -pthread - yes
-- Found Threads: TRUE
-- Found ANTLR: /tmp/antlr4-4.11.1-complete.jar (found version "4.1")
-- Configuring done
-- Generating done
-- Build files have been written to: /mnt/c/msys64/home/Kenne/ques/r3/Generated/build
make[1]: Entering directory '/mnt/c/msys64/home/Kenne/ques/r3/Generated/build'
make[1]: Warning: File 'Makefile' has modification time 3.5 s in the future
make[2]: Entering directory '/mnt/c/msys64/home/Kenne/ques/r3/Generated/build'
make[2]: Warning: File 'CMakeFiles/Makefile2' has modification time 3.6 s in the future
make[3]: Entering directory '/mnt/c/msys64/home/Kenne/ques/r3/Generated/build'
make[3]: Warning: File 'CMakeFiles/antlr4_runtime.dir/progress.make' has modification time 3.5 s in the future
Scanning dependencies of target antlr4_runtime
make[3]: warning:  Clock skew detected.  Your build may be incomplete.
make[3]: Leaving directory '/mnt/c/msys64/home/Kenne/ques/r3/Generated/build'
make[3]: Entering directory '/mnt/c/msys64/home/Kenne/ques/r3/Generated/build'
make[3]: Warning: File 'CMakeFiles/antlr4_runtime.dir/progress.make' has modification time 3.4 s in the future
[  6%] Performing update step for 'antlr4_runtime'
[ 12%] Performing configure step for 'antlr4_runtime'
loading initial cache file /tmp/antlr4_runtime/tmp/antlr4_runtime-cache-.cmake
CMake Warning at CMakeLists.txt:10 (message):
  Build type not set, falling back to Release mode.

   To specify build type use:
   -DCMAKE_BUILD_TYPE=<mode> where <mode> is Debug or Release.

-- Building without demo. To enable demo build use: -DWITH_DEMO=True
CMake Deprecation Warning at CMakeLists.txt:42 (CMAKE_POLICY):
  The OLD behavior for policy CMP0054 will be removed from a future version
  of CMake.

  The cmake-policies(7) manual explains that the OLD behaviors of all
  policies are deprecated and that a policy should be set to OLD only under
  specific short-term circumstances.  Projects should be ported to the NEW
  behavior and not rely on setting a policy to OLD.

CMake Deprecation Warning at CMakeLists.txt:43 (CMAKE_POLICY):
  The OLD behavior for policy CMP0045 will be removed from a future version
  of CMake.

  The cmake-policies(7) manual explains that the OLD behaviors of all
  policies are deprecated and that a policy should be set to OLD only under
  specific short-term circumstances.  Projects should be ported to the NEW
  behavior and not rely on setting a policy to OLD.

CMake Deprecation Warning at CMakeLists.txt:44 (CMAKE_POLICY):
  The OLD behavior for policy CMP0042 will be removed from a future version
  of CMake.

  The cmake-policies(7) manual explains that the OLD behaviors of all
  policies are deprecated and that a policy should be set to OLD only under
  specific short-term circumstances.  Projects should be ported to the NEW
  behavior and not rely on setting a policy to OLD.

CMake Deprecation Warning at CMakeLists.txt:49 (CMAKE_POLICY):
  The OLD behavior for policy CMP0059 will be removed from a future version
  of CMake.

  The cmake-policies(7) manual explains that the OLD behaviors of all
  policies are deprecated and that a policy should be set to OLD only under
  specific short-term circumstances.  Projects should be ported to the NEW
  behavior and not rely on setting a policy to OLD.

CMake Deprecation Warning at CMakeLists.txt:50 (CMAKE_POLICY):
  The OLD behavior for policy CMP0054 will be removed from a future version
  of CMake.

  The cmake-policies(7) manual explains that the OLD behaviors of all
  policies are deprecated and that a policy should be set to OLD only under
  specific short-term circumstances.  Projects should be ported to the NEW
  behavior and not rely on setting a policy to OLD.

-- Configuring done
-- Generating done
-- Build files have been written to: /tmp/antlr4_runtime/src/antlr4_runtime/runtime/Cpp
[ 18%] No build step for 'antlr4_runtime'
[ 25%] No install step for 'antlr4_runtime'
[ 31%] Completed 'antlr4_runtime'
make[3]: warning:  Clock skew detected.  Your build may be incomplete.
make[3]: Leaving directory '/mnt/c/msys64/home/Kenne/ques/r3/Generated/build'
[ 50%] Built target antlr4_runtime
make[3]: Entering directory '/mnt/c/msys64/home/Kenne/ques/r3/Generated/build'
make[3]: Warning: File 'CMakeFiles/antlr4_runtime-build_static.dir/progress.make' has modification time 2.9 s in the future
Scanning dependencies of target antlr4_runtime-build_static
make[3]: warning:  Clock skew detected.  Your build may be incomplete.
make[3]: Leaving directory '/mnt/c/msys64/home/Kenne/ques/r3/Generated/build'
make[3]: Entering directory '/mnt/c/msys64/home/Kenne/ques/r3/Generated/build'
make[3]: Warning: File 'CMakeFiles/antlr4_runtime-build_static.dir/progress.make' has modification time 2.8 s in the future
make[3]: warning:  Clock skew detected.  Your build may be incomplete.
make[3]: Leaving directory '/mnt/c/msys64/home/Kenne/ques/r3/Generated/build'
[ 56%] Built target antlr4_runtime-build_static
make[3]: Entering directory '/mnt/c/msys64/home/Kenne/ques/r3/Generated/build'
make[3]: Warning: File 'CMakeFiles/Test.dir/flags.make' has modification time 2.5 s in the future
[ 62%] Building Cypher with ANTLR 4.1
Scanning dependencies of target Test
make[3]: warning:  Clock skew detected.  Your build may be incomplete.
make[3]: Leaving directory '/mnt/c/msys64/home/Kenne/ques/r3/Generated/build'
make[3]: Entering directory '/mnt/c/msys64/home/Kenne/ques/r3/Generated/build'
make[3]: Warning: File 'CMakeFiles/Test.dir/depend.make' has modification time 3.2 s in the future
[ 68%] Building CXX object CMakeFiles/Test.dir/Program.cpp.o
/mnt/c/msys64/home/Kenne/ques/r3/Generated/Program.cpp: In function ‘bool ParseCypher(const string&)’:
/mnt/c/msys64/home/Kenne/ques/r3/Generated/Program.cpp:118:1: warning: no return statement in function returning non-void [-Wreturn-type]
  118 | }
      | ^
[ 75%] Building CXX object CMakeFiles/Test.dir/ErrorListener.cpp.o
[ 81%] Building CXX object CMakeFiles/Test.dir/CaseChangingCharStream.cpp.o
[ 87%] Building CXX object CMakeFiles/Test.dir/antlr4cpp_generated_src/Cypher/CypherLexer.cpp.o
[ 93%] Building CXX object CMakeFiles/Test.dir/antlr4cpp_generated_src/Cypher/CypherParser.cpp.o
[100%] Linking CXX executable Test
make[3]: warning:  Clock skew detected.  Your build may be incomplete.
make[3]: Leaving directory '/mnt/c/msys64/home/Kenne/ques/r3/Generated/build'
[100%] Built target Test
make[2]: warning:  Clock skew detected.  Your build may be incomplete.
make[2]: Leaving directory '/mnt/c/msys64/home/Kenne/ques/r3/Generated/build'
make[1]: warning:  Clock skew detected.  Your build may be incomplete.
make[1]: Leaving directory '/mnt/c/msys64/home/Kenne/ques/r3/Generated/build'
ken@DESKTOP-DL44R7B:/mnt/c/msys64/home/Kenne/ques/r3/Generated$ ./build/Test
time:229817

<<<< SNIP >>>>

ken@DESKTOP-DL44R7B:/mnt/c/msys64/home/Kenne/ques/r3$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.4 LTS
Release:        20.04
Codename:       focal
ken@DESKTOP-DL44R7B:/mnt/c/msys64/home/Kenne/ques/r3$ g++ --version
g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

ken@DESKTOP-DL44R7B:/mnt/c/msys64/home/Kenne/ques/r3$ cd Generated/
ken@DESKTOP-DL44R7B:/mnt/c/msys64/home/Kenne/ques/r3/Generated$ grep antlr4-4 *.txt
CMakeLists.txt:set(ANTLR_EXECUTABLE "/tmp/antlr4-4.11.1-complete.jar")
ken@DESKTOP-DL44R7B:/mnt/c/msys64/home/Kenne/ques/r3/Generated$

Attached is the complete source, including grammar, make files, C++ code. Generated.zip

kaby76 commented 2 years ago

At this point, we need your grammar file after symbol renamings, preferably everything in a zip file. I followed your instructions for renaming, used your driver, input, and cannot reproduce the performance problem.

hagen666 commented 2 years ago

I'm glad you ran the driver. The parsing time differs obviously in my CentOS 7 and gcc 7.3.0. Some data is here: origin Cypher.g4 file: 240ms The file you provided: 577ms - 618ms The file only remove oC_ and only renameing rule 'Cypher:' to 'cypher:': 7000ms - 9172ms the slowest file is below:

/**
 * Copyright (c) 2015-2022 "Neo Technology,"
 * Network Engine for Objects in Lund AB [http://neotechnology.com]
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 *
 * Attribution Notice under the terms of the Apache License 2.0
 *
 * This work was created by the collective efforts of the openCypher community.
 * Without limiting the terms of Section 6, any Derivative Work that is not
 * approved by the public consensus process of the openCypher Implementers Group
 * should not be described as “Cypher” (and Cypher® is a registered trademark of
 * Neo4j Inc.) or as "openCypher". Extensions by implementers or prototypes or
 * proposals for change that have been documented or implemented should only be
 * described as "implementation extensions to Cypher" or as "proposed changes to
 * Cypher that are not yet approved by the openCypher community".
 */
grammar CCypher;

cypher
      :  SP? Statement ( SP? ';' )? SP? EOF ;

Statement
         :  Query ;

Query
     :  RegularQuery
         | StandaloneCall
         ;

RegularQuery
            :  SingleQuery ( SP? Union )* ;

Union
     :  ( UNION SP ALL SP? SingleQuery )
         | ( UNION SP? SingleQuery )
         ;

UNION : ( 'U' | 'u' ) ( 'N' | 'n' ) ( 'I' | 'i' ) ( 'O' | 'o' ) ( 'N' | 'n' ) ;

ALL : ( 'A' | 'a' ) ( 'L' | 'l' ) ( 'L' | 'l' ) ;

SingleQuery
           :  SinglePartQuery
               | MultiPartQuery
               ;

SinglePartQuery
               :  ( ( ReadingClause SP? )* Return )
                   | ( ( ReadingClause SP? )* UpdatingClause ( SP? UpdatingClause )* ( SP? Return )? )
                   ;

MultiPartQuery
              :  ( ( ReadingClause SP? )* ( UpdatingClause SP? )* With SP? )+ SinglePartQuery ;

UpdatingClause
              :  Create
                  | Merge
                  | Delete
                  | Set
                  | Remove
                  ;

ReadingClause
             :  Match
                 | Unwind
                 | InQueryCall
                 ;

Match
     :  ( OPTIONAL SP )? MATCH SP? Pattern ( SP? Where )? ;

OPTIONAL : ( 'O' | 'o' ) ( 'P' | 'p' ) ( 'T' | 't' ) ( 'I' | 'i' ) ( 'O' | 'o' ) ( 'N' | 'n' ) ( 'A' | 'a' ) ( 'L' | 'l' ) ;

MATCH : ( 'M' | 'm' ) ( 'A' | 'a' ) ( 'T' | 't' ) ( 'C' | 'c' ) ( 'H' | 'h' ) ;

Unwind
      :  UNWIND SP? Expression SP AS SP Variable ;

UNWIND : ( 'U' | 'u' ) ( 'N' | 'n' ) ( 'W' | 'w' ) ( 'I' | 'i' ) ( 'N' | 'n' ) ( 'D' | 'd' ) ;

AS : ( 'A' | 'a' ) ( 'S' | 's' ) ;

Merge
     :  MERGE SP? PatternPart ( SP MergeAction )* ;

MERGE : ( 'M' | 'm' ) ( 'E' | 'e' ) ( 'R' | 'r' ) ( 'G' | 'g' ) ( 'E' | 'e' ) ;

MergeAction
           :  ( ON SP MATCH SP Set )
               | ( ON SP CREATE SP Set )
               ;

ON : ( 'O' | 'o' ) ( 'N' | 'n' ) ;

CREATE : ( 'C' | 'c' ) ( 'R' | 'r' ) ( 'E' | 'e' ) ( 'A' | 'a' ) ( 'T' | 't' ) ( 'E' | 'e' ) ;

Create
      :  CREATE SP? Pattern ;

Set
   :  SET SP? SetItem ( SP? ',' SP? SetItem )* ;

SET : ( 'S' | 's' ) ( 'E' | 'e' ) ( 'T' | 't' ) ;

SetItem
       :  ( PropertyExpression SP? '=' SP? Expression )
           | ( Variable SP? '=' SP? Expression )
           | ( Variable SP? '+=' SP? Expression )
           | ( Variable SP? NodeLabels )
           ;

Delete
      :  ( DETACH SP )? DELETE SP? Expression ( SP? ',' SP? Expression )* ;

DETACH : ( 'D' | 'd' ) ( 'E' | 'e' ) ( 'T' | 't' ) ( 'A' | 'a' ) ( 'C' | 'c' ) ( 'H' | 'h' ) ;

DELETE : ( 'D' | 'd' ) ( 'E' | 'e' ) ( 'L' | 'l' ) ( 'E' | 'e' ) ( 'T' | 't' ) ( 'E' | 'e' ) ;

Remove
      :  REMOVE SP RemoveItem ( SP? ',' SP? RemoveItem )* ;

REMOVE : ( 'R' | 'r' ) ( 'E' | 'e' ) ( 'M' | 'm' ) ( 'O' | 'o' ) ( 'V' | 'v' ) ( 'E' | 'e' ) ;

RemoveItem
          :  ( Variable NodeLabels )
              | PropertyExpression
              ;

InQueryCall
           :  CALL SP ExplicitProcedureInvocation ( SP? YIELD SP YieldItems )? ;

CALL : ( 'C' | 'c' ) ( 'A' | 'a' ) ( 'L' | 'l' ) ( 'L' | 'l' ) ;

YIELD : ( 'Y' | 'y' ) ( 'I' | 'i' ) ( 'E' | 'e' ) ( 'L' | 'l' ) ( 'D' | 'd' ) ;

StandaloneCall
              :  CALL SP ( ExplicitProcedureInvocation | ImplicitProcedureInvocation ) ( SP? YIELD SP ( '*' | YieldItems ) )? ;

YieldItems
          :  YieldItem ( SP? ',' SP? YieldItem )* ( SP? Where )? ;

YieldItem
         :  ( ProcedureResultField SP AS SP )? Variable ;

With
    :  WITH ProjectionBody ( SP? Where )? ;

WITH : ( 'W' | 'w' ) ( 'I' | 'i' ) ( 'T' | 't' ) ( 'H' | 'h' ) ;

Return
      :  RETURN ProjectionBody ;

RETURN : ( 'R' | 'r' ) ( 'E' | 'e' ) ( 'T' | 't' ) ( 'U' | 'u' ) ( 'R' | 'r' ) ( 'N' | 'n' ) ;

ProjectionBody
              :  ( SP? DISTINCT )? SP ProjectionItems ( SP Order )? ( SP Skip )? ( SP Limit )? ;

DISTINCT : ( 'D' | 'd' ) ( 'I' | 'i' ) ( 'S' | 's' ) ( 'T' | 't' ) ( 'I' | 'i' ) ( 'N' | 'n' ) ( 'C' | 'c' ) ( 'T' | 't' ) ;

ProjectionItems
               :  ( '*' ( SP? ',' SP? ProjectionItem )* )
                   | ( ProjectionItem ( SP? ',' SP? ProjectionItem )* )
                   ;

ProjectionItem
              :  ( Expression SP AS SP Variable )
                  | Expression
                  ;

Order
     :  ORDER SP BY SP SortItem ( ',' SP? SortItem )* ;

ORDER : ( 'O' | 'o' ) ( 'R' | 'r' ) ( 'D' | 'd' ) ( 'E' | 'e' ) ( 'R' | 'r' ) ;

BY : ( 'B' | 'b' ) ( 'Y' | 'y' ) ;

Skip
    :  L_SKIP SP Expression ;

L_SKIP : ( 'S' | 's' ) ( 'K' | 'k' ) ( 'I' | 'i' ) ( 'P' | 'p' ) ;

Limit
     :  LIMIT SP Expression ;

LIMIT : ( 'L' | 'l' ) ( 'I' | 'i' ) ( 'M' | 'm' ) ( 'I' | 'i' ) ( 'T' | 't' ) ;

SortItem
        :  Expression ( SP? ( ASCENDING | ASC | DESCENDING | DESC ) )? ;

ASCENDING : ( 'A' | 'a' ) ( 'S' | 's' ) ( 'C' | 'c' ) ( 'E' | 'e' ) ( 'N' | 'n' ) ( 'D' | 'd' ) ( 'I' | 'i' ) ( 'N' | 'n' ) ( 'G' | 'g' ) ;

ASC : ( 'A' | 'a' ) ( 'S' | 's' ) ( 'C' | 'c' ) ;

DESCENDING : ( 'D' | 'd' ) ( 'E' | 'e' ) ( 'S' | 's' ) ( 'C' | 'c' ) ( 'E' | 'e' ) ( 'N' | 'n' ) ( 'D' | 'd' ) ( 'I' | 'i' ) ( 'N' | 'n' ) ( 'G' | 'g' ) ;

DESC : ( 'D' | 'd' ) ( 'E' | 'e' ) ( 'S' | 's' ) ( 'C' | 'c' ) ;

Where
     :  WHERE SP Expression ;

WHERE : ( 'W' | 'w' ) ( 'H' | 'h' ) ( 'E' | 'e' ) ( 'R' | 'r' ) ( 'E' | 'e' ) ;

Pattern
       :  PatternPart ( SP? ',' SP? PatternPart )* ;

PatternPart
           :  ( Variable SP? '=' SP? AnonymousPatternPart )
               | AnonymousPatternPart
               ;

AnonymousPatternPart
                    :  PatternElement ;

PatternElement
              :  ( NodePattern ( SP? PatternElementChain )* )
                  | ( '(' PatternElement ')' )
                  ;

RelationshipsPattern
                    :  NodePattern ( SP? PatternElementChain )+ ;

NodePattern
           :  '(' SP? ( Variable SP? )? ( NodeLabels SP? )? ( Properties SP? )? ')' ;

PatternElementChain
                   :  RelationshipPattern SP? NodePattern ;

RelationshipPattern
                   :  ( LeftArrowHead SP? Dash SP? RelationshipDetail? SP? Dash SP? RightArrowHead )
                       | ( LeftArrowHead SP? Dash SP? RelationshipDetail? SP? Dash )
                       | ( Dash SP? RelationshipDetail? SP? Dash SP? RightArrowHead )
                       | ( Dash SP? RelationshipDetail? SP? Dash )
                       ;

RelationshipDetail
                  :  '[' SP? ( Variable SP? )? ( RelationshipTypes SP? )? RangeLiteral? ( Properties SP? )? ']' ;

Properties
          :  MapLiteral
              | Parameter
              ;

RelationshipTypes
                 :  ':' SP? RelTypeName ( SP? '|' ':'? SP? RelTypeName )* ;

NodeLabels
          :  NodeLabel ( SP? NodeLabel )* ;

NodeLabel
         :  ':' SP? LabelName ;

RangeLiteral
            :  '*' SP? ( IntegerLiteral SP? )? ( '..' SP? ( IntegerLiteral SP? )? )? ;

LabelName
         :  SchemaName ;

RelTypeName
           :  SchemaName ;

PropertyExpression
                  :  Atom ( SP? PropertyLookup )+ ;

Expression
          :  OrExpression ;

OrExpression
            :  XorExpression ( SP OR SP XorExpression )* ;

OR : ( 'O' | 'o' ) ( 'R' | 'r' ) ;

XorExpression
             :  AndExpression ( SP XOR SP AndExpression )* ;

XOR : ( 'X' | 'x' ) ( 'O' | 'o' ) ( 'R' | 'r' ) ;

AndExpression
             :  NotExpression ( SP AND SP NotExpression )* ;

AND : ( 'A' | 'a' ) ( 'N' | 'n' ) ( 'D' | 'd' ) ;

NotExpression
             :  ( NOT SP? )* ComparisonExpression ;

NOT : ( 'N' | 'n' ) ( 'O' | 'o' ) ( 'T' | 't' ) ;

ComparisonExpression
                    :  StringListNullPredicateExpression ( SP? PartialComparisonExpression )* ;

PartialComparisonExpression
                           :  ( '=' SP? StringListNullPredicateExpression )
                               | ( '<>' SP? StringListNullPredicateExpression )
                               | ( '<' SP? StringListNullPredicateExpression )
                               | ( '>' SP? StringListNullPredicateExpression )
                               | ( '<=' SP? StringListNullPredicateExpression )
                               | ( '>=' SP? StringListNullPredicateExpression )
                               ;

StringListNullPredicateExpression
                                 :  AddOrSubtractExpression ( StringPredicateExpression | ListPredicateExpression | NullPredicateExpression )* ;

StringPredicateExpression
                         :  ( ( SP STARTS SP WITH ) | ( SP ENDS SP WITH ) | ( SP CONTAINS ) ) SP? AddOrSubtractExpression ;

STARTS : ( 'S' | 's' ) ( 'T' | 't' ) ( 'A' | 'a' ) ( 'R' | 'r' ) ( 'T' | 't' ) ( 'S' | 's' ) ;

ENDS : ( 'E' | 'e' ) ( 'N' | 'n' ) ( 'D' | 'd' ) ( 'S' | 's' ) ;

CONTAINS : ( 'C' | 'c' ) ( 'O' | 'o' ) ( 'N' | 'n' ) ( 'T' | 't' ) ( 'A' | 'a' ) ( 'I' | 'i' ) ( 'N' | 'n' ) ( 'S' | 's' ) ;

ListPredicateExpression
                       :  SP IN SP? AddOrSubtractExpression ;

IN : ( 'I' | 'i' ) ( 'N' | 'n' ) ;

NullPredicateExpression
                       :  ( SP IS SP NULL_VALUE )
                           | ( SP IS SP NOT SP NULL_VALUE )
                           ;

IS : ( 'I' | 'i' ) ( 'S' | 's' ) ;

NULL_VALUE : ( 'N' | 'n' ) ( 'U' | 'u' ) ( 'L' | 'l' ) ( 'L' | 'l' ) ;

AddOrSubtractExpression
                       :  MultiplyDivideModuloExpression ( ( SP? '+' SP? MultiplyDivideModuloExpression ) | ( SP? '-' SP? MultiplyDivideModuloExpression ) )* ;

MultiplyDivideModuloExpression
                              :  PowerOfExpression ( ( SP? '*' SP? PowerOfExpression ) | ( SP? '/' SP? PowerOfExpression ) | ( SP? '%' SP? PowerOfExpression ) )* ;

PowerOfExpression
                 :  UnaryAddOrSubtractExpression ( SP? '^' SP? UnaryAddOrSubtractExpression )* ;

UnaryAddOrSubtractExpression
                            :  ListOperatorExpression
                                | ( ( '+' | '-' ) SP? ListOperatorExpression )
                                ;

ListOperatorExpression
                      :  PropertyOrLabelsExpression ( ( SP? '[' Expression ']' ) | ( SP? '[' Expression? '..' Expression? ']' ) )* ;

PropertyOrLabelsExpression
                          :  Atom ( SP? PropertyLookup )* ( SP? NodeLabels )? ;

PropertyLookup
              :  '.' SP? ( PropertyKeyName ) ;

Atom
    :  Literal
        | Parameter
        | CaseExpression
        | ( COUNT SP? '(' SP? '*' SP? ')' )
        | ListComprehension
        | PatternComprehension
        | Quantifier
        | PatternPredicate
        | ParenthesizedExpression
        | FunctionInvocation
        | ExistentialSubquery
        | Variable
        ;

COUNT : ( 'C' | 'c' ) ( 'O' | 'o' ) ( 'U' | 'u' ) ( 'N' | 'n' ) ( 'T' | 't' ) ;

CaseExpression
              :  ( ( CASE ( SP? CaseAlternative )+ ) | ( CASE SP? Expression ( SP? CaseAlternative )+ ) ) ( SP? ELSE SP? Expression )? SP? END ;

CASE : ( 'C' | 'c' ) ( 'A' | 'a' ) ( 'S' | 's' ) ( 'E' | 'e' ) ;

ELSE : ( 'E' | 'e' ) ( 'L' | 'l' ) ( 'S' | 's' ) ( 'E' | 'e' ) ;

END : ( 'E' | 'e' ) ( 'N' | 'n' ) ( 'D' | 'd' ) ;

CaseAlternative
               :  WHEN SP? Expression SP? THEN SP? Expression ;

WHEN : ( 'W' | 'w' ) ( 'H' | 'h' ) ( 'E' | 'e' ) ( 'N' | 'n' ) ;

THEN : ( 'T' | 't' ) ( 'H' | 'h' ) ( 'E' | 'e' ) ( 'N' | 'n' ) ;

ListComprehension
                 :  '[' SP? FilterExpression ( SP? '|' SP? Expression )? SP? ']' ;

PatternComprehension
                    :  '[' SP? ( Variable SP? '=' SP? )? RelationshipsPattern SP? ( Where SP? )? '|' SP? Expression SP? ']' ;

Quantifier
          :  ( ALL SP? '(' SP? FilterExpression SP? ')' )
              | ( ANY SP? '(' SP? FilterExpression SP? ')' )
              | ( NONE SP? '(' SP? FilterExpression SP? ')' )
              | ( SINGLE SP? '(' SP? FilterExpression SP? ')' )
              ;

ANY : ( 'A' | 'a' ) ( 'N' | 'n' ) ( 'Y' | 'y' ) ;

NONE : ( 'N' | 'n' ) ( 'O' | 'o' ) ( 'N' | 'n' ) ( 'E' | 'e' ) ;

SINGLE : ( 'S' | 's' ) ( 'I' | 'i' ) ( 'N' | 'n' ) ( 'G' | 'g' ) ( 'L' | 'l' ) ( 'E' | 'e' ) ;

FilterExpression
                :  IdInColl ( SP? Where )? ;

PatternPredicate
                :  RelationshipsPattern ;

ParenthesizedExpression
                       :  '(' SP? Expression SP? ')' ;

IdInColl
        :  Variable SP IN SP Expression ;

FunctionInvocation
                  :  FunctionName SP? '(' SP? ( DISTINCT SP? )? ( Expression SP? ( ',' SP? Expression SP? )* )? ')' ;

FunctionName
            :  Namespace SymbolicName ;

ExistentialSubquery
                   :  EXISTS SP? '{' SP? ( RegularQuery | ( Pattern ( SP? Where )? ) ) SP? '}' ;

EXISTS : ( 'E' | 'e' ) ( 'X' | 'x' ) ( 'I' | 'i' ) ( 'S' | 's' ) ( 'T' | 't' ) ( 'S' | 's' ) ;

ExplicitProcedureInvocation
                           :  ProcedureName SP? '(' SP? ( Expression SP? ( ',' SP? Expression SP? )* )? ')' ;

ImplicitProcedureInvocation
                           :  ProcedureName ;

ProcedureResultField
                    :  SymbolicName ;

ProcedureName
             :  Namespace SymbolicName ;

Namespace
         :  ( SymbolicName '.' )* ;

Variable
        :  SymbolicName ;

Literal
       :  BooleanLiteral
           | NULL_VALUE
           | NumberLiteral
           | StringLiteral
           | ListLiteral
           | MapLiteral
           ;

BooleanLiteral
              :  TRUE
                  | FALSE
                  ;

TRUE : ( 'T' | 't' ) ( 'R' | 'r' ) ( 'U' | 'u' ) ( 'E' | 'e' ) ;

FALSE : ( 'F' | 'f' ) ( 'A' | 'a' ) ( 'L' | 'l' ) ( 'S' | 's' ) ( 'E' | 'e' ) ;

NumberLiteral
             :  DoubleLiteral
                 | IntegerLiteral
                 ;

IntegerLiteral
              :  HexInteger
                  | OctalInteger
                  | DecimalInteger
                  ;

HexInteger
          :  '0x' ( HexDigit )+ ;

DecimalInteger
              :  ZeroDigit
                  | ( NonZeroDigit ( Digit )* )
                  ;

OctalInteger
            :  '0o' ( OctDigit )+ ;

HexLetter
         :  ( ( 'A' | 'a' ) )
             | ( ( 'B' | 'b' ) )
             | ( ( 'C' | 'c' ) )
             | ( ( 'D' | 'd' ) )
             | ( ( 'E' | 'e' ) )
             | ( ( 'F' | 'f' ) )
             ;

HexDigit
        :  Digit
            | HexLetter
            ;

Digit
     :  ZeroDigit
         | NonZeroDigit
         ;

NonZeroDigit
            :  NonZeroOctDigit
                | '8'
                | '9'
                ;

NonZeroOctDigit
               :  '1'
                   | '2'
                   | '3'
                   | '4'
                   | '5'
                   | '6'
                   | '7'
                   ;

OctDigit
        :  ZeroDigit
            | NonZeroOctDigit
            ;

ZeroDigit
         :  '0' ;

DoubleLiteral
             :  ExponentDecimalReal
                 | RegularDecimalReal
                 ;

ExponentDecimalReal
                   :  ( ( Digit )+ | ( ( Digit )+ '.' ( Digit )+ ) | ( '.' ( Digit )+ ) ) ( ( 'E' | 'e' ) ) '-'? ( Digit )+ ;

RegularDecimalReal
                  :  ( Digit )* '.' ( Digit )+ ;

StringLiteral
             :  ( '"' ( StringLiteral_0 | EscapedChar )* '"' )
                 | ( '\'' ( StringLiteral_1 | EscapedChar )* '\'' )
                 ;

EscapedChar
           :  '\\' ( '\\' | '\'' | '"' | ( ( 'B' | 'b' ) ) | ( ( 'F' | 'f' ) ) | ( ( 'N' | 'n' ) ) | ( ( 'R' | 'r' ) ) | ( ( 'T' | 't' ) ) | ( ( ( 'U' | 'u' ) ) ( HexDigit HexDigit HexDigit HexDigit ) ) | ( ( ( 'U' | 'u' ) ) ( HexDigit HexDigit HexDigit HexDigit HexDigit HexDigit HexDigit HexDigit ) ) ) ;

ListLiteral
           :  '[' SP? ( Expression SP? ( ',' SP? Expression SP? )* )? ']' ;

MapLiteral
          :  '{' SP? ( PropertyKeyName SP? ':' SP? Expression SP? ( ',' SP? PropertyKeyName SP? ':' SP? Expression SP? )* )? '}' ;

PropertyKeyName
               :  SchemaName ;

Parameter
         :  '$' ( SymbolicName | DecimalInteger ) ;

SchemaName
          :  SymbolicName
              | ReservedWord
              ;

ReservedWord
            :  ALL
                | ASC
                | ASCENDING
                | BY
                | CREATE
                | DELETE
                | DESC
                | DESCENDING
                | DETACH
                | EXISTS
                | LIMIT
                | MATCH
                | MERGE
                | ON
                | OPTIONAL
                | ORDER
                | REMOVE
                | RETURN
                | SET
                | L_SKIP
                | WHERE
                | WITH
                | UNION
                | UNWIND
                | AND
                | AS
                | CONTAINS
                | DISTINCT
                | ENDS
                | IN
                | IS
                | NOT
                | OR
                | STARTS
                | XOR
                | FALSE
                | TRUE
                | NULL_VALUE
                | CONSTRAINT
                | DO
                | FOR
                | REQUIRE
                | UNIQUE
                | CASE
                | WHEN
                | THEN
                | ELSE
                | END
                | MANDATORY
                | SCALAR
                | OF
                | ADD
                | DROP
                ;

CONSTRAINT : ( 'C' | 'c' ) ( 'O' | 'o' ) ( 'N' | 'n' ) ( 'S' | 's' ) ( 'T' | 't' ) ( 'R' | 'r' ) ( 'A' | 'a' ) ( 'I' | 'i' ) ( 'N' | 'n' ) ( 'T' | 't' ) ;

DO : ( 'D' | 'd' ) ( 'O' | 'o' ) ;

FOR : ( 'F' | 'f' ) ( 'O' | 'o' ) ( 'R' | 'r' ) ;

REQUIRE : ( 'R' | 'r' ) ( 'E' | 'e' ) ( 'Q' | 'q' ) ( 'U' | 'u' ) ( 'I' | 'i' ) ( 'R' | 'r' ) ( 'E' | 'e' ) ;

UNIQUE : ( 'U' | 'u' ) ( 'N' | 'n' ) ( 'I' | 'i' ) ( 'Q' | 'q' ) ( 'U' | 'u' ) ( 'E' | 'e' ) ;

MANDATORY : ( 'M' | 'm' ) ( 'A' | 'a' ) ( 'N' | 'n' ) ( 'D' | 'd' ) ( 'A' | 'a' ) ( 'T' | 't' ) ( 'O' | 'o' ) ( 'R' | 'r' ) ( 'Y' | 'y' ) ;

SCALAR : ( 'S' | 's' ) ( 'C' | 'c' ) ( 'A' | 'a' ) ( 'L' | 'l' ) ( 'A' | 'a' ) ( 'R' | 'r' ) ;

OF : ( 'O' | 'o' ) ( 'F' | 'f' ) ;

ADD : ( 'A' | 'a' ) ( 'D' | 'd' ) ( 'D' | 'd' ) ;

DROP : ( 'D' | 'd' ) ( 'R' | 'r' ) ( 'O' | 'o' ) ( 'P' | 'p' ) ;

SymbolicName
            :  UnescapedSymbolicName
                | EscapedSymbolicName
                | HexLetter
                | COUNT
                | FILTER
                | EXTRACT
                | ANY
                | NONE
                | SINGLE
                ;

FILTER : ( 'F' | 'f' ) ( 'I' | 'i' ) ( 'L' | 'l' ) ( 'T' | 't' ) ( 'E' | 'e' ) ( 'R' | 'r' ) ;

EXTRACT : ( 'E' | 'e' ) ( 'X' | 'x' ) ( 'T' | 't' ) ( 'R' | 'r' ) ( 'A' | 'a' ) ( 'C' | 'c' ) ( 'T' | 't' ) ;

UnescapedSymbolicName
                     :  IdentifierStart ( IdentifierPart )* ;

/**
 * Based on the unicode identifier and pattern syntax
 *   (http://www.unicode.org/reports/tr31/)
 * And extended with a few characters.
 */
IdentifierStart
               :  ID_Start
                   | Pc
                   ;

/**
 * Based on the unicode identifier and pattern syntax
 *   (http://www.unicode.org/reports/tr31/)
 * And extended with a few characters.
 */
IdentifierPart
              :  ID_Continue
                  | Sc
                  ;

/**
 * Any character except "`", enclosed within `backticks`. Backticks are escaped with double backticks.
 */
EscapedSymbolicName
                   :  ( '`' ( EscapedSymbolicName_0 )* '`' )+ ;

SP
  :  ( WHITESPACE )+ ;

WHITESPACE
          :  SPACE
              | TAB
              | LF
              | VT
              | FF
              | CR
              | FS
              | GS
              | RS
              | US
              | '\u1680'
              | '\u180e'
              | '\u2000'
              | '\u2001'
              | '\u2002'
              | '\u2003'
              | '\u2004'
              | '\u2005'
              | '\u2006'
              | '\u2008'
              | '\u2009'
              | '\u200a'
              | '\u2028'
              | '\u2029'
              | '\u205f'
              | '\u3000'
              | '\u00a0'
              | '\u2007'
              | '\u202f'
              | Comment
              ;

Comment
       :  ( '/*' ( Comment_1 | ( '*' Comment_2 ) )* '*/' )
           | ( '//' ( Comment_3 )* CR? ( LF | EOF ) )
           ;

LeftArrowHead
             :  '<'
                 | '\u27e8'
                 | '\u3008'
                 | '\ufe64'
                 | '\uff1c'
                 ;

RightArrowHead
              :  '>'
                  | '\u27e9'
                  | '\u3009'
                  | '\ufe65'
                  | '\uff1e'
                  ;

Dash
    :  '-'
        | '\u00ad'
        | '\u2010'
        | '\u2011'
        | '\u2012'
        | '\u2013'
        | '\u2014'
        | '\u2015'
        | '\u2212'
        | '\ufe58'
        | '\ufe63'
        | '\uff0d'
        ;

fragment FF : [\f] ;

fragment EscapedSymbolicName_0 : ~[`] ;

fragment RS : [\u001E] ;

fragment ID_Continue : [\p{ID_Continue}] ;

fragment Comment_1 : ~[*] ;

fragment StringLiteral_1 : ~['\\] ;

fragment Comment_3 : ~[\n\r] ;

fragment Comment_2 : ~[/] ;

fragment GS : [\u001D] ;

fragment FS : [\u001C] ;

fragment CR : [\r] ;

fragment Sc : [\p{Sc}] ;

fragment SPACE : [ ] ;

fragment Pc : [\p{Pc}] ;

fragment TAB : [\t] ;

fragment StringLiteral_0 : ~["\\] ;

fragment LF : [\n] ;

fragment VT : [\u000B] ;

fragment US : [\u001F] ;

fragment ID_Start : [\p{ID_Start}] ;
kaby76 commented 2 years ago

Thanks for that grammar. In the future, please provide the complete code rather than just instructions and incomplete source code. The instructions were not clear, and the drivers were missing #include and using statements.

I have now reproduced the performance issue. Attached is a zip file containing all sources and results of tests on my machine. issue-3890.zip

Methods

=======
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.4 LTS
Release:    20.04
Codename:   focal
=======
MemTotal:        8103208 kB
=======
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   48 bits physical, 48 bits virtual
CPU(s):                          16
On-line CPU(s) list:             0-15
Thread(s) per core:              2
Core(s) per socket:              8
Socket(s):                       1
Vendor ID:                       AuthenticAMD
CPU family:                      23
Model:                           8
Model name:                      AMD Ryzen 7 2700 Eight-Core Processor
Stepping:                        2
CPU MHz:                         3194.111
BogoMIPS:                        6388.22
Virtualization:                  AMD-V
Hypervisor vendor:               Microsoft
Virtualization type:             full
L1d cache:                       256 KiB
L1i cache:                       512 KiB
L2 cache:                        4 MiB
L3 cache:                        16 MiB
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full AMD retpoline, IBPB conditional, STIBP disabled, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr virt_ssbd arat npt nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload
=======
openjdk 11.0.16 2022-07-19
OpenJDK Runtime Environment (build 11.0.16+8-post-Ubuntu-0ubuntu120.04)
OpenJDK 64-Bit Server VM (build 11.0.16+8-post-Ubuntu-0ubuntu120.04, mixed mode, sharing)
=======
g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

=======
cmake version 3.16.3

CMake suite maintained and supported by Kitware (kitware.com/cmake).
=======
ANTLR Parser Generator  Version 4.11.1
=======
GNU Make 4.2.1
Built for x86_64-pc-linux-gnu
Copyright (C) 1988-2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

I constructed 12 different programs using your modified grammar CCypher.g4 and the original one Cypher.g4.

Results

Below are times for input, across all 12 combinations of grammars and drivers, and an option for "warm-up".

Program Time (M ± SD) Time w/ warm-up (M ± SD)
h+cpp 4.84 ± 0.30 n.a.
h+java 2.53 ± 0.20 n.a.
h+m+cpp 4.54 ± 0.28 0.076 ± 0.004
h+m+java 2.44 ± 0.13 0.078 ± 0.002
h+s+cpp 4.37 ± 0.14 0.095 ± 0.006
h+s+java 2.54 ± 0.24 0.114 ± 0.013
o+cpp 0.218 ± 0.003 n.a.
o+java 0.807 ± 0.039 n.a.
o+m+cpp 1.36 ± 0.04 1.35 ± 0.01
o+m+java 1.21 ± 0.10 1.17 ± 0.08
o+s+cpp 1.34 ± 0.11 1.33 ± 0.05
o+s+java 1.55 ± 0.14 1.61 ± 0.08

Discussion

Renaming parser symbols to other parser symbols should not have any impact on parsing speed as long as the renaming is one-to-one, and does not introduce ambiguities. But, that is not what was done here. This refactoring is the renaming of most parser symbols into lexer symbols.

Renaming parser symbols into lexer symbols isn't usually possible without major changes to the rules. But for Cypher.g4, it's easy to do because there aren't any tokens that are "skip" or "channel(HIDDEN)", whereas most grammars (~90% of the grammars in grammars-v4) do. While it's not something one would typically do, it's not out of the question (see this comment in StackOverflow on combining grammars).

Before the renaming, the parser runs reasonably fast over the entire input (~1.3s for both C++ and Java). Warm-up has no effect on the parse, which is surprising.

After the renaming, the parser runs noticeably slower (~4.5s C++, ~2.5s Java). Warm-up has a huge impact, running much faster (~0.1s for both C++ and Java). It's unclear why warm-up has a large impact on parsing speed. Based on these data, warm-up may help the speed of the parse.

Unfortunately, the Antlr runtime does not collect performance information for the lexer. This would be a valuable feature to add.

hagen666 commented 2 years ago
Yep! It's exciting that you reproduce the performance issue. Program Time (M ± SD) Time w/ warm-up (M ± SD)
h+cpp 4.84 ± 0.30 n.a.
o+cpp 0.218 ± 0.003 n.a.
h+s+cpp 4.37 ± 0.14 0.095 ± 0.006
o+s+cpp 1.34 ± 0.11 1.33 ± 0.05

It seems that the ranaming make the parser symbols to lexer symbols, and causes the performance issue.

hagen666 commented 2 years ago

By the way, I want to know how to add the 'warm-up' option.