cwbaker / lalr

LALR(1) parser for C++
MIT License
78 stars 11 forks source link

Parse arrays in the JSON example #32

Closed cwbaker closed 1 year ago

cwbaker commented 1 year ago

Fixes problem reported as part of #11.

mingodad commented 1 year ago

Nice to see that you are fixing the reported issues ! Still waiting for the BitArray to land with improved memory usage and enhanced performance.

cwbaker commented 1 year ago

Nice to see that you are fixing the reported issues! Still waiting for the BitArray to land with improved memory usage and enhanced performance.

You should give the latest main a try. Those changes and more efficient algorithms to generate states and propagate lookaheads are in place since 7f1b4c97. Compiling the PostgreSQL grammar takes ~7s now vs. Bisons ~2s on my local machine.

I think most of the issues reported in #11 are addressed. I'll reply to those things I didn't fix in the issue itself, but tomorrow because it's getting late here.

mingodad commented 1 year ago

It seems that there is a problem with missing -lpthread:

forge
...
/tmp/lalr/debug/lib/liblalr_linux_native.a(ThreadPool.o): In function `std::thread::thread<void (lalr::ThreadPool::*)(), lalr::ThreadPool*, void>(void (lalr::ThreadPool::*&&)(), lalr::ThreadPool*&&)':
/usr/include/c++/9/thread:126: undefined reference to `pthread_create'
collect2: error: ld returned 1 exit status
/home/mingo/local/forge/lua/forge/init.lua:58: g++ -march=native -std=c++11 -g -o "/tmp/lalr/debug/bin/lalrc" -L "/tmp/lalr/debug/lib" "dot.o" "lalrc.o" -llalr_linux_native failed
forge: 'all' failed for lack of 'lalrc'.
...
mingodad commented 1 year ago

Inlining the trivial methods can reduce the time to parser postgresql to around 4s but the memory usage is still a bit high: Your last commit:

/usr/bin/time /home/mingo/dev/c/A_grammars/lalr-last0/release/bin/lalrc postgresql.g > /dev/null
6.98user 0.35system 0:04.32elapsed 169%CPU (0avgtext+0avgdata 889332maxresident)k
0inputs+0outputs (0major+233264minor)pagefaults 0swaps

With trivial methods inlined:

/usr/bin/time /home/mingo/dev/c/A_grammars/lalr-last/release/bin/lalrc postgresql.g > /dev/null
4.46user 0.39system 0:03.44elapsed 141%CPU (0avgtext+0avgdata 889216maxresident)k
0inputs+0outputs (0major+233233minor)pagefaults 0swaps

My previous changes using a BitArray, notice the memory usage:

/usr/bin/time /home/mingo/dev/c/A_grammars/lalr/lalr-nb/dist/Release/GNU-Linux/lalr-nb postgresql.g > /dev/null
21.67user 0.13system 0:21.81elapsed 99%CPU (0avgtext+0avgdata 270072maxresident)k
0inputs+0outputs (0major+66870minor)pagefaults 0swaps
mingodad commented 1 year ago

I'm not sure how to test this new json parser with https://github.com/tree-sitter/tree-sitter-verilog/blob/master/src/grammar.json (7.4MB) and https://github.com/tree-sitter/tree-sitter-cpp/blob/master/src/grammar.json (300KB) ?

With this json parser:

// /usr/bin/time test-grammar-nb -g json2.g -i BytecodeList.json
// /usr/bin/time test-grammar-nb -g json2.g -i grammar.json
json {
   %whitespace "[ \t\r\n]*";
   document: object | array [document] ;
   value: 'null' [value] | 'true' [value] | 'false' [value] | integer [value] | real [value] | string [value] | array [value] | object [value];
   object: '{' kv_list '}' [object];
   array: '[' value_list ']' [array];
   value_list :
    /*empty*/
    | value
    | value_list ',' value
    ;
   kv_list:
    /*empty*/
    | key_val
    | kv_list ',' key_val
    ;
   key_val:
    string ':' value [key_val]
    ;
   integer: "(\+|\-)?[0-9]+";
   real: "(\+|\-)?[0-9]+(\.[0-9]+)?((e|E)(\+|\-)?[0-9]+)?";
   string: "[\"](\\[\"\\]|[^\"\n])*[\"]|['](\\['\\]|[^'\n])*[']";
}

It parses fine tree-sitter-cpp:

/usr/bin/time ./grammar_test -g json2.g -i grammar.json 
read grammar: Time taken 0 seconds 0 milliseconds
Grammar size = 953
compile grammar: Time taken 0 seconds 6 milliseconds
read input: Time taken 0 seconds 0 milliseconds
Input size = 308231
parse input: Time taken 0 seconds 38 milliseconds
accepted = 1, full = 1
0.04user 0.00system 0:00.04elapsed 104%CPU (0avgtext+0avgdata 4212maxresident)k
0inputs+0outputs (0major+283minor)pagefaults 0swaps

But on parsing tree-sitter-verilog:

/usr/bin/time ./grammar_test -g json2.g -i grammar-big.json read grammar: Time taken 0 seconds 0 milliseconds
Grammar size = 953
compile grammar: Time taken 0 seconds 8 milliseconds
read input: Time taken 0 seconds 6 milliseconds
Input size = 7763660
lalr (5552): ERROR: Syntax error on '"' when expecting object, array, value, null_terminal, true_terminal, false_terminal, left_curly_brace_terminal, left_square_paren_terminal, integer, real, string
parse input: Time taken 0 seconds 17 milliseconds
accepted = 0, full = 0
0.02user 0.00system 0:00.03elapsed 105%CPU (0avgtext+0avgdata 11316maxresident)k
0inputs+0outputs (0major+2104minor)pagefaults 0swaps
mingodad commented 1 year ago

I have added this to help debug grammars:

----------------------------- src/lalr/Parser.hpp -----------------------------
index a9e2a69..85e7674 100644
@@ -79,6 +79,7 @@ public:
     void parse( Iterator start, Iterator finish );
     bool parse( const void* symbol, const std::basic_string<Char, Traits, Allocator>& lexeme, int line, int column );
     bool parse( const ParserSymbol* symbol, const std::basic_string<Char, Traits, Allocator>& lexeme, int line, int column );
+    void dumpLex( Iterator start, Iterator finish );

 private:
     const ParserTransition* find_transition( const ParserSymbol* symbol, const ParserState* state ) const;

----------------------------- src/lalr/Parser.ipp -----------------------------
index 30f5670..3e8c3cb 100644
@@ -146,6 +146,32 @@ const UserData& Parser<Iterator, UserData, Char, Traits, Allocator>::user_data()
     return user_data_.front();
 }

+template <class Iterator, class UserData, class Char, class Traits, class Allocator>
+void Parser<Iterator, UserData, Char, Traits, Allocator>::dumpLex( Iterator start, Iterator finish )
+{
+    LALR_ASSERT( state_machine_ );
+
+    reset();
+    lexer_.reset( start, finish );    
+    lexer_.advance();
+    while ( !lexer_.full() )
+    {
+        const ParserSymbol* symbol = reinterpret_cast<const ParserSymbol*>( lexer_.symbol() );
+        //printf("%d:%d:%d:%d:[%s]:[%s]:[%s]\n", lexer_.line(), lexer_.column(), symbol->type, symbol->index, symbol->identifier, symbol->lexeme, lexer_.lexeme().c_str());
+        //printf("%d:%d:%d:%d:[%s]\n", lexer_.line(), lexer_.column(), symbol->type, symbol->index, symbol->identifier);
+        //printf("%d:%d:%d:%d\n", lexer_.line(), lexer_.column(), symbol->type, symbol->index);
+        //printf("%d:%d:%p\n", lexer_.line(), lexer_.column(), symbol);
+        printf("%d:%d:", lexer_.line(), lexer_.column());
+        if(symbol) printf("%d:%d:[%s]:[%s]", symbol->type, symbol->index, symbol->identifier, symbol->lexeme);
+        else printf("-1:-1:[]:[]");
+        printf(":[%s]\n", lexer_.lexeme().c_str());
+        //fflush(stdout);
+        lexer_.advance();
+    }
+
+    full_ = lexer_.full();
+}
+

 /**
 // Get the Lexer that is being used by this Parser.

And then applied it to tree-sitter-verilog and got this:

/usr/bin/time ./grammar_test -g json2.g -i grammar-big.json -d > a.txt
...
grep '5552:' a.txt
5552:17:1:22:[string]:[[\"](\\[\"\\]|[^\"\n])*[\"]|['](\\['\\]|[^'\n])*[']]:["value"]
5552:24:1:19:[colon_terminal]:[:]:[:]
5552:26:-1:-1:[]:[]:["]
5552:27:-1:-1:[]:[]:[]
5552:31:-1:-1:[]:[]:["]

Here is the lines around 5552:

              {
                "type": "STRING",
                "value": "–>"  ///// 5552
              },
mingodad commented 1 year ago

If I change "value": "–>" to "value": "arrow" then it goes on to stop again on "value": "–" so there is something wrong in the way the string pattern ["](\["\]|[^"\n])*["] is converted/generated/interpreted because it seems to not accept punctuation marks inside it. And now that I transcribed the "[\"](\\[\"\\]|[^\"\n])*[\"]" pattern to ["](\["\]|[^"\n])*["] I noticed \] in the resulting pattern and then probably the correct way to write it is "[\"](\\[\"\\\\]|[^\"\n])*[\"]" to end up with ["](\["\\]|[^"\n])*["] but there is no error message about it and even using \\\\ in the pattern I'm still getting the same error on non alpha characters inside strings.

mingodad commented 1 year ago

After a while I found one of the problems is that it's using utf8 characters, comparing with the tree-sitter-cpp that also has an arrow symbol:

"value": "–>"  ///verilog
"value": "->"  ///cpp

Probably the verillog grammar author doesn't even know about it.

But any way how will lalr handle this situations ?

cwbaker commented 1 year ago

UTF-8 should be supported with a UTF-8 aware iterator. I say should only because I've never really tested it.

See #93 for the JSON example updated to parse a UTF-8 encoded JSON file.

Roughly the following gets it going, note the use of istreambuf_iterator<char32_t> to load char32_t values from an ifstream and codecvt<char32_t, char, std::mbstate_t> to convert UTF-8 bytes into char32_t:

Parser<istreambuf_iterator<char32_t>, JsonUserData> parser( json_parser_state_machine );

// ...

using std::locale;
using std::codecvt;
using std::basic_ifstream;
using std::istreambuf_iterator;
std::basic_ifstream<char32_t> file( LALR_EXAMPLES "lalr_json_example.json", std::ios_base::binary );
file.imbue( locale(file.getloc(), new codecvt<char32_t, char, std::mbstate_t>) );
istreambuf_iterator<char32_t> input( file );
istreambuf_iterator<char32_t> input_end;

parser.parse( input, input_end );
LALR_ASSERT( parser.accepted() );
LALR_ASSERT( parser.full() );
mingodad commented 1 year ago

What if instead of managing in utf-32 I want to mange in utf-8 but interpret it as unsigned char, because right now the problem seem to be interpreting char as signed (remember on ARM char is unsigned).

mingodad commented 1 year ago

GrammarCompiler only accept signed char could it be a template or accept unsigned char ?

int GrammarCompiler::compile( const char* begin, const char* end, ErrorPolicy* error_policy )
mingodad commented 1 year ago

Forget the previous message for now, I'm more interested on parsing the input that contains utf-8 without need convert to utf-32.

mingodad commented 1 year ago

Ok ! Looking at your changes to allow utf-32 and using lalr::Parser<const unsigned char*, int> parser( compiler.parser_state_machine(), &error_policy_input ); then it parses the input containing utf-8 without need to convert to utf-32 .

Thank you !