Closed: cwbaker closed this 1 year ago
Nice to see that you are fixing the reported issues! Still waiting for the `BitArray` to land with improved memory usage and enhanced performance.
> Nice to see that you are fixing the reported issues! Still waiting for the `BitArray` to land with improved memory usage and enhanced performance.
You should give the latest main a try. Those changes, and more efficient algorithms to generate states and propagate lookaheads, have been in place since 7f1b4c97. Compiling the PostgreSQL grammar takes ~7s now vs. Bison's ~2s on my local machine.
I think most of the issues reported in #11 are addressed. I'll reply in the issue itself about the things I didn't fix, but tomorrow, because it's getting late here.
It seems that there is a problem with a missing `-lpthread`:

```
forge
...
/tmp/lalr/debug/lib/liblalr_linux_native.a(ThreadPool.o): In function `std::thread::thread<void (lalr::ThreadPool::*)(), lalr::ThreadPool*, void>(void (lalr::ThreadPool::*&&)(), lalr::ThreadPool*&&)':
/usr/include/c++/9/thread:126: undefined reference to `pthread_create'
collect2: error: ld returned 1 exit status
/home/mingo/local/forge/lua/forge/init.lua:58: g++ -march=native -std=c++11 -g -o "/tmp/lalr/debug/bin/lalrc" -L "/tmp/lalr/debug/lib" "dot.o" "lalrc.o" -llalr_linux_native failed
forge: 'all' failed for lack of 'lalrc'.
...
```
Inlining the trivial methods can reduce the time to parse the PostgreSQL grammar to around 4s, but the memory usage is still a bit high. Your last commit:

```
/usr/bin/time /home/mingo/dev/c/A_grammars/lalr-last0/release/bin/lalrc postgresql.g > /dev/null
6.98user 0.35system 0:04.32elapsed 169%CPU (0avgtext+0avgdata 889332maxresident)k
0inputs+0outputs (0major+233264minor)pagefaults 0swaps
```
With trivial methods inlined:

```
/usr/bin/time /home/mingo/dev/c/A_grammars/lalr-last/release/bin/lalrc postgresql.g > /dev/null
4.46user 0.39system 0:03.44elapsed 141%CPU (0avgtext+0avgdata 889216maxresident)k
0inputs+0outputs (0major+233233minor)pagefaults 0swaps
```
My previous changes using a `BitArray`; notice the memory usage:

```
/usr/bin/time /home/mingo/dev/c/A_grammars/lalr/lalr-nb/dist/Release/GNU-Linux/lalr-nb postgresql.g > /dev/null
21.67user 0.13system 0:21.81elapsed 99%CPU (0avgtext+0avgdata 270072maxresident)k
0inputs+0outputs (0major+66870minor)pagefaults 0swaps
```
I'm not sure how to test this new JSON parser. Can it handle https://github.com/tree-sitter/tree-sitter-verilog/blob/master/src/grammar.json (7.4 MB) and https://github.com/tree-sitter/tree-sitter-cpp/blob/master/src/grammar.json (300 KB)?
With this JSON parser:

```
// /usr/bin/time test-grammar-nb -g json2.g -i BytecodeList.json
// /usr/bin/time test-grammar-nb -g json2.g -i grammar.json
json {
   %whitespace "[ \t\r\n]*";
   document: object | array [document] ;
   value: 'null' [value] | 'true' [value] | 'false' [value] | integer [value] | real [value] | string [value] | array [value] | object [value];
   object: '{' kv_list '}' [object];
   array: '[' value_list ']' [array];
   value_list :
      /*empty*/
      | value
      | value_list ',' value
      ;
   kv_list:
      /*empty*/
      | key_val
      | kv_list ',' key_val
      ;
   key_val:
      string ':' value [key_val]
      ;
   integer: "(\+|\-)?[0-9]+";
   real: "(\+|\-)?[0-9]+(\.[0-9]+)?((e|E)(\+|\-)?[0-9]+)?";
   string: "[\"](\\[\"\\]|[^\"\n])*[\"]|['](\\['\\]|[^'\n])*[']";
}
```
It parses the tree-sitter-cpp grammar fine:

```
/usr/bin/time ./grammar_test -g json2.g -i grammar.json
read grammar: Time taken 0 seconds 0 milliseconds
Grammar size = 953
compile grammar: Time taken 0 seconds 6 milliseconds
read input: Time taken 0 seconds 0 milliseconds
Input size = 308231
parse input: Time taken 0 seconds 38 milliseconds
accepted = 1, full = 1
0.04user 0.00system 0:00.04elapsed 104%CPU (0avgtext+0avgdata 4212maxresident)k
0inputs+0outputs (0major+283minor)pagefaults 0swaps
```
But on parsing tree-sitter-verilog:

```
/usr/bin/time ./grammar_test -g json2.g -i grammar-big.json
read grammar: Time taken 0 seconds 0 milliseconds
Grammar size = 953
compile grammar: Time taken 0 seconds 8 milliseconds
read input: Time taken 0 seconds 6 milliseconds
Input size = 7763660
lalr (5552): ERROR: Syntax error on '"' when expecting object, array, value, null_terminal, true_terminal, false_terminal, left_curly_brace_terminal, left_square_paren_terminal, integer, real, string
parse input: Time taken 0 seconds 17 milliseconds
accepted = 0, full = 0
0.02user 0.00system 0:00.03elapsed 105%CPU (0avgtext+0avgdata 11316maxresident)k
0inputs+0outputs (0major+2104minor)pagefaults 0swaps
```
I have added this to help debug grammars:

```diff
----------------------------- src/lalr/Parser.hpp -----------------------------
index a9e2a69..85e7674 100644
@@ -79,6 +79,7 @@ public:
     void parse( Iterator start, Iterator finish );
     bool parse( const void* symbol, const std::basic_string<Char, Traits, Allocator>& lexeme, int line, int column );
     bool parse( const ParserSymbol* symbol, const std::basic_string<Char, Traits, Allocator>& lexeme, int line, int column );
+    void dumpLex( Iterator start, Iterator finish );
 private:
     const ParserTransition* find_transition( const ParserSymbol* symbol, const ParserState* state ) const;
----------------------------- src/lalr/Parser.ipp -----------------------------
index 30f5670..3e8c3cb 100644
@@ -146,6 +146,32 @@ const UserData& Parser<Iterator, UserData, Char, Traits, Allocator>::user_data()
     return user_data_.front();
 }
+template <class Iterator, class UserData, class Char, class Traits, class Allocator>
+void Parser<Iterator, UserData, Char, Traits, Allocator>::dumpLex( Iterator start, Iterator finish )
+{
+    LALR_ASSERT( state_machine_ );
+
+    reset();
+    lexer_.reset( start, finish );
+    lexer_.advance();
+    while ( !lexer_.full() )
+    {
+        const ParserSymbol* symbol = reinterpret_cast<const ParserSymbol*>( lexer_.symbol() );
+        //printf("%d:%d:%d:%d:[%s]:[%s]:[%s]\n", lexer_.line(), lexer_.column(), symbol->type, symbol->index, symbol->identifier, symbol->lexeme, lexer_.lexeme().c_str());
+        //printf("%d:%d:%d:%d:[%s]\n", lexer_.line(), lexer_.column(), symbol->type, symbol->index, symbol->identifier);
+        //printf("%d:%d:%d:%d\n", lexer_.line(), lexer_.column(), symbol->type, symbol->index);
+        //printf("%d:%d:%p\n", lexer_.line(), lexer_.column(), symbol);
+        printf("%d:%d:", lexer_.line(), lexer_.column());
+        if(symbol) printf("%d:%d:[%s]:[%s]", symbol->type, symbol->index, symbol->identifier, symbol->lexeme);
+        else printf("-1:-1:[]:[]");
+        printf(":[%s]\n", lexer_.lexeme().c_str());
+        //fflush(stdout);
+        lexer_.advance();
+    }
+
+    full_ = lexer_.full();
+}
+
 /**
 // Get the Lexer that is being used by this Parser.
```
And then applied it to tree-sitter-verilog and got this:

```
/usr/bin/time ./grammar_test -g json2.g -i grammar-big.json -d > a.txt
...
grep '5552:' a.txt
5552:17:1:22:[string]:[[\"](\\[\"\\]|[^\"\n])*[\"]|['](\\['\\]|[^'\n])*[']]:["value"]
5552:24:1:19:[colon_terminal]:[:]:[:]
5552:26:-1:-1:[]:[]:["]
5552:27:-1:-1:[]:[]:[]
5552:31:-1:-1:[]:[]:["]
```
Here are the lines around 5552:

```
{
  "type": "STRING",
  "value": "–>" ///// 5552
},
```
If I change `"value": "–>"` to `"value": "arrow"`, then it goes on to stop again on `"value": "–"`, so there is something wrong in the way the string pattern `["](\["\]|[^"\n])*["]` is converted/generated/interpreted, because it seems not to accept punctuation marks inside it.
And now that I have transcribed the `"[\"](\\[\"\\]|[^\"\n])*[\"]"` pattern to `["](\["\]|[^"\n])*["]`, I notice the `\]` in the resulting pattern, so the correct way to write it is probably `"[\"](\\[\"\\\\]|[^\"\n])*[\"]"`, which ends up as `["](\["\\]|[^"\n])*["]`. But there is no error message about it, and even using `\\\\` in the pattern I'm still getting the same error on non-alphabetic characters inside strings.
After a while I found that one of the problems is that the file uses UTF-8 characters. Comparing with tree-sitter-cpp, which also has an arrow symbol:

```
"value": "–>" ///verilog
"value": "->" ///cpp
```

Probably the verilog grammar author doesn't even know about it.
But anyway, how will lalr handle these situations? UTF-8?

UTF-8 should be supported with a UTF-8 aware iterator. I say "should" only because I've never really tested it.
See #93 for the JSON example updated to parse a UTF-8 encoded JSON file.
Roughly the following gets it going. Note the use of `istreambuf_iterator<char32_t>` to load `char32_t` values from an `ifstream`, and `codecvt<char32_t, char, std::mbstate_t>` to convert UTF-8 bytes into `char32_t`:
```cpp
Parser<istreambuf_iterator<char32_t>, JsonUserData> parser( json_parser_state_machine );
// ...
using std::locale;
using std::codecvt;
using std::basic_ifstream;
using std::istreambuf_iterator;

std::basic_ifstream<char32_t> file( LALR_EXAMPLES "lalr_json_example.json", std::ios_base::binary );
file.imbue( locale(file.getloc(), new codecvt<char32_t, char, std::mbstate_t>) );
istreambuf_iterator<char32_t> input( file );
istreambuf_iterator<char32_t> input_end;
parser.parse( input, input_end );
LALR_ASSERT( parser.accepted() );
LALR_ASSERT( parser.full() );
```
What if, instead of managing the input as UTF-32, I want to manage it as UTF-8 but interpret it as `unsigned char`? Right now the problem seems to be interpreting `char` as signed (remember that on ARM plain `char` is unsigned). `GrammarCompiler` only accepts (signed) `char`:

```cpp
int GrammarCompiler::compile( const char* begin, const char* end, ErrorPolicy* error_policy )
```

Could it be a template, or accept `unsigned char`?
Forget the previous message for now; I'm more interested in parsing input that contains UTF-8 without needing to convert it to UTF-32.
OK! Looking at your changes to allow UTF-32, and using `lalr::Parser<const unsigned char*, int> parser( compiler.parser_state_machine(), &error_policy_input );`, it parses input containing UTF-8 without needing to convert to UTF-32. Thank you!
Fixes problem reported as part of #11.