mingodad opened 1 year ago
With the script below we can build this project for wasm and run it on node:
#!/bin/sh
umask 022
emsdk-env em++ \
    -std=c++17 -O2 -Wall -Wno-unused-function -pedantic \
    -sEXPORTED_FUNCTIONS=_main \
    -sALLOW_MEMORY_GROWTH -s NODERAWFS=1 -sASSERTIONS \
    -sNO_DISABLE_EXCEPTION_CATCHING \
    main.cpp \
    -o gram_grep.js
/usr/bin/time node-env node gram_grep.js -f postgres16.g test.sql
Warning: Token "UMINUS" does not have a lexer definiton.
./test.sql(1):select a,b,c from t where a = 1;
Matches: 1 Matching files: 1 Total files searched: 1
8.38user 0.09system 0:08.11elapsed 104%CPU (0avgtext+0avgdata 143268maxresident)k
0inputs+0outputs (0major+35815minor)pagefaults 0swaps
Better construction performance will be a longer-term goal.
I wonder if it's worth supporting multiple state machine construction methods too, just in case faster construction methods prove to be unreliable.
Using an index in parsertl::generator::build_table reduces the time to parse postgres16.g from 12s to 3s:
static void build_table(const rules& rules_, const dfa& dfa_,
    const prod_vector& new_grammar_, const nt_info_vector& new_nt_info_,
    sm& sm_, std::string& warnings_)
{
    const grammar& grammar_ = rules_.grammar();
    const std::size_t start_ = rules_.start();
    const std::size_t terminals_ = rules_.tokens_info().size();
    const std::size_t non_terminals_ = rules_.nt_locations().size();
    string_vector symbols_;
    const std::size_t columns_ = terminals_ + non_terminals_;
    std::size_t index_ = 0;
#ifdef PARSERTL_WITH_BUILD_TABLE_INDEX
    // Pre-extracted keys for each new_grammar_ entry, packed together
    // so the reduction scan below touches contiguous memory.
    struct grammar_idx {
        dfa_size_t lhs, prod_lhs, rhs_back;
        const std::pair<symbol_vector, string>* rhs;
        grammar_idx(dfa_size_t the_lhs, dfa_size_t the_prod_lhs,
            dfa_size_t the_rhs_back, const std::pair<symbol_vector,
            string>* the_rhs):
            lhs(the_lhs), prod_lhs(the_prod_lhs),
            rhs_back(the_rhs_back), rhs(the_rhs)
        {}
    };
    std::vector<grammar_idx> new_grammar_idx_vec;

    new_grammar_idx_vec.reserve(new_grammar_.size());

    for (const auto& p_ : new_grammar_)
    {
        new_grammar_idx_vec.emplace_back(p_._production->_lhs, p_._lhs,
            p_._rhs_indexes.back().second, &(p_._production->_rhs));
    }
#endif
    rules_.symbols(symbols_);
    sm_._columns = columns_;
    sm_._rows = dfa_.size();
    sm_.push();

    //int loop0_count = 0, loop1_count = 0, loop2_count = 0, loop3_count = 0, loop31_count = 0, loop4_count = 0;
    for (const auto& d_ : dfa_)
    {
        //++loop0_count;
        // shift and gotos
        for (const auto& tran_ : d_._transitions)
        {
            //++loop1_count;
            const dfa_size_t id_ = tran_.first;
            entry lhs_ = sm_.at(index_, id_);
            const entry rhs_((id_ < terminals_) ?
                // TERMINAL
                action::shift :
                // NON_TERMINAL
                action::go_to,
                static_cast<id_type>(tran_.second));

            if (fill_entry(rules_, d_._closure, symbols_,
                lhs_, id_, rhs_, warnings_))
                sm_.set(index_, id_, lhs_);
        }

        // reductions
        for (const auto& c_ : d_._closure)
        {
            //++loop2_count;
            const production& production_ = grammar_[c_.first];

            if (production_._rhs.first.size() == c_.second)
            {
                char_vector follow_set_(terminals_, 0);

                // config is reduction
#ifdef PARSERTL_WITH_BUILD_TABLE_INDEX
                for (const auto& p_ : new_grammar_idx_vec)
                {
                    //++loop3_count;
                    if (production_._lhs == p_.lhs &&
                        production_._rhs == *(p_.rhs) &&
                        index_ == p_.rhs_back)
                    {
                        //++loop31_count;
                        const std::size_t lhs_id_ = p_.prod_lhs;

                        set_union(follow_set_,
                            new_nt_info_[lhs_id_]._follow_set);
                    }
                }
#else
                for (const auto& p_ : new_grammar_)
                {
                    //++loop3_count;
                    if (production_._lhs == p_._production->_lhs &&
                        production_._rhs == p_._production->_rhs &&
                        index_ == p_._rhs_indexes.back().second)
                    {
                        //++loop31_count;
                        const std::size_t lhs_id_ = p_._lhs;

                        set_union(follow_set_,
                            new_nt_info_[lhs_id_]._follow_set);
                    }
                }
#endif
                for (std::size_t id_ = 0, size_ = follow_set_.size();
                    id_ < size_; ++id_)
                {
                    //++loop4_count;
                    if (!follow_set_[id_]) continue;

                    entry lhs_ = sm_.at(index_, id_);
                    const entry rhs_(production_._lhs == start_ ?
                        action::accept :
                        action::reduce,
                        static_cast<id_type>(production_._index));

                    if (fill_entry(rules_, d_._closure, symbols_,
                        lhs_, id_, rhs_, warnings_))
                        sm_.set(index_, id_, lhs_);
                }
            }
        }

        ++index_;
    }
    //printf("loop_count = %d : %d : %d : %d : %d : %d\n", loop0_count, loop1_count, loop2_count, loop3_count, loop31_count, loop4_count);
}
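The set_union helper used above isn't shown in this snippet. A minimal sketch of what it is assumed to do, treating char_vector as a std::vector<char> holding one flag per terminal (an assumption about parsertl's internals, not a quote from it):

#include <cstddef>
#include <vector>

using char_vector = std::vector<char>; // assumption: one flag per terminal

// Sketch only: OR each terminal flag of the source follow set into the
// destination, leaving the destination's existing flags intact.
static void set_union(char_vector& dest_, const char_vector& src_)
{
    for (std::size_t i_ = 0, size_ = dest_.size(); i_ < size_; ++i_)
        dest_[i_] |= src_[i_];
}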
Also I added some measurements to help understand where the time and memory are used; below are the measurements for building the postgres16.g grammar. Notice how memory usage jumps from 17,477,632 to 90,550,272 after calling parsertl::generator::rewrite. I had the impression that rewrite only applies to grammars using EBNF, but postgres16.g doesn't use it.
printf("%s: Time taken %d seconds %d milliseconds, %'zd : %'zd : %'zd : %'zd\n",
    title, msec/1000, msec%1000, getCurrentRSS(), getPeakRSS(), malloc_count, free_count);
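For context, a minimal sketch of the kind of checkpoint helper behind that printf; show_timing and its internals are hypothetical, and the RSS/allocation columns (getCurrentRSS(), getPeakRSS(), malloc_count, free_count) from the instrumented build are omitted here:

#include <chrono>
#include <cstdio>

// Hypothetical sketch: print the time elapsed since the previous
// checkpoint. The real instrumentation also prints current RSS,
// peak RSS and cumulative malloc/free counts.
static void show_timing(const char* title)
{
    using clock = std::chrono::steady_clock;
    static clock::time_point last_ = clock::now();
    const auto now_ = clock::now();
    const long msec = static_cast<long>(std::chrono::duration_cast<
        std::chrono::milliseconds>(now_ - last_).count());

    last_ = now_;
    printf("%s: Time taken %ld seconds %ld milliseconds\n",
        title, msec / 1000, msec % 1000);
}

The measurements: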
build parser state machine start: Time taken 0 seconds 11 milliseconds, 5582848 : 5582848 : 121202 : 111468
build parser state machine start2: Time taken 0 seconds 0 milliseconds, 5582848 : 5582848 : 121204 : 111468
rules_.validate: Time taken 0 seconds 11 milliseconds, 5582848 : 5582848 : 121217 : 111479
build_dfa: Time taken 0 seconds 196 milliseconds, 17477632 : 17477632 : 1176878 : 1150361
rewrite: Time taken 0 seconds 209 milliseconds, 90550272 : 90550272 : 2371461 : 1350405
build_first_sets: Time taken 0 seconds 74 milliseconds, 90550272 : 90550272 : 2371461 : 1350405
build_follow_sets: Time taken 0 seconds 21 milliseconds, 90550272 : 90550272 : 2371461 : 1350405
build_table: Time taken 3 seconds 423 milliseconds, 115961856 : 115961856 : 2406554 : 1379276
copy_rules: Time taken 0 seconds 0 milliseconds, 115961856 : 115961856 : 2412870 : 1382495
build parser state machine end: Time taken 0 seconds 0 milliseconds, 115961856 : 115961856 : 2412870 : 1382495
Warning: Token "UMINUS" does not have a lexer definiton.
build user parser: Time taken 0 seconds 39 milliseconds, 81276928 : 115961856 : 2559197 : 2540137
Adding a debug measure in parsertl::generator::rewrite and comparing two grammars:
printf("new_grammar : %d : %d\n", (int)new_grammar_.size(), (int)dfa_.size());
for (std::size_t sidx_ = 0, ssize_ = dfa_.size(); sidx_ != ssize_; ++sidx_)
{
    ...
}
printf("new_grammar : %d : %d\n", (int)new_grammar_.size(), (int)dfa_.size());
cql.g
new_grammar : 0 : 1316
new_grammar : 13241 : 1316
postgres16.g
new_grammar : 0 : 6221
new_grammar : 482123 : 6221
What is the main factor that explains the huge difference between them?
With your speed-up, is that due to cache locality? I'm unclear why it is so much faster.
With regards to the state machine construction, I am using section 9.7.1.4, "LALR(1) by Converting to SLR(1)", from the book "Parsing Techniques: A Practical Guide" by Dick Grune and Ceriel J.H. Jacobs.
It is mentioned there that only the look-ahead sets of reduce items in the inadequate states need to be computed; however, this is not an optimisation I am doing, and I suspect it accounts for the large use of memory.
Yes, my changes only take advantage of cache locality. I have also defined dfa_size_t to be uint16_t to improve it further.
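To illustrate the effect (grammar_idx_small below is a hypothetical stand-in for the grammar_idx entry above, not parsertl code):

#include <cstdint>

using dfa_size_t = std::uint16_t; // narrowed id type

// Hypothetical stand-in for the grammar_idx entry: three 16-bit keys
// plus one pointer is 16 bytes after padding on a typical 64-bit
// target, versus 32 bytes with std::size_t keys, so twice as many
// entries fit in each cache line during the linear scan.
struct grammar_idx_small
{
    dfa_size_t lhs, prod_lhs, rhs_back;
    const void* rhs;
};

static_assert(sizeof(grammar_idx_small) <= 16,
    "index entries should stay small for cache locality");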
I dumped the result from parsertl::generator::rewrite and found this:
Total unique production->_rhs 66 of 218 -> master parser
Total unique production->_rhs 596 of 13241 -> cql.g
Total unique production->_rhs 2227 of 482123 -> postgres16.g
I printed all the entries that are used here:
for (const auto& p_ : new_grammar_)
{
    //++loop3_count;
    if (production_._lhs == p_._production->_lhs &&
        production_._rhs == p_._production->_rhs &&
        index_ == p_._rhs_indexes.back().second)
    {
        //// All p_ that satisfy the conditions above
    }
}
And found that all of them satisfy the conditions, but in an unknown order. If we could create new_grammar in parsertl::generator::rewrite already ordered, we could index it directly instead of going through a loop over all of them.
Doing another experiment with indexing/sorting new_grammar, I've got parsing postgres16.g down to 700ms (1.6s on wasm); it's a strange use of refined brute force.
static void build_table(const rules& rules_, const dfa& dfa_,
    const prod_vector& new_grammar_, const nt_info_vector& new_nt_info_,
    sm& sm_, std::string& warnings_)
{
    const grammar& grammar_ = rules_.grammar();
    const std::size_t start_ = rules_.start();
    const std::size_t terminals_ = rules_.tokens_info().size();
    const std::size_t non_terminals_ = rules_.nt_locations().size();
    string_vector symbols_;
    const std::size_t columns_ = terminals_ + non_terminals_;
    std::size_t index_ = 0;
#ifdef PARSERTL_WITH_BUILD_TABLE_INDEX
    struct grammar_idx {
        dfa_size_t lhs, prod_lhs, rhs_back;
        const std::pair<symbol_vector, string>* rhs;
        grammar_idx(dfa_size_t the_lhs, dfa_size_t the_prod_lhs,
            dfa_size_t the_rhs_back, const std::pair<symbol_vector,
            string>* the_rhs):
            lhs(the_lhs), prod_lhs(the_prod_lhs),
            rhs_back(the_rhs_back), rhs(the_rhs)
        {}
    };
    std::vector<grammar_idx> new_grammar_idx_vec;

    new_grammar_idx_vec.reserve(new_grammar_.size());

    for (const auto& p_ : new_grammar_)
    {
        new_grammar_idx_vec.emplace_back(p_._production->_lhs, p_._lhs,
            p_._rhs_indexes.back().second, &(p_._production->_rhs));
    }

    // Order by (lhs, rhs_back) so that all candidates for a given
    // reduction can be located with a single binary search below.
    auto cmp_grammar_idx = [](grammar_idx const& a, grammar_idx const& b) {
        if (a.lhs < b.lhs) return true;
        if (a.lhs == b.lhs && a.rhs_back < b.rhs_back) return true;
        return false;
    };
    std::sort(new_grammar_idx_vec.begin(), new_grammar_idx_vec.end(),
        cmp_grammar_idx);
#endif
    rules_.symbols(symbols_);
    sm_._columns = columns_;
    sm_._rows = dfa_.size();
    sm_.push();
#ifdef PARSERTL_WITH_BITSET
    char_vector follow_set_(terminals_);
#endif
    for (const auto& d_ : dfa_)
    {
        // shift and gotos
        for (const auto& tran_ : d_._transitions)
        {
            const dfa_size_t id_ = tran_.first;
            entry lhs_ = sm_.at(index_, id_);
            const entry rhs_((id_ < terminals_) ?
                // TERMINAL
                action::shift :
                // NON_TERMINAL
                action::go_to,
                static_cast<id_type>(tran_.second));

            if (fill_entry(rules_, d_._closure, symbols_,
                lhs_, id_, rhs_, warnings_))
                sm_.set(index_, id_, lhs_);
        }

        // reductions
        for (const auto& c_ : d_._closure)
        {
            const production& production_ = grammar_[c_.first];

            if (production_._rhs.first.size() == c_.second)
            {
#ifdef PARSERTL_WITH_BITSET
                follow_set_.set_clear();
#else
                char_vector follow_set_(terminals_, 0);
#endif
                // config is reduction
#ifdef PARSERTL_WITH_BUILD_TABLE_INDEX
                // Binary search for the first entry with this lhs and
                // state index, then scan only the matching run.
                grammar_idx gidx_start(production_._lhs, 0, index_, nullptr);
                auto it = std::lower_bound(new_grammar_idx_vec.begin(),
                    new_grammar_idx_vec.end(), gidx_start, cmp_grammar_idx);

                while (it != new_grammar_idx_vec.end() &&
                    it->lhs == production_._lhs && index_ == it->rhs_back)
                {
                    if (production_._rhs == *(it->rhs))
                    {
                        const std::size_t lhs_id_ = it->prod_lhs;

                        set_union(follow_set_,
                            new_nt_info_[lhs_id_]._follow_set);
                    }

                    ++it;
                }
#else
                for (const auto& p_ : new_grammar_)
                {
                    if (production_._lhs == p_._production->_lhs &&
                        production_._rhs == p_._production->_rhs &&
                        index_ == p_._rhs_indexes.back().second)
                    {
                        const std::size_t lhs_id_ = p_._lhs;

                        set_union(follow_set_,
                            new_nt_info_[lhs_id_]._follow_set);
                    }
                }
#endif
                for (std::size_t id_ = 0, size_ = follow_set_.size();
                    id_ < size_; ++id_)
                {
                    if (!follow_set_[id_]) continue;

                    entry lhs_ = sm_.at(index_, id_);
                    const entry rhs_(production_._lhs == start_ ?
                        action::accept :
                        action::reduce,
                        static_cast<id_type>(production_._index));

                    if (fill_entry(rules_, d_._closure, symbols_,
                        lhs_, id_, rhs_, warnings_))
                        sm_.set(index_, id_, lhs_);
                }
            }
        }

        ++index_;
    }
}
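The lookup pattern above (sort once by (lhs, rhs_back), lower_bound to the first candidate, then walk the matching run) can be shown in isolation. Everything in this snippet is a made-up stand-alone example, not parsertl code:

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

// Stand-alone illustration of the sort + lower_bound pattern used in
// build_table above, with hypothetical keys and data.
struct key
{
    std::uint16_t lhs, rhs_back;
};

int main()
{
    // Already sorted by (lhs, rhs_back), as std::sort would leave it.
    std::vector<key> v{ {1, 2}, {1, 5}, {2, 3}, {2, 3}, {2, 7} };
    auto cmp = [](const key& a, const key& b) {
        return a.lhs < b.lhs || (a.lhs == b.lhs && a.rhs_back < b.rhs_back);
    };
    const key probe{ 2, 3 };
    // Jump to the first entry >= (2, 3)...
    auto it = std::lower_bound(v.begin(), v.end(), probe, cmp);

    // ...then visit only the run of exact matches.
    for (; it != v.end() && it->lhs == probe.lhs &&
        it->rhs_back == probe.rhs_back; ++it)
    {
        std::printf("match: %u %u\n", (unsigned)it->lhs,
            (unsigned)it->rhs_back);
    }

    return 0;
}

Because the comparator orders only on (lhs, rhs_back), the production_._rhs equality still has to be checked inside the loop, exactly as the real code above does.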
Here is the instrumented code output:
/usr/bin/time playground-nb/dist/Release/GNU-Linux/playground-nb "postgres16.g" test.sql
Starting: Time taken 0 seconds 3 milliseconds, 4005888 : 4005888 : 15040 : 14982
read user grammar: Time taken 0 seconds 0 milliseconds, 4005888 : 4005888 : 15040 : 14982
read input: Time taken 0 seconds 0 milliseconds, 4005888 : 4005888 : 15040 : 14982
build parser state machine start: Time taken 0 seconds 0 milliseconds, 4005888 : 4005888 : 16512 : 16279
build parser state machine start2: Time taken 0 seconds 0 milliseconds, 4005888 : 4005888 : 16514 : 16279
rules_.validate: Time taken 0 seconds 0 milliseconds, 4005888 : 4005888 : 16527 : 16290
build_dfa: Time taken 0 seconds 0 milliseconds, 4005888 : 4005888 : 17996 : 17437
rewrite: Time taken 0 seconds 0 milliseconds, 4005888 : 4005888 : 19239 : 17990
build_first_sets: Time taken 0 seconds 0 milliseconds, 4005888 : 4005888 : 19239 : 17990
build_follow_sets: Time taken 0 seconds 0 milliseconds, 4005888 : 4005888 : 19239 : 17990
build_table: Time taken 0 seconds 0 milliseconds, 4005888 : 4005888 : 19724 : 18349
copy_rules: Time taken 0 seconds 0 milliseconds, 4005888 : 4005888 : 19904 : 18449
build parser state machine end: Time taken 0 seconds 0 milliseconds, 4005888 : 4005888 : 19904 : 18449
build master parser: Time taken 0 seconds 5 milliseconds, 5025792 : 5025792 : 71521 : 70231
parse user grammar: Time taken 0 seconds 2 milliseconds, 5025792 : 5025792 : 71526 : 70235
Parser user grammar success: 1
build parser state machine start: Time taken 0 seconds 14 milliseconds, 5771264 : 5771264 : 122702 : 112911
build parser state machine start2: Time taken 0 seconds 0 milliseconds, 5771264 : 5771264 : 122704 : 112911
rules_.validate: Time taken 0 seconds 6 milliseconds, 5771264 : 5771264 : 122717 : 112922
build_dfa: Time taken 0 seconds 160 milliseconds, 12529664 : 12529664 : 1145870 : 1119296
rewrite: Time taken 0 seconds 172 milliseconds, 82157568 : 82157568 : 2340453 : 1319340
build_first_sets: Time taken 0 seconds 74 milliseconds, 82157568 : 82157568 : 2340453 : 1319340
build_follow_sets: Time taken 0 seconds 20 milliseconds, 82157568 : 82157568 : 2340453 : 1319340
build_table: Time taken 0 seconds 243 milliseconds, 103514112 : 103514112 : 2375546 : 1348211
copy_rules: Time taken 0 seconds 0 milliseconds, 103514112 : 103514112 : 2381862 : 1351430
build parser state machine end: Time taken 0 seconds 0 milliseconds, 103514112 : 103514112 : 2381862 : 1351430
Warning: Token "UMINUS" does not have a lexer definiton.
build user parser: Time taken 0 seconds 36 milliseconds, 69029888 : 103514112 : 2528189 : 2509072
Productions 3283.
Terminals 514.
NonTerminals 706.
States 6221.
Lexer States 1.
Lexer State0 127872.
Shift/Reduce conflicts resolved 1454.
Reduce/Reduce conflicts resolved 0.
parse input: Time taken 0 seconds 0 milliseconds, 69029888 : 103514112 : 2528194 : 2509076
Parser input success: 1
0.67user 0.07system 0:00.74elapsed 100%CPU (0avgtext+0avgdata 101088maxresident)k
0inputs+0outputs (0major+33611minor)pagefaults 0swaps
I just created a monolithic repository for the playground here https://github.com/mingodad/parsertl-playground and the online playground can be viewed here https://mingodad.github.io/parsertl-playground/playground/
This project works really well so far; it has a bit of a performance issue for big grammars like postgresql (see the attachment and the timings below).
See the attached naive conversion of the postgresql-16 grammar.
Using https://github.com/mingodad/lalr/tree/playground :
Using bison:
postgres16.g.zip