BenHanson / gram_grep

Search text using a grammar, lexer, or straight regex. Chain searches for greater refinement.
Boost Software License 1.0
11 stars 2 forks source link

Improving performance #5

Open mingodad opened 1 year ago

mingodad commented 1 year ago

This project works really well so far, it has a bit of performance issue for big grammars like postgresql (see attached and the timings bellow).

See attached naive conversion of postgresql-16 grammar.

gram-grep-nb -f "postgres16.g" test.sql
Warning: Token "UMINUS" does not have a lexer definiton.
./test.sql(1):select a,b,c from t where a = 1;
./test.sql(1):select a,b,c from t where a = 1;
Matches: 2    Matching files: 1    Total files searched: 1
13.67user 0.09system 0:13.76elapsed 99%CPU (0avgtext+0avgdata 130824maxresident)k
0inputs+0outputs (0major+42066minor)pagefaults 0swaps

Using https://github.com/mingodad/lalr/tree/playground :

/usr/bin/time lalr-dad-base/test-grammar-nb/dist/Release/GNU-Linux/test-grammar-nb -g postgresql-16.g -i test-small.sql 
read grammar: Time taken 0 seconds 0 milliseconds
Grammar size = 139063
compile grammar: Time taken 3 seconds 73 milliseconds
Parser state machine stats:
    Symbols          : 1220
    Actions          : 0
    States           : 6218
    Transitions      : 959600
    Solved conflicts : shift/reduce = 578, reduce/reduce = 0
    Error recovery   : 0
Lexer state machine stats:
    Strings          : 0
    Actions          : 0
    States           : 2035
    Transitions      : 19428
read input: Time taken 0 seconds 0 milliseconds
Input size = 31
parse input: Time taken 0 seconds 0 milliseconds
default_handler_counters = call : 0, zero = 0
accepted = 1, full = 1
parse_toplevel
 |stmtmulti
 | |toplevel_stmt
 | | |stmt
 | | | |SelectStmt
 | | | | |select_no_parens
 | | | | | |simple_select
 | | | | | | |SELECT -> select
 | | | | | | |opt_all_clause
 | | | | | | |opt_target_list
 | | | | | | | |target_list
 | | | | | | | | |target_el
 | | | | | | | | | |a_expr
 | | | | | | | | | | |c_expr
 | | | | | | | | | | | |columnref
 | | | | | | | | | | | | |ColId
 | | | | | | | | | | | | | |IDENT -> a
 | | | | | | | | |comma_terminal -> ,
 | | | | | | | | |target_el
 | | | | | | | | | |a_expr
 | | | | | | | | | | |c_expr
 | | | | | | | | | | | |columnref
 | | | | | | | | | | | | |ColId
 | | | | | | | | | | | | | |IDENT -> b
 | | | | | | | | |comma_terminal -> ,
 | | | | | | | | |target_el
 | | | | | | | | | |a_expr
 | | | | | | | | | | |c_expr
 | | | | | | | | | | | |columnref
 | | | | | | | | | | | | |ColId
 | | | | | | | | | | | | | |IDENT -> c
 | | | | | | |into_clause
 | | | | | | |from_clause
 | | | | | | | |FROM -> from
 | | | | | | | |from_list
 | | | | | | | | |table_ref
 | | | | | | | | | |relation_expr
 | | | | | | | | | | |qualified_name
 | | | | | | | | | | | |ColId
 | | | | | | | | | | | | |IDENT -> t
 | | | | | | | | | |opt_alias_clause
 | | | | | | |where_clause
 | | | | | | | |WHERE -> where
 | | | | | | | |a_expr
 | | | | | | | | |c_expr
 | | | | | | | | | |columnref
 | | | | | | | | | | |ColId
 | | | | | | | | | | | |IDENT -> a
 | | | | | | | | |eq_terminal -> =
 | | | | | | | | |c_expr
 | | | | | | | | | |AexprConst
 | | | | | | | | | | |Iconst
 | | | | | | | | | | | |ICONST -> 1
 | | | | | | |group_clause
 | | | | | | |having_clause
 | | | | | | |window_clause
 | |semi_colon_terminal -> ;
 | |toplevel_stmt
 | | |stmt
2.92user 0.16system 0:01.55elapsed 199%CPU (0avgtext+0avgdata 374544maxresident)k
0inputs+0outputs (0major+103341minor)pagefaults 0swaps

Using bison:

/usr/bin/time mybison "postgresql16b2.y"
postgresql16b2.y:224.1-12: warning: deprecated directive: ‘%pure-parser’, use ‘%define api.pure’ [-Wdeprecated]
  224 | %pure-parser
      | ^~~~~~~~~~~~
      | %define api.pure
postgresql16b2.y:226.1-22: warning: deprecated directive: ‘%name-prefix="base_yy"’, use ‘%define api.prefix {base_yy}’ [-Wdeprecated]
  226 | %name-prefix="base_yy"
      | ^~~~~~~~~~~~~~~~~~~~~~
      | %define api.prefix {base_yy}
postgresql16b2.y: warning: fix-its can be applied.  Rerun with option '--update'. [-Wother]
2.31user 0.04system 0:02.30elapsed 102%CPU (0avgtext+0avgdata 18120maxresident)k
0inputs+5600outputs (0major+11701minor)pagefaults 0swaps

postgres16.g.zip

mingodad commented 1 year ago

With the script bellow we can build this project for wasm and run on node:

#!/bin/sh

umask 022

emsdk-env em++ \
    -std=c++17 -O2 -Wall -Wno-unused-function -pedantic \
    -sEXPORTED_FUNCTIONS=_main  \
    -sALLOW_MEMORY_GROWTH -s NODERAWFS=1 -sASSERTIONS \
    -sNO_DISABLE_EXCEPTION_CATCHING \
    main.cpp \
    -o gram_grep.js
/usr/bin/time node-env node gram_grep.js -f postgres16.g test.sql 
Warning: Token "UMINUS" does not have a lexer definiton.
./test.sql(1):select a,b,c from t where a = 1;
Matches: 1    Matching files: 1    Total files searched: 1
8.38user 0.09system 0:08.11elapsed 104%CPU (0avgtext+0avgdata 143268maxresident)k
0inputs+0outputs (0major+35815minor)pagefaults 0swaps
BenHanson commented 1 year ago

Better construction performance will be a longer term goal.

I wonder if it's worth supporting multiple state machine construction methods too, just in case faster construction methods prove to be unreliable.

mingodad commented 1 year ago

Using an index in parsertl::generator::build_table reduces the time to parse postgres16.g from 12s to 3s:

        static void build_table(const rules& rules_, const dfa& dfa_,
            const prod_vector& new_grammar_, const nt_info_vector& new_nt_info_,
            sm& sm_, std::string& warnings_)
        {
            const grammar& grammar_ = rules_.grammar();
            const std::size_t start_ = rules_.start();
            const std::size_t terminals_ = rules_.tokens_info().size();
            const std::size_t non_terminals_ = rules_.nt_locations().size();
            string_vector symbols_;
            const std::size_t columns_ = terminals_ + non_terminals_;
            std::size_t index_ = 0;
#ifdef PARSERTL_WITH_BUILD_TABLE_INDEX

            struct grammar_idx {
                dfa_size_t lhs, prod_lhs, rhs_back;
                const std::pair<symbol_vector, string>* rhs;
                grammar_idx(dfa_size_t the_lhs, dfa_size_t the_prod_lhs,
                    dfa_size_t the_rhs_back, const std::pair<symbol_vector,
                    string>* the_rhs):
                    lhs(the_lhs), prod_lhs(the_prod_lhs),
                    rhs_back(the_rhs_back), rhs(the_rhs)
                {}
            };
            std::vector<grammar_idx> new_grammar_idx_vec;
            new_grammar_idx_vec.reserve(new_grammar_.size());
            for (const auto& p_ : new_grammar_)
            {
                new_grammar_idx_vec.emplace_back(p_._production->_lhs, p_._lhs,
                        p_._rhs_indexes.back().second, &(p_._production->_rhs));
            }
#endif
            rules_.symbols(symbols_);
            sm_._columns = columns_;
            sm_._rows = dfa_.size();
            sm_.push();

            //int loop0_count = 0, loop1_count = 0, loop2_count = 0, loop3_count = 0, loop31_count = 0, loop4_count = 0;
            for (const auto& d_ : dfa_)
            {
                //++loop0_count;
                // shift and gotos
                for (const auto& tran_ : d_._transitions)
                {
                    //++loop1_count;
                    const dfa_size_t id_ = tran_.first;
                    entry lhs_ = sm_.at(index_, id_);
                    const entry rhs_((id_ < terminals_) ?
                        // TERMINAL
                        action::shift :
                        // NON_TERMINAL
                        action::go_to,
                        static_cast<id_type>(tran_.second));

                    if (fill_entry(rules_, d_._closure, symbols_,
                        lhs_, id_, rhs_, warnings_))
                        sm_.set(index_, id_, lhs_);
                }

                // reductions
                for (const auto& c_ : d_._closure)
                {
                    //++loop2_count;
                    const production& production_ = grammar_[c_.first];

                    if (production_._rhs.first.size() == c_.second)
                    {
                        char_vector follow_set_(terminals_, 0);

                        // config is reduction
#ifdef PARSERTL_WITH_BUILD_TABLE_INDEX
                        for (const auto& p_ : new_grammar_idx_vec)
                        {
                            //++loop3_count;
                            if (production_._lhs == p_.lhs &&
                                production_._rhs == *(p_.rhs) &&
                                index_ == p_.rhs_back)
                            {
                                //++loop31_count;
                                const std::size_t lhs_id_ = p_.prod_lhs;

                                set_union(follow_set_,
                                    new_nt_info_[lhs_id_]._follow_set);
                            }
                        }
#else
                        for (const auto& p_ : new_grammar_)
                        {
                            //++loop3_count;
                            if (production_._lhs == p_._production->_lhs &&
                                production_._rhs == p_._production->_rhs &&
                                index_ == p_._rhs_indexes.back().second)
                            {
                                //++loop31_count;
                                const std::size_t lhs_id_ = p_._lhs;

                                set_union(follow_set_,
                                    new_nt_info_[lhs_id_]._follow_set);
                            }
                        }
#endif
                        for (std::size_t id_ = 0, size_ = follow_set_.size();
                            id_ < size_; ++id_)
                        {
                            //++loop4_count;
                            if (!follow_set_[id_]) continue;

                            entry lhs_ = sm_.at(index_, id_);
                            const entry rhs_(production_._lhs == start_ ?
                                action::accept :
                                action::reduce,
                                static_cast<id_type>(production_._index));

                            if (fill_entry(rules_, d_._closure, symbols_,
                                lhs_, id_, rhs_, warnings_))
                                sm_.set(index_, id_, lhs_);
                        }
                    }
                }

                ++index_;
            }
            //printf("loop_count = %d : %d : %d : %d : %d : %d\n", loop0_count, loop1_count, loop2_count, loop3_count, loop31_count, loop4_count);
        }
mingodad commented 1 year ago

Also I added some measurements to help understand where the time and memory are used, bellow is the measurements for building postgres16.g grammar, notice how memory usage jumps from 17.477.632 to 90.550.272 after calling parsertl::generator::rewrite.

I had the impression that rewrite only applies to grammars using EBNF but postgres16.g doesn't use it.

printf("%s: Time taken %d seconds %d milliseconds, %'zd : %'zd : %'zd : %'zd\n", 
   title, msec/1000, msec%1000, getCurrentRSS(), getPeakRSS(), malloc_count, free_count);
build parser state machine start: Time taken 0 seconds 11 milliseconds, 5582848 : 5582848 : 121202 : 111468
build parser state machine start2: Time taken 0 seconds 0 milliseconds, 5582848 : 5582848 : 121204 : 111468
rules_.validate: Time taken 0 seconds 11 milliseconds, 5582848 : 5582848 : 121217 : 111479
build_dfa: Time taken 0 seconds 196 milliseconds, 17477632 : 17477632 : 1176878 : 1150361
rewrite: Time taken 0 seconds 209 milliseconds, 90550272 : 90550272 : 2371461 : 1350405
build_first_sets: Time taken 0 seconds 74 milliseconds, 90550272 : 90550272 : 2371461 : 1350405
build_follow_sets: Time taken 0 seconds 21 milliseconds, 90550272 : 90550272 : 2371461 : 1350405
build_table: Time taken 3 seconds 423 milliseconds, 115961856 : 115961856 : 2406554 : 1379276
copy_rules: Time taken 0 seconds 0 milliseconds, 115961856 : 115961856 : 2412870 : 1382495
build parser state machine end: Time taken 0 seconds 0 milliseconds, 115961856 : 115961856 : 2412870 : 1382495
Warning: Token "UMINUS" does not have a lexer definiton.
build user parser: Time taken 0 seconds 39 milliseconds, 81276928 : 115961856 : 2559197 : 2540137
mingodad commented 1 year ago

Adding a debug measure in parsertl::generator::rewrite and comparing two grammars:

            printf("new_grammar : %d : %d\n", (int)new_grammar_.size(), (int)dfa_.size());
            for (std::size_t sidx_ = 0, ssize_ = dfa_.size();  sidx_ != ssize_; ++sidx_)
            {
...
            }
            printf("new_grammar : %d : %d\n", (int)new_grammar_.size(), (int)dfa_.size());
cql.g
new_grammar : 0 : 1316
new_grammar : 13241 : 1316
postgres16.g
new_grammar : 0 : 6221
new_grammar : 482123 : 6221

What is the main factor to explain the huge difference between then ?

BenHanson commented 1 year ago

With your speed-up, is that due to cache locality? I'm unclear why it is so much faster.

With regards to the state machine construction, I am using section 9.7.1.4 LALR(1) by Converting to SLR(1) in the book "PARSING TECHNIQUES A Practical Guide" by Dick Grune and Ceriel J.H. Jacobs.

It is mentioned that only the look-ahead sets of reduce items in the inadequate states need to be computed, however this is not an optimisation I am doing and I suspect accounts for the large use of memory.

mingodad commented 1 year ago

Yes, my changes only take advantage of cache locality, I also have defined dfa_size_t to be uint16_t to improve it.

mingodad commented 1 year ago

I dumped the result from parsertl::generator::rewrite and found this:

Total unique production->_rhs   66 of 218 -> master parser
Total unique production->_rhs   596 of 13241 -> cql.g
Total unique production->_rhs   2227 of 482123 -> postgres16.g
mingodad commented 1 year ago

I did an output of all the entries that are used here :

                        for (const auto& p_ : new_grammar_)
                        {
                            //++loop3_count;
                            if (production_._lhs == p_._production->_lhs &&
                                production_._rhs == p_._production->_rhs &&
                                index_ == p_._rhs_indexes.back().second)
                            {
//// All p_ that satisfies the conditions above
                            }
                        }

And found that all of then will satisfy the conditions but in unknown order, if we could create new_grammar in parsertl::generator::rewrite ordered we could index it directly instead of going through a loop over all of then.

mingodad commented 1 year ago

Doing another experiment with indexing/sorting newgrammar I've got parsing postgres16.g down to 700ms (1.6s on wasm), it's a strange use of refined brute force.

        static void build_table(const rules& rules_, const dfa& dfa_,
            const prod_vector& new_grammar_, const nt_info_vector& new_nt_info_,
            sm& sm_, std::string& warnings_)
        {
            const grammar& grammar_ = rules_.grammar();
            const std::size_t start_ = rules_.start();
            const std::size_t terminals_ = rules_.tokens_info().size();
            const std::size_t non_terminals_ = rules_.nt_locations().size();
            string_vector symbols_;
            const std::size_t columns_ = terminals_ + non_terminals_;
            std::size_t index_ = 0;
#ifdef PARSERTL_WITH_BUILD_TABLE_INDEX

            struct grammar_idx {
                dfa_size_t lhs, prod_lhs, rhs_back;
                const std::pair<symbol_vector, string>* rhs;
                grammar_idx(dfa_size_t the_lhs, dfa_size_t the_prod_lhs,
                    dfa_size_t the_rhs_back, const std::pair<symbol_vector,
                    string>* the_rhs):
                    lhs(the_lhs), prod_lhs(the_prod_lhs),
                    rhs_back(the_rhs_back), rhs(the_rhs)
                {}
            };
            std::vector<grammar_idx> new_grammar_idx_vec;
            new_grammar_idx_vec.reserve(new_grammar_.size());
            for (const auto& p_ : new_grammar_)
            {
                new_grammar_idx_vec.emplace_back(p_._production->_lhs, p_._lhs,
                        p_._rhs_indexes.back().second, &(p_._production->_rhs));
            }

            auto cmp_grammar_idx = [] (grammar_idx const& a, grammar_idx const& b) {
                    if(a.lhs < b.lhs) return true;
                    if(a.lhs == b.lhs && a.rhs_back < b.rhs_back) return true;
                    return false;
                };
            std::sort(new_grammar_idx_vec.begin(), new_grammar_idx_vec.end(),
                cmp_grammar_idx);
#endif
            rules_.symbols(symbols_);
            sm_._columns = columns_;
            sm_._rows = dfa_.size();
            sm_.push();

#ifdef PARSERTL_WITH_BITSET
            char_vector follow_set_(terminals_);
#endif
            for (const auto& d_ : dfa_)
            {
                // shift and gotos
                for (const auto& tran_ : d_._transitions)
                {
                    const dfa_size_t id_ = tran_.first;
                    entry lhs_ = sm_.at(index_, id_);
                    const entry rhs_((id_ < terminals_) ?
                        // TERMINAL
                        action::shift :
                        // NON_TERMINAL
                        action::go_to,
                        static_cast<id_type>(tran_.second));

                    if (fill_entry(rules_, d_._closure, symbols_,
                        lhs_, id_, rhs_, warnings_))
                        sm_.set(index_, id_, lhs_);
                }

                // reductions
                for (const auto& c_ : d_._closure)
                {
                    const production& production_ = grammar_[c_.first];

                    if (production_._rhs.first.size() == c_.second)
                    {
#ifdef PARSERTL_WITH_BITSET
                        follow_set_.set_clear();
#else
                        char_vector follow_set_(terminals_, 0);
#endif

                        // config is reduction
#ifdef PARSERTL_WITH_BUILD_TABLE_INDEX
                        grammar_idx gidx_start(production_._lhs, 0, index_, nullptr);
                        auto it = std::lower_bound(new_grammar_idx_vec.begin(), new_grammar_idx_vec.end(), gidx_start,
                                cmp_grammar_idx);
                        while(it != new_grammar_idx_vec.end() && it->lhs == production_._lhs && index_ == it->rhs_back)
                        {
                            if (production_._rhs == *(it->rhs))
                            {
                                const std::size_t lhs_id_ = it->prod_lhs;

                                set_union(follow_set_,
                                    new_nt_info_[lhs_id_]._follow_set);
                            }
                            ++it;
                        }
#else
                        for (const auto& p_ : new_grammar_)
                        {
                            if (production_._lhs == p_._production->_lhs &&
                                production_._rhs == p_._production->_rhs &&
                                index_ == p_._rhs_indexes.back().second)
                            {
                                const std::size_t lhs_id_ = p_._lhs;

                                set_union(follow_set_,
                                    new_nt_info_[lhs_id_]._follow_set);
                            }
                        }
#endif
                        for (std::size_t id_ = 0, size_ = follow_set_.size();
                            id_ < size_; ++id_)
                        {
                            if (!follow_set_[id_]) continue;

                            entry lhs_ = sm_.at(index_, id_);
                            const entry rhs_(production_._lhs == start_ ?
                                action::accept :
                                action::reduce,
                                static_cast<id_type>(production_._index));

                            if (fill_entry(rules_, d_._closure, symbols_,
                                lhs_, id_, rhs_, warnings_))
                                sm_.set(index_, id_, lhs_);
                        }
                    }
                }

                ++index_;
            }
        }
mingodad commented 1 year ago

Here is the instrumented code output:

/usr/bin/time playground-nb/dist/Release/GNU-Linux/playground-nb "postgres16.g" test.sql
Starting: Time taken 0 seconds 3 milliseconds, 4005888 : 4005888 : 15040 : 14982
read user grammar: Time taken 0 seconds 0 milliseconds, 4005888 : 4005888 : 15040 : 14982
read input: Time taken 0 seconds 0 milliseconds, 4005888 : 4005888 : 15040 : 14982
build parser state machine start: Time taken 0 seconds 0 milliseconds, 4005888 : 4005888 : 16512 : 16279
build parser state machine start2: Time taken 0 seconds 0 milliseconds, 4005888 : 4005888 : 16514 : 16279
rules_.validate: Time taken 0 seconds 0 milliseconds, 4005888 : 4005888 : 16527 : 16290
build_dfa: Time taken 0 seconds 0 milliseconds, 4005888 : 4005888 : 17996 : 17437
rewrite: Time taken 0 seconds 0 milliseconds, 4005888 : 4005888 : 19239 : 17990
build_first_sets: Time taken 0 seconds 0 milliseconds, 4005888 : 4005888 : 19239 : 17990
build_follow_sets: Time taken 0 seconds 0 milliseconds, 4005888 : 4005888 : 19239 : 17990
build_table: Time taken 0 seconds 0 milliseconds, 4005888 : 4005888 : 19724 : 18349
copy_rules: Time taken 0 seconds 0 milliseconds, 4005888 : 4005888 : 19904 : 18449
build parser state machine end: Time taken 0 seconds 0 milliseconds, 4005888 : 4005888 : 19904 : 18449
build master parser: Time taken 0 seconds 5 milliseconds, 5025792 : 5025792 : 71521 : 70231
parse user grammar: Time taken 0 seconds 2 milliseconds, 5025792 : 5025792 : 71526 : 70235
Parser user grammar success: 1
build parser state machine start: Time taken 0 seconds 14 milliseconds, 5771264 : 5771264 : 122702 : 112911
build parser state machine start2: Time taken 0 seconds 0 milliseconds, 5771264 : 5771264 : 122704 : 112911
rules_.validate: Time taken 0 seconds 6 milliseconds, 5771264 : 5771264 : 122717 : 112922
build_dfa: Time taken 0 seconds 160 milliseconds, 12529664 : 12529664 : 1145870 : 1119296
rewrite: Time taken 0 seconds 172 milliseconds, 82157568 : 82157568 : 2340453 : 1319340
build_first_sets: Time taken 0 seconds 74 milliseconds, 82157568 : 82157568 : 2340453 : 1319340
build_follow_sets: Time taken 0 seconds 20 milliseconds, 82157568 : 82157568 : 2340453 : 1319340
build_table: Time taken 0 seconds 243 milliseconds, 103514112 : 103514112 : 2375546 : 1348211
copy_rules: Time taken 0 seconds 0 milliseconds, 103514112 : 103514112 : 2381862 : 1351430
build parser state machine end: Time taken 0 seconds 0 milliseconds, 103514112 : 103514112 : 2381862 : 1351430
Warning: Token "UMINUS" does not have a lexer definiton.
build user parser: Time taken 0 seconds 36 milliseconds, 69029888 : 103514112 : 2528189 : 2509072
Productions 3283.
Terminals 514.
NonTerminals 706.
States 6221.
Lexer States 1.
Lexer State0 127872.
Shift/Reduce conflicts resolved 1454.
Reduce/Reduce conflicts resolved 0.
parse input: Time taken 0 seconds 0 milliseconds, 69029888 : 103514112 : 2528194 : 2509076
Parser input success: 1
0.67user 0.07system 0:00.74elapsed 100%CPU (0avgtext+0avgdata 101088maxresident)k
0inputs+0outputs (0major+33611minor)pagefaults 0swaps
mingodad commented 1 year ago

I just created a monolith repository for the playground here https://github.com/mingodad/parsertl-playground and the online playground can be viewed here https://mingodad.github.io/parsertl-playground/playground/