Open kuangyunsheng opened 7 years ago
and I try to parse the same toml file with toml-j0.4, it only cost 45 ms.
This looks like it's mostly compile() being slow, and I think the best fix is to rewrite it.
@BinaryMuse do you mind if I rewrite compile.js? I'd probably also add more testing.
I opened an PR here to begin to address this: https://github.com/BinaryMuse/toml-node/pull/42
Some interesting findings so far.
ok, it looks like the grammar is slow.
This is a benchmark on my laptop of parsing a realistic cargo.toml file from the Rust Cargo project:
parse cargo with js-yaml x 9,848 ops/sec ±2.25% (85 runs sampled)
parse cargo with json x 64,939 ops/sec ±0.80% (88 runs sampled)
parse cargo with toml-j0.4 x 2,833 ops/sec ±1.87% (84 runs sampled)
parse cargo with toml-node x 170 ops/sec ±1.07% (80 runs sampled)
I modified toml.pegjs to remove all actions, so it's just doing parsing, and it's still worse than toml-j0.4 (which also uses pegjs).
parse cargo with toml-node x 329 ops/sec ±1.79% (85 runs sampled)
So I'm now going to try to figure out what's slow in the grammar
pegjs doesn't generate very efficient parsers. In particular, every rule is a function call, So rules like:
line
= S* expr:expression S* comment* (NL+ / EOF)
S = [ \t]
will call the S function in a loop to satisfy the *. Also, the pegjs --cache option increases the cost of each S call.
Rules like this are faster:
line
= ws? expr:expression ws? comment* (NL+ / EOF)
ws = [ \t]+
pegjs still matches [ \t]
one char at a time, but it's in a while loop in the ws function.
There doesn't seem to be a way right now to get pegjs to use a /[ \t]+/
regex match instead of repeating /[ \t]/
in a loop.
So... rewriting the grammar slightly to reduce function calls will probably yield a fairly substantial speed increase, and then I'll look at other problems.
I've gotten part of the way to toml-j0.4 performance for the cargo.toml case just by refactoring the grammar.
But there's a large performance gap that's due to toml-node calling line() and column() all the time. I'm not sure what to do about that yet. Options:
1 seems fairly easy, I think I'll try that first.
Also removing the pegjs --cache
option will make toml-node about the same speed as toml-j0.4. I think the pegjs caching is unnecessary for this grammar, but I haven't worked through it fully yet.
There are more things to improve; toml parsing should be about as fast as yaml parsing, and potentially faster.
I compare toml-node and js-yaml to load toml/yaml file with a large array config.
demo.toml:
[[arr]] name = "abcdefg"
demo.yaml: arr:
and, repeat the same element 1000 times in each file.
then parse them each other with toml-node and js-yaml, the result is: toml costs 662 ms yaml costs 33 ms
the toml parsing is so slow