BinaryMuse / toml-node

TOML parser for Node.js and the Browser. Parses TOML v0.4.0

http://binarymuse.github.io/toml-node/

MIT License

304 stars 25 forks source link

toml parsing is too slow #39

Open kuangyunsheng opened 7 years ago

kuangyunsheng commented 7 years ago

I compare toml-node and js-yaml to load toml/yaml file with a large array config.

demo.toml:

[[arr]] name = "abcdefg"

demo.yaml: arr:

name: "abcdefg"

and, repeat the same element 1000 times in each file.

then parse them each other with toml-node and js-yaml, the result is: toml costs 662 ms yaml costs 33 ms

the toml parsing is so slow

kuangyunsheng commented 7 years ago

and I try to parse the same toml file with toml-j0.4, it only cost 45 ms.

felix9 commented 7 years ago

This looks like it's mostly compile() being slow, and I think the best fix is to rewrite it.

@BinaryMuse do you mind if I rewrite compile.js? I'd probably also add more testing.

BinaryMuse commented 7 years ago

I opened an PR here to begin to address this: https://github.com/BinaryMuse/toml-node/pull/42

Some interesting findings so far.

felix9 commented 7 years ago

ok, it looks like the grammar is slow.

This is a benchmark on my laptop of parsing a realistic cargo.toml file from the Rust Cargo project:

parse cargo with js-yaml       x 9,848 ops/sec ±2.25% (85 runs sampled)
parse cargo with json          x 64,939 ops/sec ±0.80% (88 runs sampled)
parse cargo with toml-j0.4     x 2,833 ops/sec ±1.87% (84 runs sampled)
parse cargo with toml-node     x 170 ops/sec ±1.07% (80 runs sampled)

I modified toml.pegjs to remove all actions, so it's just doing parsing, and it's still worse than toml-j0.4 (which also uses pegjs).

parse cargo with toml-node     x 329 ops/sec ±1.79% (85 runs sampled)

So I'm now going to try to figure out what's slow in the grammar

felix9 commented 7 years ago

pegjs doesn't generate very efficient parsers. In particular, every rule is a function call, So rules like:

line
  = S* expr:expression S* comment* (NL+ / EOF)
S                = [ \t]

will call the S function in a loop to satisfy the *. Also, the pegjs --cache option increases the cost of each S call.

Rules like this are faster:

line
  = ws? expr:expression ws? comment* (NL+ / EOF)
ws                = [ \t]+

pegjs still matches [ \t] one char at a time, but it's in a while loop in the ws function.

There doesn't seem to be a way right now to get pegjs to use a /[ \t]+/ regex match instead of repeating /[ \t]/ in a loop.

So... rewriting the grammar slightly to reduce function calls will probably yield a fairly substantial speed increase, and then I'll look at other problems.

felix9 commented 7 years ago

I've gotten part of the way to toml-j0.4 performance for the cargo.toml case just by refactoring the grammar.

But there's a large performance gap that's due to toml-node calling line() and column() all the time. I'm not sure what to do about that yet. Options:

Instead of line() and column(), use offset(), which is cheap, and then recompute line/column in the error reporter. Unfortunately, pegjs-0.10 gets rid of offset() and replaces it with location(), which is as expensive as line() and column().
Fold compile into parse, so we don't have to save line/column.

1 seems fairly easy, I think I'll try that first.

felix9 commented 7 years ago

44 and #45 combined makes toml-node about half the speed of toml-j0.4 for parsing my cargo.toml test case.

Also removing the pegjs --cache option will make toml-node about the same speed as toml-j0.4. I think the pegjs caching is unnecessary for this grammar, but I haven't worked through it fully yet.

There are more things to improve; toml parsing should be about as fast as yaml parsing, and potentially faster.