Engelberg / instaparse

Eclipse Public License 1.0
2.74k stars 149 forks source link

consider using mutable data structures to improve performance #197

Closed carocad closed 4 years ago

carocad commented 4 years ago

Rationale If a tree falls in the woods, does it make a sound? If a pure function mutates some local data in order to produce an immutable return value, is that ok?

I am the author of parcera a Clojure parser which uses Instaparse as its parsing engine. I have rejected the idea of using things like antlr or other parser generators simply because Instaparse is too good in comparison.

I have shared my parser with everyone with the hopes of making it useful and quickly found out that its performance is around 100 times slower than the other options 😞 . I have followed your performance notes and taken care that my grammar definition doesnt contain any ambiguities by using generative testing. Even so, I still cannot match or at least get close to the performance of those other tools. I know that Clojure is not expected to be as fast as other languages, yet I would like my parser to be as fast as possible as long as no ambiguities are inside it.

I have checked the gll.cljc file and saw that the trampoline data structure actually uses several atoms and persistent data structures. I think that it should be possible to get a big performance boost by swapping those by mutable data structures; hence the quote above from the docs of Clojure on transients.

Is this something that you would be interested in ?

Engelberg commented 4 years ago

Sorry to hear it hasn't been performant enough for your needs.

I'd certainly be supportive of exploring the question of whether mutable data structures would make a significant improvement in performance, but now that instaparse supports both Clojure and Clojurescript, it would be hard to merge in changes involving mutable data structures that aren't cross-platform. So it would have to be a very significant performance difference to warrant merging in something that's different on the two platforms and would make it harder to maintain the project in the future.

I think it's quite likely that a lot of the performance difference is due to the more flexible algorithm rather than the mutability of the data structures, but I don't have any evidence to back this theory up. My long-term vision for instaparse would be to support multiple parsing algorithms working on the same basic grammar descriptions. So, if you happen to have an LL grammar, you can run the algorithm that is fastest for LL grammars, etc. The current GLL algorithm would be the fallback when you don't have a grammar that neatly falls into one of those categories.

That's the vision, but realistically I don't have any idea when I'll have the chance to pursue this.

carocad commented 4 years ago

I think it's quite likely that a lot of the performance difference is due to the more flexible algorithm rather than the mutability of the data structures, but I don't have any evidence to back this theory up.

I actually gave this a try until some days ago. I was using only Java data structures to see how much of an improvement could I get and at the end found out exactly what you described above. The mutable data structures do give a performance improvement but not by much 15 to 20% on my quick benchmarks. I didnt do anything on Javascript since this was just an experiment so I would expect less of a gain there :/

So, if you happen to have an LL grammar, you can run the algorithm that is fastest for LL grammars, etc.

For the time being I am exploring antlr since the big part of creating a grammar was already done with Instaparse :) . If things go well I might consider creating a "translator" between instaparse and antlr for grammars that are non-ambigous. We will see where this experiments lead πŸ˜