HtmlUnit / htmlunit-cssparser

CSS parser used by HtmlUnit
Apache License 2.0
5 stars 9 forks source link

Lookahead(2) causes really poor parser performance #19

Closed svschouw-bb closed 1 year ago

svschouw-bb commented 1 year ago

Even with the recently improved :not handling the performance of the CSS parser is still really poor. Over 50% of the CPU of our HtmlUnit tests is spent parsing CSS, and if I interpret the profiler data correctly the cause is the LOOKAHEAD(2) in CSS3Parser.jj:1111

JavaCC seems to implement LOOKAHEAD(2) by just running a side-effect free version of the parser, and as soon as 2 tokens have been correctly parsed, it throws a pre-constructed LookaheadSuccess throwable, restores the parser state and starts parsing for real. Throwing exceptions is expensive and the method that throws this exception is really high in the "self-time" of the profiler (actually the method itself is nowhere to be found, but various other methods that call this method are; I presume the compiler inlines the jj_scan_token(int) method that does it).

My guess is that the LOOKAHEAD(2) is there to differentiate between div span (descendant selector) and div {. Since the presence of whitespace after the selector (div) does not tell you whether you are done parsing selectors yet. My experience with parsers is that in general you want to skip all whitespace from the token stream. This way instead of having to differentiate between <IDENT> <S> <IDENT> vs <IDENT> <S> <LBRACE> you differentiate between <IDENT> <IDENT> vs <IDENT> <LBRACE>, making the parser LL(1), which should improve the performance a lot.

Is there any reason to keep the whitespace in the token stream? Is there an annoying CSS feature that requires this?

If I have time I might look into creating a pull request, but my experience with JavaCC is mostly theoretical.

rbri commented 1 year ago

@svschouw-bb thanks for the report.

Maybe you can provide some samples to have test cases. And of course every pr is welcome.

svschouw-bb commented 1 year ago

Test cases are pretty hard to create, since this is a performance issue, not a bug. I hope the unittest set is complete enough that any refactor of the parser does not break existing CSS parsing.

rbri commented 1 year ago

there is always a test missing :-)

Will also try to have a look - maybe we can get rid of the lookahead but i have to wake up my memories first

svschouw-bb commented 1 year ago

Is there any reason to keep the whitespace in the token stream? Is there an annoying CSS feature that requires this?

Yes, there is, otherwise it is not possible to distinguish between div .foo and div.foo... sigh. This complicates matters.

rbri commented 1 year ago

css parsing is horrible - but i did already some minor cleanup

svschouw-bb commented 1 year ago

As far as I can tell, the following LOOKAHEAD can be used instead of LOOKAHEAD(2):

LOOKAHEAD(combinator() ( <IDENT> | <ASTERISK> | <HASH> | <DOT> | <LSQUARE> | <COLON> ))

Only downside is that you have to manually specify which are all the possible first tokens in simpleSelector. The upside is that there aren't that many.

Alternatively, using LOOKAHEAD(100) could also work. This makes sure the LookaheadSuccess is only thrown if after 100 tokens we're still in the simpleSelector. But parsing a 100 tokens is (probably) still faster than a throw/catch. The biggest downside is that it makes the parser considerably more complex, because it needs to create a shadow-copy of the parser of everything that can appear in a simpleSelector (with :not(<selector>) being the most complex case).

As for unittests, with the first solution there should probably be tests for all selector combinators.

One issue I'm having is that it's not that easy to go to HtmlUnit 3.0.0, because Spring still uses the 2.xx line. I have a local branch (without tests) that is based on 1.14.0. Both solutions significantly improve the test performance of our htmlunit-based tests (the CSS parser is all but disappeared from the profiler).

rbri commented 1 year ago

Have experimented today wit a similar solution. Looks like we are on the right path :-)

rbri commented 1 year ago

Hi @svschouw-bb

have added your an my solution to a new branch - see https://github.com/HtmlUnit/htmlunit-cssparser/commit/1b63adfba840d50cc7c0f7ed409a9dbf2498c002

I can't find any real difference between all these implementation but maybe you have some more test cases. Please have a look....

svschouw-bb commented 1 year ago

The difference is gigantic to me. I've turned the number of iterations on the peformance test down to 5, otherwise it takes too long. With the LOOKAHEAD(2) implementation it takes about 50 seconds to run. With the other two it takes about 5 seconds on my machine.

I think version 3 (isCombinator()) is just a manual version of my suggestion. I don't think there is an appreciable change in performance there. You might be able to simplify it a little bit, because as soon as you've found a "+", ">" or "~" you know you are in the branch (you don't need to scan ahead for the selector tokens). The <S> is really the only token where you're not sure yet. You could probably rewrite it by scanning for the first non-<S> token and see if it is ( <PLUS> | <GREATER> | <TILDE> | <IDENT> | <ASTERISK> | <HASH> | <DOT> | <LSQUARE> | <COLON>). That said, the lookahead could probably also be rewritten to (<S>? ( <PLUS> | <GREATER> | <TILDE> | <IDENT> | <ASTERISK> | <HASH> | <DOT> | <LSQUARE> | <COLON>)).

But all this makes no real difference in performance, because all the solutions don't rely on throw/catch. They just make the code less readable.

But you see no difference in performance between the LOOKAHEAD(2) and the others? I've tried on both Java 8 and Java 17, and for both the difference is gigantic.

rbri commented 1 year ago

Strange, i'm using openjdk8 here (and eclipse) and there was no notable difference. Will check it again in the afternoon.

svschouw-bb commented 1 year ago

How long doe the performance test take there? It could be an issue at our end?

rbri commented 1 year ago

This is the version with lookahead(2)

image

rbri commented 1 year ago

do you use any special vm or vm options?

rbri commented 1 year ago

your version image

rbri commented 1 year ago

my one image

svschouw-bb commented 1 year ago

Interesting... in maven it also runs very fast on my end... but in Eclipse very poorly.

I don't see any special vm options. Maybe "-ea", but even without it the performance is poor for lookahead(2).

svschouw-bb commented 1 year ago

Ah, found it: I always run tests in Eclipse in debug mode. The performance is generally comparable, and that way I can always debug things by setting breakpoints. But apparently in that way it does not optimize throw/catch flows.

I found another source of poor performance in our testsuite that was maven-specific. With that issue fixed our tests is maven run fairly quickly and the CSS3Parser fix hardly makes any difference at all.

So then it's only a matter of whether debug-performance matters enough to justify this fix.

rbri commented 1 year ago

Will think about it, but at least it was fun to dive (again) in the javacc mud

rbri commented 1 year ago

I think the lookahead(2) is the expected way for most readers - i like to stay with the current impl. And we have the branch for further reference.