Closed svschouw-bb closed 1 year ago
@svschouw-bb thanks for the report.
Maybe you can provide some samples to have test cases. And of course every pr is welcome.
Test cases are pretty hard to create, since this is a performance issue, not a bug. I hope the unittest set is complete enough that any refactor of the parser does not break existing CSS parsing.
there is always a test missing :-)
Will also try to have a look - maybe we can get rid of the lookahead but i have to wake up my memories first
Is there any reason to keep the whitespace in the token stream? Is there an annoying CSS feature that requires this?
Yes, there is, otherwise it is not possible to distinguish between div .foo
and div.foo
... sigh. This complicates matters.
css parsing is horrible - but i did already some minor cleanup
As far as I can tell, the following LOOKAHEAD can be used instead of LOOKAHEAD(2)
:
LOOKAHEAD(combinator() ( <IDENT> | <ASTERISK> | <HASH> | <DOT> | <LSQUARE> | <COLON> ))
Only downside is that you have to manually specify which are all the possible first tokens in simpleSelector
. The upside is that there aren't that many.
Alternatively, using LOOKAHEAD(100)
could also work. This makes sure the LookaheadSuccess
is only thrown if after 100 tokens we're still in the simpleSelector
. But parsing a 100 tokens is (probably) still faster than a throw/catch. The biggest downside is that it makes the parser considerably more complex, because it needs to create a shadow-copy of the parser of everything that can appear in a simpleSelector (with :not(<selector>)
being the most complex case).
As for unittests, with the first solution there should probably be tests for all selector combinators.
One issue I'm having is that it's not that easy to go to HtmlUnit 3.0.0, because Spring still uses the 2.xx line. I have a local branch (without tests) that is based on 1.14.0. Both solutions significantly improve the test performance of our htmlunit-based tests (the CSS parser is all but disappeared from the profiler).
Have experimented today wit a similar solution. Looks like we are on the right path :-)
Hi @svschouw-bb
have added your an my solution to a new branch - see https://github.com/HtmlUnit/htmlunit-cssparser/commit/1b63adfba840d50cc7c0f7ed409a9dbf2498c002
I can't find any real difference between all these implementation but maybe you have some more test cases. Please have a look....
The difference is gigantic to me. I've turned the number of iterations on the peformance test down to 5, otherwise it takes too long. With the LOOKAHEAD(2)
implementation it takes about 50 seconds to run. With the other two it takes about 5 seconds on my machine.
I think version 3 (isCombinator()
) is just a manual version of my suggestion. I don't think there is an appreciable change in performance there. You might be able to simplify it a little bit, because as soon as you've found a "+", ">" or "~" you know you are in the branch (you don't need to scan ahead for the selector tokens). The <S>
is really the only token where you're not sure yet. You could probably rewrite it by scanning for the first non-<S>
token and see if it is ( <PLUS> | <GREATER> | <TILDE> | <IDENT> | <ASTERISK> | <HASH> | <DOT> | <LSQUARE> | <COLON>)
. That said, the lookahead could probably also be rewritten to (<S>? ( <PLUS> | <GREATER> | <TILDE> | <IDENT> | <ASTERISK> | <HASH> | <DOT> | <LSQUARE> | <COLON>))
.
But all this makes no real difference in performance, because all the solutions don't rely on throw/catch
. They just make the code less readable.
But you see no difference in performance between the LOOKAHEAD(2)
and the others? I've tried on both Java 8 and Java 17, and for both the difference is gigantic.
Strange, i'm using openjdk8 here (and eclipse) and there was no notable difference. Will check it again in the afternoon.
How long doe the performance test take there? It could be an issue at our end?
This is the version with lookahead(2)
do you use any special vm or vm options?
your version
my one
Interesting... in maven it also runs very fast on my end... but in Eclipse very poorly.
I don't see any special vm options. Maybe "-ea", but even without it the performance is poor for lookahead(2).
Ah, found it: I always run tests in Eclipse in debug mode. The performance is generally comparable, and that way I can always debug things by setting breakpoints. But apparently in that way it does not optimize throw/catch flows.
I found another source of poor performance in our testsuite that was maven-specific. With that issue fixed our tests is maven run fairly quickly and the CSS3Parser fix hardly makes any difference at all.
So then it's only a matter of whether debug-performance matters enough to justify this fix.
Will think about it, but at least it was fun to dive (again) in the javacc mud
I think the lookahead(2) is the expected way for most readers - i like to stay with the current impl. And we have the branch for further reference.
Even with the recently improved
:not
handling the performance of the CSS parser is still really poor. Over 50% of the CPU of our HtmlUnit tests is spent parsing CSS, and if I interpret the profiler data correctly the cause is theLOOKAHEAD(2)
in CSS3Parser.jj:1111JavaCC seems to implement
LOOKAHEAD(2)
by just running a side-effect free version of the parser, and as soon as 2 tokens have been correctly parsed, it throws a pre-constructedLookaheadSuccess
throwable, restores the parser state and starts parsing for real. Throwing exceptions is expensive and the method that throws this exception is really high in the "self-time" of the profiler (actually the method itself is nowhere to be found, but various other methods that call this method are; I presume the compiler inlines thejj_scan_token(int)
method that does it).My guess is that the
LOOKAHEAD(2)
is there to differentiate betweendiv span
(descendant selector) anddiv {
. Since the presence of whitespace after the selector (div
) does not tell you whether you are done parsing selectors yet. My experience with parsers is that in general you want to skip all whitespace from the token stream. This way instead of having to differentiate between<IDENT> <S> <IDENT>
vs<IDENT> <S> <LBRACE>
you differentiate between<IDENT> <IDENT>
vs<IDENT> <LBRACE>
, making the parser LL(1), which should improve the performance a lot.Is there any reason to keep the whitespace in the token stream? Is there an annoying CSS feature that requires this?
If I have time I might look into creating a pull request, but my experience with JavaCC is mostly theoretical.