matthew-dean opened 1 year ago
So, I think part of the performance hit was actually (possibly?) related to #6. I'm not sure what I did, but after some refactoring, I'm no longer getting such a massive performance drain when building errors. I still feel like the number of alternations should be capped somehow 🤔
@msujew Okay, I've narrowed this down quite a bit, and I've discovered it's not just a performance issue upon errors: some particular parsing paths cause hugely degraded performance in this package. (I'll happily concede that it may actually be a mistake in the grammar, but no errors / validation problems are thrown, and this occurs when a file is parsed correctly.)
I've frozen a branch here. If you check it out and do a `pnpm install` in the root, then build `css-parser` (run `pnpm build` in the `css-parser` folder), you can go into the `less-parser` branch and run `pnpm test`.
I've narrowed it down to a single test case. It's a `.less` file that's actually very short. However, by shadowing the `LLStarLookaheadStrategy` class and logging functions, I've discovered that `chevrotain-allstar` can spend up to 12 seconds calling the function returned by `buildLookaheadForAlternation`. This results in 20 seconds (on an M1 Mac) parsing a file, the contents of which are here.
Any idea what's going on here?
I'm going to try to take a look tomorrow to see if I can set some breakpoints and do some logging and figure out what exactly is happening in that function that is taking 12 seconds.
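One low-tech way to do that measurement (a generic sketch, not part of either library's API — `timed` and the `totals` map are invented names) is to wrap the function returned by the lookahead strategy in a timing proxy and accumulate elapsed time per label:

```typescript
// Hypothetical instrumentation helper: wraps any function so each call is
// timed, accumulating total elapsed milliseconds per label. Applied to the
// function returned by buildLookaheadForAlternation, it shows where the
// parse time goes without adding breakpoints.
function timed<A extends unknown[], R>(
  label: string,
  fn: (...args: A) => R,
  totals: Map<string, number>
): (...args: A) => R {
  return (...args: A): R => {
    const start = Date.now();
    try {
      return fn(...args);
    } finally {
      totals.set(label, (totals.get(label) ?? 0) + (Date.now() - start));
    }
  };
}
```

Usage would be to shadow the strategy, wrap its returned lookahead function, e.g. `timed('lookahead', original, totals)`, and dump `totals` after a parse.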
@matthew-dean I cannot reproduce the performance degradation you're experiencing (on a few years old Intel). I wrote a small test for the file in question:
```ts
describe.only('can parse colors.less file quickly', () => {
  console.time('colors.less');
  const result = fs.readFileSync(__dirname + '/css/colors.less', 'utf8');
  const { cst, lexerResult, parser } = cssParser.parse(result);
  expect(lexerResult.errors.length).toBe(0);
  expect(parser.errors.length).toBe(0);
  console.timeEnd('colors.less');
});
```
And received:
```
colors.less: 63.231ms
```
Seems pretty slow, but that's mostly due to reading from disk and some warmup in the lookahead. The parser can probably be optimized quite a bit. Reading the file a second time right after the first call yields:
```
colors.less: 5.938ms
```
Note that, for a successful parse, I had to change the comment terminal so that it also tokenizes `//` single-line comments correctly:
```diff
-['comment', '\\/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*\\/'],
+['comment', '(\\/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*\\/|\\/\\/[^\\n\\r]*)'],
```
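The updated terminal can be sanity-checked in isolation. Below is the same pattern written as a native regex literal (unescaped from the string form in the diff) — a quick check, not part of the parser code:

```typescript
// The updated comment terminal as a regex literal: matches either a
// /* ... */ block comment or a // line comment (up to end of line).
const comment = /\/\*[^*]*\*+(?:[^/*][^*]*\*+)*\/|\/\/[^\n\r]*/;

console.log(comment.test('/* block */'));     // true
console.log(comment.test('// line comment')); // true
console.log(comment.test('color: red;'));     // false
```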
@msujew That's the wrong parser. The CSS parser doesn't demonstrate that slow-down; it was the Less parser. That's also why comments were not parsing in `colors.less`, as CSS doesn't support line comments.
Here's what I've discovered so far.
This package really grinds to a halt with recursion, because it will predict paths per alt that encompass the remainder of the file.
Originally, I had this, which matches the CSS spec exactly:
```ts
$.RULE('declarationList', () => {
  $.OR({
    DEF: [
      {
        ALT: () => {
          $.OPTION(() => $.SUBRULE($.declaration))
          $.OPTION2(() => {
            $.CONSUME(T.Semi)
            $.SUBRULE3($.declarationList)
          })
        }
      },
      {
        ALT: () => {
          $.SUBRULE($.innerAtRule)
          $.SUBRULE($.declarationList)
        }
      },
      {
        ALT: () => {
          $.SUBRULE($.qualifiedRule, { ARGS: [{ inner: true }] })
          $.SUBRULE2($.declarationList)
        }
      }
    ]
  })
})
```
This mirrors the railroad diagrams for a declaration list.
This didn't cause a problem until extending into the Less parser. Less / Sass allow identifiers for inner qualified rules, which means long-chain ambiguous rule starts (`a:b d e f g h i j k l m n o p` is either an entire declaration, or the start of an inner qualified rule in Less/Sass).
The combination of those two things is what caused `chevrotain-allstar` to bork entirely and take 20 seconds to parse a file.
I intuited what this package was doing when I had logging turned on: the "ambiguous alternatives" warning was not only long, but contained all tokens for the remainder of the file, which it termed "prefixes".
I re-tooled the above `declarationList` to look like this, which has the exact same parsing outcome but eliminates recursion:
```ts
$.RULE('declarationList', () => {
  let needsSemi = false
  $.MANY({
    GATE: () => !needsSemi || (needsSemi && $.LA(1).tokenType === T.Semi),
    DEF: () => {
      $.OR([
        {
          ALT: () => {
            $.SUBRULE($.declaration)
            needsSemi = true
          }
        },
        {
          ALT: () => {
            $.OR2([
              { ALT: () => $.SUBRULE($.innerAtRule) },
              { ALT: () => $.SUBRULE2($.qualifiedRule, { ARGS: [{ inner: true }] }) },
              { ALT: () => $.CONSUME(T.Semi) }
            ])
            needsSemi = false
          }
        }
      ])
    }
  })
})
```
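The effect of that rewrite can be illustrated outside Chevrotain with a toy list parser (a hypothetical sketch; none of these names come from the grammar above). The recursive formulation re-enters the rule once per item, so a predictor that expands each alternative can walk to the end of the input; the loop below keeps each decision local to a single item, mirroring the `needsSemi` gate:

```typescript
// Toy iterative equivalent of a recursive declaration-list rule: consume
// one declaration per loop turn, and after a declaration only continue the
// list if the next token is a semicolon (the "gate").
function parseDeclarationList(tokens: string[]): string[] {
  const decls: string[] = [];
  let i = 0;
  let needsSemi = false;
  while (i < tokens.length) {
    if (needsSemi) {
      if (tokens[i] !== ';') break; // gate: only ';' continues the list
      i++;
      needsSemi = false;
      continue;
    }
    if (tokens[i] === ';') { i++; continue; } // empty declaration
    decls.push(tokens[i]);
    i++;
    needsSemi = true;
  }
  return decls;
}
```

Each lookahead decision here inspects at most one token, instead of predicting a path through every remaining recursive invocation.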
I guess I sort of mistakenly figured that this package would handle ambiguity, because it handles longer-path prediction, but anytime ambiguity happens, `chevrotain-allstar` shows a noticeable uptick in processing time. When it's ambiguity + recursion, it really falls apart.
The good news is that with that refactoring, speed isn't as much of an issue. Parsing of `colors.less` is, right now, around 58ms. ~~I still have weird ambiguity errors where there isn't any ambiguity, AFAICT, but I can discuss in a separate issue.~~ (Derp, nope, there was ambiguity lol.)
I appreciate your help!
I see, good to know 👍
Yeah, as mentioned in the other issue, having ambiguities present (especially those with long prefixes) in your grammar can be quite the performance bottleneck. While chevrotain-allstar fixes some of the ambiguity issues appearing in the LL(k) strategy, there are some true ambiguities that just cannot be resolved until they are fully evaluated.
I assume that with the recursion case, you actually hit the theoretical worst-case boundary of the ALL(*) algorithm. Normally, ALL(*) is supposed to operate in O(n), but the worst case makes it explode into O(n^4). It's outlined in the original paper.
@msujew
> While chevrotain-allstar fixes some of the ambiguity issues appearing in the LL(k) strategy, there are some true ambiguities that just cannot be resolved until they are fully evaluated.
This is really good to know! Thank you!
One thing I've noticed when using this lookahead strategy is that while the successful parses are very fast, the failures are orders of magnitude slower than Chevrotain's default. I thought maybe it was during error logging, but I built a custom error message provider, and that had no effect.
In some of my debugging, I think this is because the alternations that are built can contain hundreds of entries by the time they're sent to `buildNoViableAltMessage`, whereas Chevrotain's default lookahead strategy might produce 10 for the same `OR` block. There's obviously no benefit to an error message handler receiving hundreds or thousands of possible paths, as no useful error message could include all of them, so could that be scaled back to a max limit of paths? (Maybe configurable?)
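The requested cap could look something like the sketch below. The nested array shape is assumed to mirror Chevrotain's `expectedPathsPerAlt` (a list of token-type paths per alternative), and `MAX_PATHS_PER_ALT` is an invented limit, not a real option in either library:

```typescript
// Hypothetical cap: trim each alternative's list of predicted paths before
// handing them to an error message builder, so the handler sees at most a
// small, fixed number of example paths per alternative.
const MAX_PATHS_PER_ALT = 10;

function capExpectedPaths<T>(pathsPerAlt: T[][][]): T[][][] {
  return pathsPerAlt.map((altPaths) => altPaths.slice(0, MAX_PATHS_PER_ALT));
}
```

Since the extra paths carry no information a human-readable message can use, truncating them before message construction avoids the quadratic-feeling cost of formatting them all.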
Addendum
So, I've found another side effect of this problem, which is backtracking. I know Chevrotain recommends against backtracking if at all possible, but because I had a custom error message provider, I kept seeing `buildNoViableAltMessage` called with no error subsequently thrown. I figured out that error messages are built even if they aren't used. So backtracking, which is already warned to have poor performance, is exacerbated greatly when combined with `chevrotain-allstar`: it takes Chevrotain a long time to recover from a backtracking rule because the thread is locked by this package building expected paths.