mkustermann opened 1 year ago
Small addition: The reason why this matters is that the analyzer uses a string-based lexer, and certain analyzer configurations run in AOT mode. In AOT mode we don't inline `String.codeUnitAt()`, which makes the sub-string canonicalizer very slow.
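For context, a sub-string canonicalizer has to hash and compare a region of the source string one code unit at a time. A minimal sketch of that hot loop (not the SDK's actual implementation):

```dart
// Minimal sketch (not the SDK's actual implementation) of the hot loop in a
// sub-string canonicalizer: hashing a region of the source string one code
// unit at a time. Each iteration calls String.codeUnitAt(); if that call is
// not inlined (as in AOT mode), every character pays full call overhead.
int hashSubstring(String source, int start, int end) {
  var hash = 0;
  for (var i = start; i < end; i++) {
    hash = 0x1fffffff & (hash + source.codeUnitAt(i)); // per-character call
    hash = 0x1fffffff & (hash + ((0x0003ffff & hash) << 11));
    hash ^= hash >> 6;
  }
  return hash;
}
```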
Some profiles of the analyzer suggest that looking for doc comments triggers canonicalization of strings.

First, the analyzer code in question looks roughly like this:
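A paraphrased sketch, reconstructed from the points below - not the exact analyzer source:

```dart
// Paraphrase, not the exact analyzer source: walk the comment tokens
// preceding a token and inspect each one's lexeme.
import 'package:analyzer/dart/ast/token.dart';

Token? findDartDocComment(Token token) {
  Token? comment = token.precedingComments;
  while (comment != null) {
    // This is where the lazy lexeme gets materialized (sub-string creation
    // plus canonicalization).
    var lexeme = comment.lexeme;
    if (lexeme.startsWith('///') || lexeme.startsWith('/**')) {
      return comment;
    }
    comment = comment.next;
  }
  return null;
}
```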
The call to `comments.lexeme` forces lazy lexemes to be evaluated. There are several things here:

**Why do we need `lexeme` in the first place?**

Actually, the token class hierarchy has specific comment token classes. Why is this not a simple type check (something like `comment is DartdocToken`)?
**We can avoid getting `lexeme`**

Instead of getting `comments.lexeme`, which forces us to UTF-8 decode or sub-string, we could simply look at the underlying data to answer `lexeme.startsWith('///')`-style questions - no need for UTF-8 decoding / sub-string creation.
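A hypothetical helper showing what that direct check could look like:

```dart
// Hypothetical helper, not existing SDK code: answer "does this token's
// lexeme start with '///'?" straight from the source string, without
// materializing the lexeme (no UTF-8 decode, no sub-string allocation).
bool startsWithTripleSlash(String source, int offset, int end) {
  const slash = 0x2F; // code unit of '/'
  return end - offset >= 3 &&
      source.codeUnitAt(offset) == slash &&
      source.codeUnitAt(offset + 1) == slash &&
      source.codeUnitAt(offset + 2) == slash;
}
```

The three `codeUnitAt()` calls still pay the non-inlined call overhead in AOT mode, but that is far cheaper than decoding and canonicalizing the whole comment.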
**Why do we canonicalize comment tokens by default?**

Tokens are created in the string scanner:
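Roughly, the creation site looks like this (paraphrased from memory, names approximate - not a verbatim copy):

```dart
// Paraphrased creation site in the string scanner (names approximate).
// `string`, `scanOffset`, and `tokenStart` are fields of the scanner.
CommentToken createCommentToken(TokenType type, int start, bool asciiOnly,
    [int extraOffset = 0]) {
  return CommentToken.fromSubstring(
      type, string, start, scanOffset + extraOffset, tokenStart,
      // Every comment lexeme goes through the canonicalizer:
      canonicalize: true);
}
```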
We can see `canonicalize: true` here - though most comments will hopefully be unique, so canonicalizing them rarely pays off. It seems some simple heuristics could be applied, e.g. skipping canonicalization for comment tokens or for long lexemes.
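For instance (a hypothetical sketch, not existing SDK code):

```dart
// Hypothetical heuristic, not existing SDK code: comments rarely repeat,
// so skip the canonicalizer for comment tokens; elsewhere, only
// canonicalize short lexemes, where the hash/compare pass is cheap and
// repetition (identifiers, keywords) is likely. TokenType here is the
// scanner's token type.
bool shouldCanonicalize(TokenType type, int length) {
  if (type == TokenType.SINGLE_LINE_COMMENT ||
      type == TokenType.MULTI_LINE_COMMENT) {
    return false;
  }
  return length <= 64; // arbitrary cutoff for illustration
}
```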
Labeling as `area-front-end` for now.

/cc @jensjoha @johnniwinther @scheglov