dart-lang / sdk

The Dart SDK, including the VM, JS and Wasm compilers, analysis, core libraries, and more.
https://dart.dev

Evaluate whether we should always canonicalize comment tokens in lexer #51344

Open mkustermann opened 1 year ago

mkustermann commented 1 year ago

From some analyzer profiles, it appears that when the analyzer looks for doc comments, it triggers canonicalization of strings:

canonicalizeSubString
  - 1.35% StringTokenImpl.lexeme
     - 0.25% Parser.findDartDoc
        - AstBuilder._findComment

First, the analyzer code looks like this:

Token? findDartDoc(Token token) {
  ...
  while (comments != null) {
    String lexeme = comments.lexeme;
    if (lexeme.startsWith('///')) {
      ...
    } else if (lexeme.startsWith('/**')) {
      ...
    }
  }
  ...
}

The call to comments.lexeme forces the lazily computed lexeme to be evaluated.
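
To make the cost concrete, here is a minimal sketch (an assumption for illustration, not the actual StringTokenImpl) of how a lazy lexeme behaves: the substring is only materialized, and potentially canonicalized, on first access.

// A minimal sketch of a lazily computed lexeme; LazyToken is a
// hypothetical stand-in for StringTokenImpl.
class LazyToken {
  LazyToken(this._source, this._start, this._end);

  final String _source;
  final int _start;
  final int _end;
  String? _lexeme;

  // The first access pays for the substring (and any canonicalization);
  // later accesses return the cached value.
  String get lexeme => _lexeme ??= _source.substring(_start, _end);
}

void main() {
  final token = LazyToken('/// A doc comment.', 0, 3);
  print(token.lexeme); // '///' (allocated here, on first use)
}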

There are several things to consider here.

Why do we need the lexeme in the first place?

The token class hierarchy actually has dedicated comment token classes. Why is this not written as the following?

Token? findDartDoc(Token token) {
  ...
  while (comments != null) {
    if (comments is DocumentationComment) {
      ...
    }
  }
  ...
}


We can avoid fetching the lexeme entirely

Instead of calling comments.lexeme, which forces a UTF-8 decode or substring creation, we could simply look at the underlying data to answer lexeme.startsWith('///') questions; no UTF-8 decoding or substring allocation would be needed.
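
As a sketch of that idea (the helper and its parameters are illustrative assumptions, not the analyzer's API), a prefix question can be answered directly against the token's source region:

// Check a prefix against the underlying source without materializing
// the lexeme; regionStartsWith is a hypothetical helper.
bool regionStartsWith(String source, int start, int end, String prefix) {
  if (end - start < prefix.length) return false;
  for (var i = 0; i < prefix.length; i++) {
    // Compare code units in place; no substring or UTF-8 decode required.
    if (source.codeUnitAt(start + i) != prefix.codeUnitAt(i)) return false;
  }
  return true;
}

void main() {
  const source = '/** block doc */ /// line doc';
  print(regionStartsWith(source, 0, 16, '/**'));             // true
  print(regionStartsWith(source, 17, source.length, '///')); // true
}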

Why do we canonicalize comment tokens by default?

Tokens are created in the string scanner:

  @override
  CommentToken createCommentToken(TokenType type, int start, bool asciiOnly,
      [int extraOffset = 0]) {
    return new CommentTokenImpl.fromSubstring(
        type, string, start, scanOffset + extraOffset, tokenStart,
        canonicalize: true);
  }

  @override
  DartDocToken createDartDocToken(TokenType type, int start, bool asciiOnly,
      [int extraOffset = 0]) {
    return new DartDocToken.fromSubstring(
        type, string, start, scanOffset + extraOffset, tokenStart,
        canonicalize: true);
  }

We can see canonicalize: true here, even though most comments will hopefully be unique. It seems some simple heuristics could be applied; one possibility is sketched below.
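
One purely illustrative heuristic (an assumption on my part, not something the SDK does) would be a length threshold: short comments such as // TODO markers repeat often, while long doc comments are usually unique.

// Hypothetical heuristic: only intern comments below a length threshold.
const int kMaxCanonicalizedCommentLength = 32; // Assumed; tune from profiles.

bool shouldCanonicalizeComment(int start, int end) =>
    end - start <= kMaxCanonicalizedCommentLength;

void main() {
  const source =
      '// TODO\n/// A long, almost certainly unique documentation comment.';
  print(shouldCanonicalizeComment(0, 7));             // true: short, repetitive
  print(shouldCanonicalizeComment(8, source.length)); // false: long doc comment
}

Whether such a threshold actually pays off would need to be validated against real profiles.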

Labeling as area-front-end for now.

/cc @jensjoha @johnniwinther @scheglov

mkustermann commented 1 year ago

Small addition: the reason this matters is that the analyzer uses the string-based lexer, and certain analyzer configurations run in AOT mode. In AOT mode we don't inline String.codeUnitAt(), which makes the substring canonicalizer very slow.
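
To illustrate why that matters, here is a minimal sketch (an assumption, not the SDK's actual canonicalizer) of the per-character work that interning a substring requires:

// Hashing a region of the source one code unit at a time; hashRegion is
// a hypothetical stand-in for the canonicalizer's inner loop.
int hashRegion(String source, int start, int end) {
  var hash = 5381;
  for (var i = start; i < end; i++) {
    // One codeUnitAt call per character. If the call is not inlined, as in
    // AOT mode here, every character pays full call overhead.
    hash = 0x3fffffff & (hash * 33 + source.codeUnitAt(i));
  }
  return hash;
}

void main() {
  const source = '/// some doc comment';
  print(hashRegion(source, 0, source.length));
}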