dart-lang / sdk

The Dart SDK, including the VM, JS and Wasm compilers, analysis, core libraries, and more.
https://dart.dev

Evaluate whether we should always canonicalize comment tokens in lexer #51344

Open mkustermann opened 1 year ago

mkustermann commented 1 year ago

From some analyzer profiles, it appears that when the analyzer looks for doc comments, it triggers canonicalization of strings:

canonicalizeSubString
  - 1.35% StringTokenImpl.lexeme
     - 0.25% Parser.findDartDoc
        - AstBuilder._findComment

First, the analyzer code looks like this:

Token? findDartDoc(Token token) {
  ...
  while (comments != null) {
    String lexeme = comments.lexeme;
    if (lexeme.startsWith('///')) {
      ...
    } else if (lexeme.startsWith('/**')) {
      ...
    }
  }
  ...
}

The call to comments.lexeme forces the lazily computed lexeme to be evaluated.
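
To make the cost concrete, here is a minimal sketch (an assumption for illustration, not the actual StringTokenImpl) of how a lazy lexeme behaves: the substring is only materialized, and potentially canonicalized, on first access.

// A minimal sketch of a lazily computed lexeme; LazyToken is a
// hypothetical stand-in for StringTokenImpl.
class LazyToken {
  LazyToken(this._source, this._start, this._end);

  final String _source;
  final int _start;
  final int _end;
  String? _lexeme;

  // The first access pays for the substring (and any canonicalization);
  // later accesses return the cached value.
  String get lexeme => _lexeme ??= _source.substring(_start, _end);
}

void main() {
  final token = LazyToken('/// A doc comment.', 0, 3);
  print(token.lexeme); // '///' (allocated here, on first use)
}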

There are several things to consider here.

Why do we need the lexeme in the first place?

The token class hierarchy actually has dedicated comment token classes. Why is this not written as the following?

Token? findDartDoc(Token token) {
  ...
  while (comments != null) {
    if (comments is DocumentationComment) {
      ...
    }
  }
  ...
}


We can avoid fetching the lexeme entirely

Instead of calling comments.lexeme, which forces a UTF-8 decode or substring creation, we could simply look at the underlying data to answer lexeme.startsWith('///') questions; no UTF-8 decoding or substring allocation would be needed.
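
As a sketch of that idea (the helper and its parameters are illustrative assumptions, not the analyzer's API), a prefix question can be answered directly against the token's source region:

// Check a prefix against the underlying source without materializing
// the lexeme; regionStartsWith is a hypothetical helper.
bool regionStartsWith(String source, int start, int end, String prefix) {
  if (end - start < prefix.length) return false;
  for (var i = 0; i < prefix.length; i++) {
    // Compare code units in place; no substring or UTF-8 decode required.
    if (source.codeUnitAt(start + i) != prefix.codeUnitAt(i)) return false;
  }
  return true;
}

void main() {
  const source = '/** block doc */ /// line doc';
  print(regionStartsWith(source, 0, 16, '/**'));             // true
  print(regionStartsWith(source, 17, source.length, '///')); // true
}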

Why do we canonicalize comment tokens by default?

Tokens are created in the string scanner:

  @override
  CommentToken createCommentToken(TokenType type, int start, bool asciiOnly,
      [int extraOffset = 0]) {
    return new CommentTokenImpl.fromSubstring(
        type, string, start, scanOffset + extraOffset, tokenStart,
        canonicalize: true);
  }

  @override
  DartDocToken createDartDocToken(TokenType type, int start, bool asciiOnly,
      [int extraOffset = 0]) {
    return new DartDocToken.fromSubstring(
        type, string, start, scanOffset + extraOffset, tokenStart,
        canonicalize: true);
  }

We can see canonicalize: true here, even though most comments will hopefully be unique. It seems some simple heuristics could be applied; one possibility is sketched below.
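
One purely illustrative heuristic (an assumption on my part, not something the SDK does) would be a length threshold: short comments such as // TODO markers repeat often, while long doc comments are usually unique.

// Hypothetical heuristic: only intern comments below a length threshold.
const int kMaxCanonicalizedCommentLength = 32; // Assumed; tune from profiles.

bool shouldCanonicalizeComment(int start, int end) =>
    end - start <= kMaxCanonicalizedCommentLength;

void main() {
  const source =
      '// TODO\n/// A long, almost certainly unique documentation comment.';
  print(shouldCanonicalizeComment(0, 7));             // true: short, repetitive
  print(shouldCanonicalizeComment(8, source.length)); // false: long doc comment
}

Whether such a threshold actually pays off would need to be validated against real profiles.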

Labeling as area-front-end for now.

/cc @jensjoha @johnniwinther @scheglov

mkustermann commented 1 year ago

Small addition: the reason this matters is that the analyzer uses the string-based lexer, and certain analyzer configurations run in AOT mode. In AOT mode we don't inline String.codeUnitAt(), which makes the substring canonicalizer very slow.
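
To illustrate why that matters, here is a minimal sketch (an assumption, not the SDK's actual canonicalizer) of the per-character work that interning a substring requires:

// Hashing a region of the source one code unit at a time; hashRegion is
// a hypothetical stand-in for the canonicalizer's inner loop.
int hashRegion(String source, int start, int end) {
  var hash = 5381;
  for (var i = start; i < end; i++) {
    // One codeUnitAt call per character. If the call is not inlined, as in
    // AOT mode here, every character pays full call overhead.
    hash = 0x3fffffff & (hash * 33 + source.codeUnitAt(i));
  }
  return hash;
}

void main() {
  const source = '/// some doc comment';
  print(hashRegion(source, 0, source.length));
}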