carbon-language / carbon-lang

Carbon Language's main repository: documents, design, implementation, and related tools. (NOTE: Carbon Language is experimental; see README)
https://github.com/carbon-language/carbon-lang/blob/trunk/README.md
Other
32.31k stars 1.48k forks source link

Move information on the token associated with a parse node from the `.def` file into the typed node. #4001

Closed zygoloid closed 1 month ago

zygoloid commented 1 month ago

Instead of tracking the token associated with a parse node in the .def file macro, track it on the typed node instead. List the token as a field inside the node structure to show the order of the token relative to the other components of the grammar production, and to allow the token index to be accessed when the node is extracted.

Remove the corresponding information from the .def file, leaving behind just a list of parse node kinds in the majority of cases.

This also removes the checking of the token kind associated with a parse node in the case where the parse node has errors. Previously we had a flag on the node kind to indicate whether we should check this, but per discord discussion, we have decided to remove this.

jonmeow commented 1 month ago

Note, in consideration for my comments on #4000, I would suggest considering the Define comments as they apply here.

zygoloid commented 1 month ago

Note, in consideration for my comments on #4000, I would suggest considering the Define comments as they apply here.

They don't apply here -- this PR doesn't change Define now that the dependent PR has landed.

zygoloid commented 1 month ago

Had you considered instead tracking the TokenKind as part of the NodeKind definition? What are we getting by creating the separate verification path?

For me, listing the token kind at the right place in the struct makes the typed node definitions significantly more useful and readable: these structs can then be read directly as grammar productions. For example:

struct MemberAccessExpr {
  static constexpr auto Kind = NodeKind::MemberAccessExpr.Define(
      {.category = NodeCategory::Expr, .child_count = 2});

  AnyExprId lhs;
  Token<Lex::TokenKind::Period> token;
  AnyMemberNameOrMemberExprId rhs;
};

corresponds to the grammar rule

member-access-expr = any-expr "." any-member-name-or-member-expr

which we can now read directly from the struct members.

Instead of having a Token<K> wrapping a TokenIndex, we could consider having separate TokenIndex-like types for each kind of token index that has a known token kind, in the same way we do for NodeId. I'm not sure that we would get enough benefit from having more strongly typed TokenIndexes to justify the costs there, but I'll prototype it and see how it looks.

zygoloid commented 1 month ago

Instead of having a Token<K> wrapping a TokenIndex, we could consider having separate TokenIndex-like types for each kind of token index that has a known token kind, in the same way we do for NodeId. I'm not sure that we would get enough benefit from having more strongly typed TokenIndexes to justify the costs there, but I'll prototype it and see how it looks.

Here is a prototype of how that might look.

jonmeow commented 1 month ago

FWIW, for me, the prototype reads better to me: compared with NamedConstraintDefinitionStartId, Lex::CloseCurlyBraceTokenIndex also feels more consistent style-wise than Token<Lex::TokenKind::CloseCurlyBrace>. The amount of code looks similar.

I agree we don't want to spend more compute on lots of strong-typing (though maybe for this area, where we're already doing it for parse nodes, it's actually okay to do it for tokens too?).

To note, for IfValid, maybe something like:

// A lightweight handle to a lexed token in a `TokenizedBuffer` whose kind is
// known to be `Kind`.
template <const TokenKind& K>
struct TokenIndexForKindBase : public TokenIndex {
  // NOLINTNEXTLINE(readability-identifier-naming)
  static const TokenKind& Kind;
  constexpr explicit TokenIndexForKind(TokenIndex index) : TokenIndex(index) {}
};
template <const TokenKind& K>
const TokenKind& TokenIndexForKindBase<K>::Kind = K;

// Used when the kind is always `Kind`.
template <const TokenKind& K>
struct TokenIndexForKind : TokenIndexForKindBase<K> {
  static constexpr bool IfValid = false;
};
// Used when the kind is only required to be `Kind` when the corresponding parse node doesn't have an error.
template <const TokenKind& K>
struct TokenIndexForKindIfValid : TokenIndexForKindBase<K> {
  static constexpr bool IfValid = true;
};

#define CARBON_TOKEN(TokenName) \
  using TokenName##TokenIndex = TokenIndexForKind<TokenKind::TokenName>;
  using TokenName##TokenIndexIfValid = TokenIndexForKindIfValid<TokenKind::TokenName>;
#include "toolchain/lex/token_kind.def"

Also, I'd also suggest doing something with AnyToken, not sure if you just ignored it as part of the prototype. Either renaming AnyTokenIndex for consistency, or just using TokenIndex directly to avoid the extra typedef.

zygoloid commented 1 month ago

How would you feel about proceeding with this PR with the Token<K> approach, and adding the typed TokenIndexes as a follow-up change? I think there's more cleanup to do before that'll be ready to go, and I'm worried about this backsliding.