Closed RyanGlScott closed 1 week ago
Some assorted notes that I took while investigating this:
`language-rust`'s lexer implementation is based on the work in https://github.com/rust-lang/rust/pull/24620, which uses an ANTLR-based Unicode lexer. `language-rust` copied the ANTLR-based lexer tables directly into its own lexer implementation. The thing is, I'm not entirely convinced that it did so correctly. This is because the ANTLR-based tables encode Unicode characters using UTF-16, but `language-rust`'s lexer is generated from `alex`, which encodes Unicode characters using UTF-8. For sufficiently small character codepoints, these encodings coincide, but for larger codepoints, they do not.
As a specific example where this goes wrong, consider the 𐌝 character, which uses the 0x1031D codepoint. In UTF-16, this is encoded with the surrogate pair (0xD800, 0xDF1D), which should be covered by this line in `language-rust`'s lexer. Despite this, `language-rust` is unable to lex this program:

    // test.rs
    fn main() {
        let 𐌝 = ();
        𐌝
    }

    $ runghc Main.hs
    Left (parse failure at 3:9 (lexical error))

As such, I think `language-rust`'s lexer is broken for any Unicode character that requires surrogate pairs to encode in UTF-16, that is, any character whose codepoint exceeds 0xFFFF.
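
To make the mismatch concrete, here is a minimal sketch that computes both encodings of the 0x1031D codepoint. The helper names `surrogate_pair` and `utf8_bytes` are made up for illustration and are not part of `language-rust` or `rustc`.

```rust
/// Split a codepoint above 0xFFFF into its UTF-16 surrogate pair.
/// Returns None for codepoints that fit in a single UTF-16 code unit
/// (or that lie outside the Unicode range).
fn surrogate_pair(cp: u32) -> Option<(u16, u16)> {
    if !(0x10000..=0x10FFFF).contains(&cp) {
        return None;
    }
    let v = cp - 0x10000;
    let high = 0xD800 + (v >> 10) as u16; // top 10 bits
    let low = 0xDC00 + (v & 0x3FF) as u16; // bottom 10 bits
    Some((high, low))
}

/// The UTF-8 bytes of a character, via the standard library.
fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

fn main() {
    let c = '\u{1031D}';
    let (hi, lo) = surrogate_pair(c as u32).expect("codepoint is above 0xFFFF");
    // UTF-16 code units: (0xD800, 0xDF1D)
    println!("UTF-16: ({:#06X}, {:#06X})", hi, lo);
    // UTF-8 bytes: F0 90 8C 9D, none of which resemble the surrogate values
    println!("UTF-8:  {:02X?}", utf8_bytes(c));
}
```

A table entry written in terms of the two surrogate values therefore describes something quite different from the four bytes that a UTF-8 based lexer actually sees for this character.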
Modern versions of `rustc` no longer use the ANTLR-based lexer linked above, but instead use a completely different lexer implementation based on these tables (which, in turn, are derived from the data on the official Unicode website). Notably, these tables are not UTF-16-encoded, so they would be much easier to translate to an `alex`-based lexer.
I propose that we rewrite `language-rust`'s lexer in terms of the data from the Unicode website, similarly to how `rustc`'s modern lexer works. The `rustc` lexer implementation generates its tables using this script, so perhaps we can adapt this script to generate `alex` code. Scripting this would also make it much more straightforward to upgrade Unicode versions in the future. (Currently, the script uses Unicode 15.1.0.)
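
As a rough sketch of what such a generator could look like (this is not the `rustc` script, and the names `parse_line` and `alex_set` are hypothetical), the following reads lines in the `range ; property # comment` format used by the Unicode Character Database property files and prints an `alex` character-set macro. It assumes that `alex` accepts `\x` hexadecimal escapes and `lo-hi` ranges inside `[...]`, and that codepoints are written directly rather than as surrogate pairs. The sample input lines are illustrative, not copied from the real data files.

```rust
/// Parse one line of a UCD-style property file. Returns the inclusive
/// codepoint range if the line assigns `property`, and None otherwise
/// (comments, blank lines, other properties, malformed input).
fn parse_line(line: &str, property: &str) -> Option<(u32, u32)> {
    // Drop the trailing comment, if any.
    let data = line.split('#').next()?.trim();
    if data.is_empty() {
        return None;
    }
    let mut fields = data.split(';').map(str::trim);
    let range = fields.next()?;
    if fields.next()? != property {
        return None;
    }
    // Either "LO..HI" or a single codepoint.
    let (lo, hi) = match range.split_once("..") {
        Some((lo, hi)) => (lo, hi),
        None => (range, range),
    };
    Some((
        u32::from_str_radix(lo, 16).ok()?,
        u32::from_str_radix(hi, 16).ok()?,
    ))
}

/// Render ranges as an alex character-set macro (assumed syntax).
fn alex_set(name: &str, ranges: &[(u32, u32)]) -> String {
    let items: Vec<String> = ranges
        .iter()
        .map(|&(lo, hi)| {
            if lo == hi {
                format!("\\x{:04X}", lo)
            } else {
                format!("\\x{:04X}-\\x{:04X}", lo, hi)
            }
        })
        .collect();
    format!("${} = [{}]", name, items.join(" "))
}

fn main() {
    // Illustrative input only; a real run would read the full data file.
    let sample = "\
0041..005A    ; XID_Start # illustrative range
00AA          ; XID_Start # illustrative single codepoint
10300..1031F  ; XID_Start # illustrative range above 0xFFFF
0030..0039    ; XID_Continue # skipped: different property
";
    let ranges: Vec<(u32, u32)> = sample
        .lines()
        .filter_map(|l| parse_line(l, "XID_Start"))
        .collect();
    // Prints: $xid_start = [\x0041-\x005A \x00AA \x10300-\x1031F]
    println!("{}", alex_set("xid_start", &ranges));
}
```

Because the ranges are expressed as plain codepoints, a character such as 0x1031D is covered without any reference to UTF-16, and moving to a newer Unicode version would only mean rerunning the generator on the new data file.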
Per the Rust Reference, Rust permits any identifier that meets the specification in Unicode Standard Annex #31 for Unicode version 15.0. For example, `rustc` accepts the following program:
`language-rust`, on the other hand, fails to lex this program:
My guess is that this part of the lexer needs to be updated to support Unicode 15.0.