`language-rust` lexer rejects Unicode symbols that `rustc` accepts

Some assorted notes that I took while investigating this:

language-rust's lexer implementation is based off the work in https://github.com/rust-lang/rust/pull/24620, which uses an ANTLR-based Unicode lexer.
It's hard to tell what version of Unicode this was based on, but this comment suggests it is around Unicode 4.0 or so.
language-rust copied the ANTLR-based lexer tables directly into its own lexer implementation. The thing is, I'm not entirely convinced that it did so correctly. This is because the ANTLR-based tables encode Unicode characters using UTF-16, but language-rust's lexer is generated from alex, which encodes Unicode characters using UTF-8. For sufficiently small codepoints, these encodings coincide, but for larger codepoints, they do not.
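To make the difference between the two encodings concrete, here is a small sketch using only the Rust standard library. The helper name `encodings` and the choice of sample characters are mine and purely illustrative; nothing here is taken from language-rust or rustc.

```rust
/// Returns the UTF-16 code units and the UTF-8 bytes for a single character.
/// (Hypothetical helper, only for illustration.)
fn encodings(c: char) -> (Vec<u16>, Vec<u8>) {
    let mut utf16_buf = [0u16; 2];
    let mut utf8_buf = [0u8; 4];
    let utf16 = c.encode_utf16(&mut utf16_buf).to_vec();
    let utf8 = c.encode_utf8(&mut utf8_buf).as_bytes().to_vec();
    (utf16, utf8)
}

fn main() {
    // An ASCII character, a character inside the Basic Multilingual Plane,
    // and a character above 0xFFFF.
    for c in ['a', '\u{00E9}', '\u{1031D}'] {
        let (utf16, utf8) = encodings(c);
        println!("U+{:04X}: UTF-16 {:04X?}, UTF-8 {:02X?}", c as u32, utf16, utf8);
    }
}
```

For `'a'` both encodings hold the single value 0x61. For U+1031D, UTF-16 needs the two code units 0xD800 and 0xDF1D, while UTF-8 uses the four bytes 0xF0, 0x90, 0x8C, 0x9D, so a table written in terms of UTF-16 code units cannot be reused as-is by a lexer that works on UTF-8.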

As a specific example where this goes wrong, consider the 𐌝 character, whose codepoint is 0x1031D. In UTF-16, this is encoded with the surrogate pair (0xD800, 0xDF1D), which should be covered by this line in language-rust's lexer. Despite this, language-rust is unable to lex this program:
```
// test.rs
fn main() {
  let 𐌝 = ();
  𐌝
}
```
```
$ runghc Main.hs 
Left (parse failure at 3:9 (lexical error))
```
As such, I think language-rust's lexer is broken for any Unicode character that requires a surrogate pair to encode in UTF-16, that is, any character whose codepoint exceeds 0xFFFF.
Modern versions of rustc no longer use the ANTLR-based lexer linked above, but instead use a completely different lexer implementation based on these tables (which, in turn, are derived from the data on the official Unicode website). Notably, these tables are not UTF-16–encoded, so they would be much easier to translate to an alex-based lexer.

I propose that we rewrite language-rust's lexer to be in terms of the data from the Unicode website, similarly to how rustc's modern lexer works. The rustc lexer implementation generates its tables using this script, so perhaps we can adapt this script to generate alex code. Scripting this would also make it much more straightforward to upgrade Unicode versions in the future. (Currently, the script uses Unicode 15.1.0.)
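As a rough sketch of what such a generator could look like, the following Rust program parses lines in the `range ; property # comment` format used by the Unicode Character Database's derived property files and prints an alex character-set macro. Several things here are assumptions on my part: the `SAMPLE` data is a hand-written excerpt rather than the real file, the function names `parse_ranges` and `alex_macro` are hypothetical, the choice of the `XID_Start` property is a guess at what the lexer would need, and the emitted `\x...` escape syntax should be checked against the alex documentation before relying on it.

```rust
/// Hand-written sample in the style of a Unicode derived property file.
/// (Illustrative only; a real generator would read the downloaded data file.)
const SAMPLE: &str = "\
# sample data
0041..005A    ; XID_Start # uppercase Latin letters
00AA          ; XID_Start # a single codepoint
10300..1031F  ; XID_Start # a range above 0xFFFF
0030..0039    ; XID_Continue # digits
";

/// Collects the inclusive codepoint ranges that carry the given property.
fn parse_ranges(data: &str, property: &str) -> Vec<(u32, u32)> {
    let mut ranges = Vec::new();
    for line in data.lines() {
        // Drop the trailing comment and skip blank or comment-only lines.
        let line = line.split('#').next().unwrap_or("").trim();
        if line.is_empty() {
            continue;
        }
        let mut fields = line.split(';').map(str::trim);
        if let (Some(cps), Some(prop)) = (fields.next(), fields.next()) {
            if prop != property {
                continue;
            }
            // A field is either a single codepoint or `lo..hi`.
            let (lo, hi) = match cps.split_once("..") {
                Some((lo, hi)) => (lo, hi),
                None => (cps, cps),
            };
            if let (Ok(lo), Ok(hi)) = (u32::from_str_radix(lo, 16), u32::from_str_radix(hi, 16)) {
                ranges.push((lo, hi));
            }
        }
    }
    ranges
}

/// Renders the ranges as an alex character-set macro.
/// The escape syntax is an assumption, not verified against alex.
fn alex_macro(name: &str, ranges: &[(u32, u32)]) -> String {
    let parts: Vec<String> = ranges
        .iter()
        .map(|&(lo, hi)| {
            if lo == hi {
                format!("\\x{:04X}", lo)
            } else {
                format!("\\x{:04X}-\\x{:04X}", lo, hi)
            }
        })
        .collect();
    format!("${} = [{}]", name, parts.join(" "))
}

fn main() {
    let ranges = parse_ranges(SAMPLE, "XID_Start");
    println!("{}", alex_macro("xid_start", &ranges));
}
```

Because the ranges are expressed as plain codepoints, no UTF-16 surrogate arithmetic is involved, and upgrading to a newer Unicode version would only mean rerunning the generator on the new data file.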
