Closed SeedOfOnan closed 2 years ago
I believe the code is correct as written. Would you be able to give a specific example of a codepoint ('\u{?????}') where either of the functions incorrectly returns true?
Following my formula, 0xE03FF for either function.
Both functions return false for that codepoint.
Sorry, I edited my post. Did you catch that?
Yes.
I'm probably confused about something, lol. Okay, now I see my mistake: TRIE_CONTINUE.0.get(1793) is out of range and returns None, so .unwrap_or(&0) yields a chunk value of 0 and everything's fine. Sorry to bother you. I had incorrectly imagined unwrap_or() supplying an index of 0 into TRIE_CONTINUE, which would yield 4. Derp!
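For anyone else who trips over the same thing, here is a minimal sketch of the difference between a fallback value and a fallback index. The table TRIE and the helper chunk_for are hypothetical stand-ins made up for illustration, not the crate's real data; only the slice get() and Option unwrap_or() behavior is standard Rust.

```rust
// Hypothetical stand-in for a trie index table (NOT the crate's real data).
// Each entry is a chunk number; entry 0 is deliberately nonzero.
const TRIE: [u8; 4] = [4, 0, 7, 2];

// Look up the chunk number for a table index.
// `get` returns None when the index is out of range, and `unwrap_or(&0)`
// then supplies the *value* 0 directly. It does not re-index the table at 0.
fn chunk_for(index: usize) -> u8 {
    *TRIE.get(index).unwrap_or(&0)
}

fn main() {
    // In range: the stored chunk number is returned.
    assert_eq!(chunk_for(0), 4);
    assert_eq!(chunk_for(2), 7);

    // Out of range: the fallback value 0 is returned, not TRIE[0] (which is 4).
    assert_eq!(chunk_for(1793), 0);

    println!("out-of-range chunk = {}", chunk_for(1793));
}
```

So the get() call already acts as the range check, and the default of 0 is used as the chunk number itself.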
In reviewing the code, I believe that is_xid_start() returns true for Unicode code points above 0x32400 whose lowest bits range from 0x100-0x1FF (plus various ones from 0x0-0x100), and similarly for is_xid_continue() above 0xE0200. The problem is that for code points that high, the variable chunk is 4 (respectively 8), giving offsets into the LEAF table other than 0 through 0x1F, the only chunk in LEAF that is all false for 512 bits straight. A simple fix would be to change ".unwrap_or(&0)" to default to 0x11, since (as it happens) both TRIE_START[0x11] and TRIE_CONTINUE[0x11] yield zero. But could that change after running generate? Alternatively, range checking the variable ch (or chunk) would also fix it (maybe at the cost of performance, which I expect you're aiming to avoid).