dtolnay / unicode-ident

Determine whether characters have the XID_Start or XID_Continue properties
Apache License 2.0
76 stars 12 forks source link

Both functions can incorrectly return true #18

Closed SeedOfOnan closed 2 years ago

SeedOfOnan commented 2 years ago

In reviewing the code, I believe that is_xid_start() returns true for Unicode points above 0x32400 that have their lowest bits ranging from 0x100-0x1FF (plus various ones from 0x0-0x100), and similarly for is_xid_continue() above 0xE0200. The problem is that for unicode points that high, the variable chunk is 4 (respectively 8) giving offsets into the LEAF table other than 0 thru 0x1F, the only chunk in LEAF that is all false for 512 bits straight. A simple fix would be to change ".unwrap_or(&0)" to default to 0x11 where (it happens that) both TRIE_START[0x11] and TRIE_CONTINUE[0x11] yield zero. But after running generate, that could change? Alternatively, range checking the variable ch (or chunk) would also fix it (maybe at the cost of performance, which I expect you're aiming to avoid).

dtolnay commented 2 years ago

I believe the code is correct as written. Would you be able to give a specific example of a codepoint ('\u{?????}') where either of the functions incorrectly returns true?

SeedOfOnan commented 2 years ago

Following my formula, 0xE03FF for either function.

dtolnay commented 2 years ago

Both functions return false for that codepoint.

SeedOfOnan commented 2 years ago

Sorry, I edited my post. Did you catch that?

dtolnay commented 2 years ago

Yes.

SeedOfOnan commented 2 years ago

I'm probably confused about something, lol. Okay, now I see my mistake: TRIE_CONTINUE.0.get(1793).unwrap_or(&0) is (None.unwrap_or(&0)) = 0, so everything's fine. Sorry to bother you. I incorrectly imagined the unwrap_or() providing an index of 0 into TRIE_CONTINUE yielding 4. Derp!