corvus-dotnet / Corvus.Globbing

A zero allocation globbing library
Apache License 2.0

Does it matter that the code all works with code units, not code points? #6

Open idg10 opened 2 years ago

idg10 commented 2 years ago

The code always progresses through text (in the glob pattern, and also in the input) one `char` at a time, with no regard for higher-level units. For example, codepoints outside of the basic multilingual plane (BMP) are encoded as pairs of `char` values, but instead of treating each pair as a single character, the code handles the two halves of the surrogate pair as separate characters.

It's possible this doesn't matter, but it would be good to add some tests for cases where either the pattern, the input, or both contain non-BMP characters.
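
To make the surrogate-pair concern concrete, here is a minimal sketch. It is written in Java only because Java's `char`, like .NET's, is a UTF-16 code unit; `CodeUnitDemo` and `matchByCodeUnit` are hypothetical names for illustration and are not part of Corvus.Globbing. It assumes a toy matcher that understands only literals and `?`, and shows that a single non-BMP character needs two `?` wildcards when matching proceeds one code unit at a time.

```java
// Hypothetical illustration; this is NOT Corvus.Globbing code.
final class CodeUnitDemo {
    // Toy matcher for patterns made only of literals and '?',
    // advancing one UTF-16 code unit (char) at a time.
    static boolean matchByCodeUnit(String pattern, String input) {
        if (pattern.length() != input.length()) {
            return false;
        }
        for (int i = 0; i < pattern.length(); i++) {
            char p = pattern.charAt(i);
            if (p != '?' && p != input.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String emoji = "\uD83D\uDE00"; // U+1F600, outside the BMP
        System.out.println(emoji.length());                          // 2 code units
        System.out.println(emoji.codePointCount(0, emoji.length())); // 1 code point
        System.out.println(matchByCodeUnit("a?b", "a" + emoji + "b"));  // false
        System.out.println(matchByCodeUnit("a??b", "a" + emoji + "b")); // true
    }
}
```

Whether `a?b` should match here is exactly the question this issue raises; the sketch only shows what code-unit processing does, not what the library ought to do.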

(Another consideration is where multiple codepoints combine to form a single logical form, e.g., combining diacritics. This raises the question of whether you want to treat `"caf\u00e9"` and `"cafe\u0301"` as equal: both represent `café`, one with the Unicode codepoint that pre-combines `e` with an acute accent, and the other using an ordinary `e` with a combining accent. The answer, most likely, is that we do not want to support such things, but it would be good to be explicit, and possibly even to have tests that call this out.)
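
The two spellings of `café` can be compared directly. The sketch below is again in Java as a stand-in for .NET, with hypothetical names `NormalizationDemo` and `equalAfterNfc`. It shows that the strings are unequal under plain ordinal comparison and become equal only after Unicode normalization (NFC here). It is not a proposal that the library should normalize.

```java
import java.text.Normalizer;

// Hypothetical illustration; this is NOT Corvus.Globbing code.
final class NormalizationDemo {
    // Compares two strings after converting both to NFC (canonical composition).
    static boolean equalAfterNfc(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        String precomposed = "caf\u00e9"; // e-acute as a single codepoint
        String combining = "cafe\u0301";  // plain 'e' followed by a combining acute accent

        System.out.println(precomposed.equals(combining));         // false
        System.out.println(equalAfterNfc(precomposed, combining)); // true
    }
}
```

A test that pins down the first result (the two forms do not match each other) would make the "we do not support this" position explicit.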

mwadams commented 2 years ago

This is interesting, because we do support this in JSON schema land (because it is part of the optional Unicode support) but it would be slower (you'd have to use the codepoint iterator thing).
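
For comparison, a codepoint-aware walk might look roughly like the following. This is a sketch under assumptions, not the JSON schema implementation mentioned above: it is in Java as a stand-in for .NET, the names `CodePointDemo` and `matchByCodePoint` are hypothetical, and the toy matcher understands only literals and `?`. Each step decodes a whole codepoint and advances by one or two code units, which is the additional per-character work that would make this approach slower.

```java
// Hypothetical illustration; this is NOT Corvus.Globbing code.
final class CodePointDemo {
    // Toy matcher for patterns made only of literals and '?',
    // advancing one Unicode codepoint at a time.
    static boolean matchByCodePoint(String pattern, String input) {
        int p = 0; // code-unit index into pattern
        int i = 0; // code-unit index into input
        while (p < pattern.length() && i < input.length()) {
            int pc = pattern.codePointAt(p); // decodes a surrogate pair if present
            int ic = input.codePointAt(i);
            if (pc != '?' && pc != ic) {
                return false;
            }
            p += Character.charCount(pc); // 1 for BMP, 2 for non-BMP
            i += Character.charCount(ic);
        }
        // Both strings must be fully consumed for a match.
        return p == pattern.length() && i == input.length();
    }

    public static void main(String[] args) {
        String input = "a\uD83D\uDE00b"; // 'a', U+1F600, 'b'
        System.out.println(matchByCodePoint("a?b", input));  // true
        System.out.println(matchByCodePoint("a??b", input)); // false
    }
}
```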

Specs that demonstrate that it does not work (i.e. pass on failure) in their own section would be good.