Does it matter that the code all works with code units, not code points?

The code always progresses through text (both the glob pattern and the input) one `char` at a time, with no regard for higher-level units. For example, code points outside the Basic Multilingual Plane (BMP) are encoded as pairs of `char` values, but instead of treating each pair as a single character, the code handles the two halves of the surrogate pair as separate characters.
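To make the distinction concrete, here is a minimal sketch of how one non-BMP code point looks to a `char`-by-`char` loop. The class and method names are mine, purely for illustration, and `EnumerateRunes` assumes .NET Core 3.0 or later.

```csharp
using System;
using System.Linq;
using System.Text;

public static class CodeUnitDemo
{
    // U+1F600 lies outside the BMP, so UTF-16 stores it as a surrogate pair.
    public const string Emoji = "\U0001F600";

    // Number of UTF-16 code units: this is what a char-by-char loop visits.
    public static int CountCodeUnits(string s) => s.Length;

    // Number of Unicode scalar values, via System.Text.Rune.
    public static int CountCodePoints(string s) => s.EnumerateRunes().Count();

    // True when the string starts with a high surrogate followed by a low surrogate.
    public static bool StartsWithSurrogatePair(string s) =>
        s.Length >= 2 && char.IsHighSurrogate(s[0]) && char.IsLowSurrogate(s[1]);
}
```

For `CodeUnitDemo.Emoji`, `CountCodeUnits` reports 2 while `CountCodePoints` reports 1, which is exactly the gap this issue is asking about.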

It's possible this doesn't matter, but it would be good to add some tests for cases where either the pattern, the input, or both contain non-BMP characters.
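As a rough idea of what such tests might probe, the sketch below uses a deliberately naive matcher that I made up for illustration. It is not Corvus.Globbing's implementation or API; it supports only literals and `?`, and it advances one `char` at a time.

```csharp
using System;

public static class NaiveGlobDemo
{
    // Hypothetical stand-in, NOT Corvus.Globbing: literals and '?' only,
    // comparing one UTF-16 code unit at a time.
    public static bool MatchPerChar(string pattern, string input)
    {
        if (pattern.Length != input.Length)
        {
            return false;
        }

        for (int i = 0; i < pattern.Length; i++)
        {
            // '?' accepts any single code unit; anything else must match exactly.
            if (pattern[i] != '?' && pattern[i] != input[i])
            {
                return false;
            }
        }

        return true;
    }
}
```

With this toy matcher, a literal non-BMP character in both pattern and input still matches, because both surrogate halves compare equal. The wildcard is where it shows: `MatchPerChar("a?b", "a\U0001F600b")` is false, and it takes `"a??b"` to match. Whether the real code behaves the same way is what the tests would establish.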

(Another consideration is where multiple code points combine to form a single logical character, e.g., combining diacritics. These raise the question of whether you want to treat `"caf\u00e9"` and `"cafe\u0301"` as equal. Both represent `café`: one uses the Unicode code point that pre-combines `e` with an acute accent, and the other uses an ordinary `e` followed by a combining accent. The answer, most likely, is that we do not want to support such things, but it would be good to be explicit, and possibly even to have tests that call this out.)
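For reference, a small sketch of how the two spellings compare in .NET. The class and member names are illustrative only; `string.Normalize` and `NormalizationForm.FormC` are the standard BCL calls.

```csharp
using System;
using System.Text;

public static class NormalizationDemo
{
    public const string Precomposed = "caf\u00e9";  // U+00E9, e with acute as one code point
    public const string Decomposed = "cafe\u0301";  // plain e followed by U+0301 combining acute

    // Code-unit-by-code-unit comparison, which is how a char-based matcher sees the text.
    public static bool OrdinalEquals(string a, string b) =>
        string.Equals(a, b, StringComparison.Ordinal);

    // Comparison after normalizing both sides to Normalization Form C.
    public static bool EqualAfterNfc(string a, string b) =>
        string.Equals(
            a.Normalize(NormalizationForm.FormC),
            b.Normalize(NormalizationForm.FormC),
            StringComparison.Ordinal);
}
```

`OrdinalEquals(Precomposed, Decomposed)` is false (the strings are 4 and 5 `char` values long), while `EqualAfterNfc` is true. If the decision is not to support this, a test asserting that the two forms do not match would document it.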

corvus-dotnet / Corvus.Globbing