Better Unicode Support - Githubissues

cyanskies commented 1 year ago

We need a way to count stream input as Unicode glyphs rather than chars. Until this is sorted, automatic line breaks and error messages will produce unexpected output for lines that contain multi-byte Unicode glyphs.

Ideally I want to avoid bringing in a large Unicode library, but this is probably a non-trivial challenge.

["expect_errorʤ"]
["expect_errorʤ"]

Gives the following incorrect output:

Attepted to reopen table: "expect_errorʤ", but this table has already been defined.
3>["expect_errorʤ"]
   ^~~~~~~~~~~~~~~^

Instead of:

Attepted to reopen table: "expect_errorʤ", but this table has already been defined.
3>["expect_errorʤ"]
   ^~~~~~~~~~~~~~^
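
For illustration, here is a minimal sketch of how the underline width could be derived from code points instead of bytes. This is not the library's actual code: `count_code_points` and `make_underline` are hypothetical names, and the sketch assumes the input is already valid UTF-8. It counts every byte that is not a UTF-8 continuation byte (`10xxxxxx`), which yields the number of code points.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper (not part of another-toml-cpp).
// Counts Unicode code points in UTF-8 text by skipping continuation
// bytes, i.e. bytes of the form 10xxxxxx. Assumes the input is valid UTF-8.
std::size_t count_code_points(std::string_view utf8)
{
    std::size_t count = 0;
    for (const unsigned char byte : utf8)
    {
        if ((byte & 0xC0) != 0x80) // lead byte or ASCII byte
            ++count;
    }
    return count;
}

// Hypothetical helper: builds a "^~~~^" marker that is `width` columns wide.
std::string make_underline(std::size_t width)
{
    if (width == 0)
        return {};
    if (width == 1)
        return "^";
    return "^" + std::string(width - 2, '~') + "^";
}
```

With the example above, the underlined span `"expect_errorʤ"]` is 17 bytes but only 16 code points, because ʤ (U+02A4) encodes as two bytes in UTF-8. That accounts for the one extra `~` in the incorrect output. Note that code points are still not the same as displayed glyphs: combining marks and double-width characters would need grapheme cluster and display width handling on top of this.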

cyanskies commented 1 year ago

Another issue is Unicode string equality.

We're currently comparing the encoded bytes, but some glyphs can be represented by more than one sequence of code points, so plain byte equality isn't sufficient.

EDIT: this can be solved through Unicode normalisation, see: https://unicode.org/reports/tr15/. I'll have to look at the algorithm to see whether it's viable to implement in this library.
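
To make the equality problem concrete, here is a small sketch showing two encodings of the same glyph that compare unequal byte for byte. The names `precomposed`, `decomposed` and `bytes_equal` are made up for this example and do not come from the library.

```cpp
#include <string_view>

// "é" as the single precomposed code point U+00E9 (UTF-8: C3 A9).
constexpr std::string_view precomposed = "\xC3\xA9";

// "é" as "e" followed by U+0301 COMBINING ACUTE ACCENT (UTF-8: 65 CC 81).
constexpr std::string_view decomposed = "e\xCC\x81";

// Hypothetical stand-in for a plain byte-wise key comparison.
bool bytes_equal(std::string_view a, std::string_view b)
{
    return a == b;
}
```

Both strings render as the same glyph, yet `bytes_equal(precomposed, decomposed)` is false, so two table names that look identical could be treated as different keys. Normalising both sides to the same form before comparing (for example NFC, which composes `e` + U+0301 into U+00E9) removes that difference.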

cyanskies commented 1 year ago

I incorporated uni-algo to provide this capability.

cyanskies / another-toml-cpp

Better Unicode Support #9