UTF-8 support - Githubissues

LilithSilver commented 2 years ago

Currently, there is a bug when parsing UTF-8 or extended ASCII characters: the C function isspace() doesn't accept negative char values (passing one is undefined behavior). The simple fix is to cast the value to unsigned char, which is safe because no ASCII whitespace character can have a negative char value anyway.
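For illustration, here is a minimal sketch of the cast described above. The helper names `is_space_safe` and `count_spaces` are made up for this example and are not taken from inkcpp's source; the sketch also assumes the default "C" locale, where only the ASCII whitespace characters count as spaces.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not inkcpp's actual code): whitespace test that is
// safe for every possible char value, including bytes >= 0x80.
inline bool is_space_safe(char c) {
    // std::isspace requires a value representable as unsigned char (or EOF).
    // On platforms where char is signed, UTF-8 lead/continuation bytes are
    // negative, so the cast is needed to avoid undefined behavior.
    return std::isspace(static_cast<unsigned char>(c)) != 0;
}

// Hypothetical helper: counts whitespace bytes in a UTF-8 encoded string.
inline std::size_t count_spaces(const std::string& text) {
    std::size_t n = 0;
    for (char c : text) {
        if (is_space_safe(c)) ++n;
    }
    return n;
}

// Small self-check: "h", U+00E9 (two bytes), " ", "llo", "\n".
inline void check_space_examples() {
    assert(count_spaces("h\xC3\xA9 llo\n") == 2);
}
```

Calling `is_space_safe('\xC3')` returns false instead of invoking undefined behavior, so multi-byte sequences pass through a whitespace-skipping loop untouched.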

This PR also adds a test based on a modified version of Markus Kuhn's UTF-8 Demo Page, to ensure that it can parse a variety of characters. The demo is under the CC BY license, which allows unrestricted use with attribution; the attribution is at the top of the file, so we should be good there.

JBenda commented 2 years ago

Oh, is this really the only thing that breaks with UTF-8? Quite handy.

I will try it myself this weekend, but it looks promising.

Thanks for the input

LilithSilver commented 2 years ago

Yep, I was surprised as well, but it makes sense considering that UTF-8 was designed for full ASCII compatibility!
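To make the compatibility point concrete: in UTF-8, every byte of a multi-byte sequence has its high bit set (value 0x80 or above), so an ASCII byte such as a space can never appear inside an encoded non-ASCII character. The following is a small sketch of that property; `encode_utf8` and `all_bytes_high` are illustrative names, not part of inkcpp, and the encoder deliberately skips validation of surrogates and out-of-range values.

```cpp
#include <cstdint>
#include <string>

// Illustrative encoder: turns one Unicode code point into UTF-8 bytes.
// Assumes cp is a valid scalar value (no surrogate or range checks).
inline std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {
        // ASCII range: encoded as the same single byte.
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Returns true if no byte in the sequence falls in the ASCII range,
// i.e. the sequence cannot be mistaken for ASCII whitespace or punctuation.
inline bool all_bytes_high(const std::string& bytes) {
    for (char c : bytes) {
        if (static_cast<unsigned char>(c) < 0x80) return false;
    }
    return true;
}
```

For any code point at or above U+0080, `all_bytes_high(encode_utf8(cp))` holds, which is why a parser that only looks for ASCII delimiters keeps working on UTF-8 input once the negative-char issue is out of the way.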

Note that if you want the UTF-8 to display properly, you'll have to interpret the byte data as UTF-8. Visual Studio, for example, doesn't support UTF-8 and outputs the strings as garbled extended ASCII. But the test confirms that the byte data produced by ink is indeed correct.

JBenda / inkcpp

UTF-8 support #52