GreycLab / gmic

GREYC's Magic for Image Computing: A Full-Featured Open-Source Framework for Image Processing

Multiple numbers within chars issue. #26

Closed Reptorian1125 closed 7 months ago

Reptorian1125 commented 9 months ago

As you noticed on the forum, I have figured out how to display characters correctly. However, the custom commands I have made don't work well with them, because this example:

e {`'é'`}

would print multiple numbers. My string_permutation* family of commands involves sorting characters, so the fact that this prints multiple numbers leads to complications: my code doesn't account for that, and I don't see an easy way to account for it.

dtschump commented 9 months ago

That's actually a common issue with UTF-8 strings (or, more generally, strings with a larger character set than plain ASCII): it's not that easy to find the number of characters, as a single character can span multiple bytes. If you want your command to work with UTF-8 strings, you'll probably have to write some additional custom functions to determine how many bytes each character of your string takes. It definitely requires extra work. See https://en.wikipedia.org/wiki/UTF-8 for the rules you have to take into account.
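To make the "multiple numbers" concrete, here is a minimal sketch in Python (not G'MIC) of what the UTF-8 encoding of é looks like. It assumes the precomposed form U+00E9, written as an escape so the result does not depend on how the script file itself is saved.

```python
# U+00E9 is the precomposed "é"; the escape avoids any source-encoding ambiguity.
text = "\u00e9"

# One character, but two bytes once encoded as UTF-8.
byte_values = list(text.encode("utf-8"))

print(len(text), byte_values)  # 1 [195, 169]
```

A tool that lists a string byte by byte would therefore show two values for this single character, which is presumably what the G'MIC example above runs into.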

Reptorian1125 commented 9 months ago

So, how do I find the number of characters and turn character indexes back into the correct string? The only solution I can think of is a naive search of the array. There's also the complication of not being able to verify whether a file is in UTF-8, or whether system-wide UTF-8 is enabled.

dtschump commented 9 months ago

You basically have to iterate over all the elements of the string and determine, for each character, whether it has a multi-byte representation (a character can take 1 to 4 bytes):

image

This basically requires testing the most significant bits of the first byte of each character. Not that hard, but not done already in G'MIC.
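The test on the first byte can be sketched as follows, again in Python rather than G'MIC. The function names `utf8_sequence_length` and `split_utf8` are hypothetical, and the sketch assumes well-formed input: it does not check that continuation bytes are present or valid.

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return how many bytes the UTF-8 sequence starting at lead_byte spans."""
    if lead_byte & 0x80 == 0x00:  # 0xxxxxxx: plain ASCII
        return 1
    if lead_byte & 0xE0 == 0xC0:  # 110xxxxx
        return 2
    if lead_byte & 0xF0 == 0xE0:  # 1110xxxx
        return 3
    if lead_byte & 0xF8 == 0xF0:  # 11110xxx
        return 4
    # 10xxxxxx is a continuation byte; 11111xxx never occurs in UTF-8.
    raise ValueError(f"not a valid UTF-8 lead byte: {lead_byte}")


def split_utf8(byte_values):
    """Group a flat list of byte values into one list per character."""
    groups = []
    i = 0
    while i < len(byte_values):
        n = utf8_sequence_length(byte_values[i])
        groups.append(byte_values[i:i + n])
        i += n
    return groups


# Bytes of "aé€": one 1-byte, one 2-byte, and one 3-byte character.
print(split_utf8([97, 195, 169, 226, 130, 172]))
# [[97], [195, 169], [226, 130, 172]]
```

The number of characters is then simply the number of groups, and each group can be reordered as a unit without splitting a character in half.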

Reptorian1125 commented 9 months ago

Okay, so my plan is to bitshift to get at the most significant bits. In decimal, these numbers can be 8, 12, 14, or 15. That tells us how many bytes each character spans. Then I convert the non-bitshifted numbers into decimal, which gives the indexes that represent these characters, so that they can be used for UTF-8 string processing in G'MIC. It sounds incredibly inefficient, and I don't think this can be hashed easily.
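A rough sketch of the bitshift idea, in Python rather than G'MIC. Under the UTF-8 rules, shifting a byte right by four bits gives 0 to 7 for a single-byte character, 8 to 11 for a continuation byte, 12 or 13 for a two-byte lead, 14 for a three-byte lead, and 15 for a four-byte lead. The names `LENGTH_BY_HIGH_NIBBLE` and `decode_code_points` are hypothetical, and the sketch assumes well-formed input (it does not reject truncated sequences or the invalid bytes 248 to 255).

```python
# Index: byte >> 4. Value: sequence length, with 0 marking a continuation byte.
LENGTH_BY_HIGH_NIBBLE = [1] * 8 + [0] * 4 + [2, 2, 3, 4]


def decode_code_points(byte_values):
    """Turn a flat list of UTF-8 byte values into one integer per character."""
    code_points = []
    i = 0
    while i < len(byte_values):
        lead = byte_values[i]
        n = LENGTH_BY_HIGH_NIBBLE[lead >> 4]
        if n == 0:
            raise ValueError(f"unexpected continuation byte at index {i}")
        if n == 1:
            cp = lead
        else:
            # Drop the length marker bits of the lead byte, keep its payload.
            cp = lead & (0xFF >> (n + 1))
            # Each continuation byte contributes its low 6 bits.
            for b in byte_values[i + 1:i + n]:
                cp = (cp << 6) | (b & 0x3F)
        code_points.append(cp)
        i += n
    return code_points


print(decode_code_points([97, 195, 169, 226, 130, 172]))  # [97, 233, 8364]
```

Each character becomes a single integer (its Unicode code point), so the result can be sorted or used as a lookup key like any other list of numbers. The work is a single pass over the bytes.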

Reptorian1125 commented 9 months ago

EDIT: Made a thread on this too - https://discuss.pixls.us/t/utf-8-supporting-tools-in-gmic/40523

I also decided to show my current work to address this limitation.