arp242 / uni

Query the Unicode database from the commandline, with good support for emojis
MIT License

Private use characters are output as replacement characters #45

Closed jwodder closed 9 months ago

jwodder commented 9 months ago

When I use uni to query information on a private use character — such as the Apple logo, defined as U+F8FF on Apple computers — uni actually outputs U+FFFD, the replacement character. For example:

$ uni identify 
     CPoint  Dec    UTF8        HTML       Name (Cat)
'�'  U+F8FF  63743  ef a3 bf    &#xf8ff;   <Private Use, Last> (Private_Use)

The command above ends with U+F8FF, but the character output by uni is U+FFFD. Evidence:

$ uni identify  -c -f '%(char)' | hexdump -C
00000000  ef bf bd 0a                                       |....|
00000004

ef bf bd is the UTF-8 encoding for U+FFFD, not for U+F8FF.
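
As a cross-check that doesn't depend on uni, the two encodings can be computed directly. A minimal Python sketch (needs Python 3.8+ for the separator argument to `bytes.hex`; the helper name `utf8_hex` is made up for illustration):

```python
def utf8_hex(codepoint: int) -> str:
    """Return the UTF-8 encoding of a code point as space-separated hex bytes."""
    return chr(codepoint).encode("utf-8").hex(" ")


print(utf8_hex(0xFFFD))  # ef bf bd  (REPLACEMENT CHARACTER)
print(utf8_hex(0xF8FF))  # ef a3 bf  (last Private Use code point in the BMP)
```

The second value matches the UTF8 column that uni itself prints for U+F8FF, so only the %(char) output is affected.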

arp242 commented 9 months ago

It also shouldn't print <Private Use, Last>, but just <Private Use>. Need to look into that too (just a note for myself).


By default uni avoids printing unprintable stuff to the terminal, but you can use -r or -raw to override that:

%  uni -q i $'\uf8ff' -f '%(char)' | uni i -r
     CPoint  Dec    UTF8        HTML       Name (Cat)
'�'  U+FFFD  65533  ef bf bd    &#xfffd;   REPLACEMENT CHARACTER (Other_Symbol)

%  uni -rq i $'\uf8ff' -f '%(char)' | uni i -r
     CPoint  Dec    UTF8        HTML       Name (Cat)
''  U+F8FF  63743  ef a3 bf    &#xf8ff;   <Private Use, Last> (Private_Use)

This is to prevent printing control characters, combining characters, and other stuff that can mess with the output. For known values we do something reasonable like displaying a graphic, but for unknown values that's a bit harder. If the category is not Control, Letter, Mark, Number, Punctuation, or Symbol it will print U+FFFD as "I don't know what to do with this".
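
Roughly, the category check described above could be sketched like this in Python. This is not uni's actual code: `display_char` and its `raw` parameter are hypothetical names, and mapping C0 controls to the Control Pictures block is an assumption standing in for "displaying a graphic".

```python
import unicodedata

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def display_char(ch: str, raw: bool = False) -> str:
    """Return what to show for a single character (hypothetical helper)."""
    if raw:
        return ch  # like -raw: emit the character untouched
    cat = unicodedata.category(ch)  # two-letter category, e.g. "Lu", "Cc", "Cf", "Co"
    if cat == "Cc":
        cp = ord(ch)
        # Assumption: show C0 controls and DEL as Control Pictures (U+2400 block).
        if cp < 0x20:
            return chr(0x2400 + cp)
        if cp == 0x7F:
            return "\u2421"
        return ch  # simplification: other controls are passed through in this sketch
    if cat[0] in "LMNPS":
        return ch  # Letter, Mark, Number, Punctuation, Symbol
    return REPLACEMENT  # everything else, including Private Use ("Co")


print(display_char("\uf8ff") == REPLACEMENT)  # True: U+F8FF is category "Co"
print(display_char("\uf8ff", raw=True) == "\uf8ff")  # True
```

Private Use falls through to the last line simply because its category (Co) is not on the allow list, which is the behaviour reported in this issue.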

On one hand this makes sense: "better safe than sorry, and override with -raw", and %(char) is meant for visual display first, rather than outputting the exact byte sequence.

On the other hand it's also confusing. I had to look this up myself and I forgot it worked like this. But I'm also not entirely sure how to do it better. In hindsight it would have been better to have %(display_char) and %(char) as separate format specifiers or something, but the -r flag predates the customizable output, and changing it now would break people's scripts.

Private Use is never safe to print, as we don't know what any of it is. I suppose we could support some common private use characters as an exception (that Apple icon actually works on my Linux machine, in my terminal anyway, but not Firefox), but that will probably make things more confusing in other cases.

In general, I'm leaning towards "leave it as-is, but document this better".

jwodder commented 9 months ago

> Private Use is never safe to print, as we don't know what any of it is.

I would expect (and I look forward to being corrected on this) that the various meanings assigned to private use characters have all been as "normal" graphical characters. They're not control-control characters like LF, they're not going to change your terminal's background color, they're just going to display either as some medieval ligature or as your font's "unknown character" box.

arp242 commented 9 months ago

Private Use can't have control characters that have very special meanings such as carriage return, left-to-right control stuff, and whatnot, but I think the font can do $anything fonts can do, right?

Specifically, things like combining characters, zero-width characters, ligatures, and things like that.

Maybe it's not really a practical issue; but if things do go wrong then there's no way to override it (there's no opposite of the -raw flag, and I'd rather not add that just for this).

arp242 commented 9 months ago

I'm also not sure what else gets covered in this check; should probably look at that. The intent was "Other unprintable characters use the replacement character � (U+FFFD)". This covers things such as U+200E LEFT-TO-RIGHT MARK and such, but maybe we can print those better and then remove this. It was never really my intent to cover Private Use with this as such.

arp242 commented 9 months ago

I had dinner and a think, and I'll change it to always print Private Use characters as-is. Looking around at how people are using the Private Use planes, that should be safe, and even when it's not, I think it's an edge case people will rarely encounter. Can always change it later if people experience problems.

Will use open box (␣, U+2423) for every other control character. U+FFFD is too confusing and looks like a bug.
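
A rough Python sketch of what that planned behaviour might look like, not the actual change: `display_char_v2` is a hypothetical name, and passing the Control category through unchanged is a simplification, since how those are drawn isn't spelled out here.

```python
import unicodedata

OPEN_BOX = "\u2423"  # U+2423 OPEN BOX


def display_char_v2(ch: str) -> str:
    """Sketch of the revised rule: Private Use as-is, open box for the rest."""
    cat = unicodedata.category(ch)
    if cat == "Co":
        return ch  # Private Use: always printed as-is
    if cat == "Cc" or cat[0] in "LMNPS":
        return ch  # simplification: the real output may substitute a graphic
    return OPEN_BOX  # e.g. U+200E LEFT-TO-RIGHT MARK (category "Cf")


print(display_char_v2("\uf8ff") == "\uf8ff")  # True
print(display_char_v2("\u200e") == OPEN_BOX)  # True
```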