Closed by jwodder 9 months ago
It also shouldn't print <Private Use, Last>, but just <Private Use>. Need to look into that too (just a note for myself).
By default uni avoids printing unprintable stuff to the terminal, but you can use -r or -raw to override that:
% uni -q i $'\uf8ff' -f '%(char)' | uni i -r
CPoint Dec UTF8 HTML Name (Cat)
'�' U+FFFD 65533 ef bf bd � REPLACEMENT CHARACTER (Other_Symbol)
% uni -rq i $'\uf8ff' -f '%(char)' | uni i -r
CPoint Dec UTF8 HTML Name (Cat)
'' U+F8FF 63743 ef a3 bf  <Private Use, Last> (Private_Use)
This is to prevent printing control characters, combining characters, and other stuff that can mess with the output. For known values we do something reasonable like displaying a graphic, but for unknown values that's a bit harder. If the category is not Control, Letter, Mark, Number, Punctuation, or Symbol it will print U+FFFD as "I don't know what to do with this".
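To make that rule concrete, here is a minimal sketch in Python. It is not uni's actual code. The function name `display_char`, the `raw` parameter, and the use of Unicode control pictures for C0 controls are all assumptions made for illustration. It follows the category list above literally and ignores any other special-casing of known values.

```python
import unicodedata

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER

# Major categories from the list above: Letter, Mark, Number, Punctuation, Symbol.
# Control (Cc) is handled separately below.
ALLOWED_MAJOR = {"L", "M", "N", "P", "S"}


def display_char(ch: str, raw: bool = False) -> str:
    """Return something safe to show for ch (hypothetical helper)."""
    if raw:
        # Mirrors the idea of -r / -raw: hand back the character untouched.
        return ch
    cat = unicodedata.category(ch)
    if cat == "Cc":
        cp = ord(ch)
        if cp < 0x20:
            # Assumption: show C0 controls as control pictures (U+2400..U+241F).
            return chr(0x2400 + cp)
        if cp == 0x7F:
            return "\u2421"  # SYMBOL FOR DELETE
        return REPLACEMENT
    if cat[0] in ALLOWED_MAJOR:
        return ch
    # Everything else (Cf, Co, Cn, ...) lands here, including Private Use.
    return REPLACEMENT
```

With this sketch, `display_char("\uf8ff")` returns U+FFFD because the category of U+F8FF is Co (Private Use), while `display_char("\uf8ff", raw=True)` returns the character itself, which matches the two outputs shown above.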
On one hand this makes sense: "better safe than sorry, and override with -raw", and %(char) is meant for visual display first, rather than outputting the exact byte sequence.
On the other hand it's also confusing. I had to look this up myself and I forgot it worked like this. But I'm also not entirely sure how to do it better. In hindsight it would have been better to have %(display_char) and %(char) as separate format specifiers or something, but the -r flag predates the customizable output, and changing it now would break people's scripts.
Private Use is never safe to print, as we don't know what any of it is. I suppose we could support some common private use characters as an exception (that Apple icon actually works on my Linux machine, in my terminal anyway, but not Firefox), but that will probably make things more confusing in other cases.
In general, I'm leaning towards "leave it as-is, but document this better".
> Private Use is never safe to print, as we don't know what any of it is.
I would expect (and I look forward to being corrected on this) that the various meanings assigned to private use characters have all been as "normal" graphical characters. They're not control-control characters like LF, they're not going to change your terminal's background color, they're just going to display either as some medieval ligature or as your font's "unknown character" box.
Private Use can't have control characters that have very special meanings such as carriage return, left-to-right control stuff, and whatnot, but I think the font can do $anything fonts can do, right?
Specifically, things like combining characters, zero-width characters, ligatures, and things like that.
Maybe it's not really a practical issue; but if things do go wrong then there's no way to override it (there's no opposite of the -raw flag, and I'd rather not add that just for this).
I'm also not sure what else gets covered in this check; should probably look at that. The intent was "Other unprintable characters use the replacement character � (U+FFFD)". This covers things such as U+200E LEFT-TO-RIGHT MARK, but maybe we can print those better and then remove this. It was never really my intent to cover Private Use with this as such.
I had dinner and a think, and I'll change it to always print Private Use characters as-is. Looking around at how people are using the Private Use plane, that should be safe, and even when it's not, I think it's an edge case people will rarely encounter. Can always change it later if people experience problems.
Will use open box (␣, U+2423) for every other control character. U+FFFD is too confusing and looks like a bug.
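A rough sketch of what that revised behaviour could look like, again in Python and not taken from uni itself. The name `display_char_revised` is hypothetical, and the sketch leaves out the existing graphics for known control characters. It only shows the two changes: Private Use passes through, and the fallback becomes the open box.

```python
import unicodedata

OPEN_BOX = "\u2423"  # U+2423 OPEN BOX, the proposed fallback instead of U+FFFD

# Letter, Mark, Number, Punctuation, Symbol are printed as-is.
ALLOWED_MAJOR = {"L", "M", "N", "P", "S"}


def display_char_revised(ch: str, raw: bool = False) -> str:
    """Sketch of the proposed rule (hypothetical helper, not uni's code)."""
    if raw:
        return ch
    cat = unicodedata.category(ch)
    if cat == "Co":
        # Private Use: always print as-is under the new rule.
        return ch
    if cat[0] in ALLOWED_MAJOR:
        return ch
    # Anything else that is not safe to print gets the open box.
    return OPEN_BOX
```

Under this sketch U+F8FF comes back unchanged, and something like U+200E LEFT-TO-RIGHT MARK is shown as ␣ and no longer as �.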
When I use uni to query information on a private use character (such as the Apple logo, defined as U+F8FF on Apple computers), uni actually outputs U+FFFD, the replacement character. For example:
The command above ends with U+F8FF, but the character output by uni is �. Evidence: ef bf bd is the UTF-8 encoding for U+FFFD, not for U+F8FF.
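The byte-level evidence can be checked independently of uni. This is a small Python sketch, with `utf8_hex` being a helper name made up for this example:

```python
def utf8_hex(ch: str) -> str:
    """Return the UTF-8 encoding of ch as space-separated hex bytes."""
    return ch.encode("utf-8").hex(" ")


print(utf8_hex("\ufffd"))  # U+FFFD REPLACEMENT CHARACTER
print(utf8_hex("\uf8ff"))  # U+F8FF, the last code point of the BMP Private Use Area
```

The first line prints `ef bf bd` and the second prints `ef a3 bf`, so the two code points are clearly distinguishable in the UTF8 column of uni's output.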