arp242 / uni

Query the Unicode database from the commandline, with good support for emojis
MIT License

Search by utf8? #25

kbd closed 3 years ago

kbd commented 3 years ago

Recently had a problem with some code I copied from a coworker's message in Slack. For some reason, lines showed up as changed in git even though I couldn't see what was different. I put the file through a hex editor and saw e2 80 8b. I went to my usual tool for this type of thing, FileFormat.info, typed that in, and it came up with the right answer: zero-width spaces had been inserted.

I'd like to be able to use uni to search by UTF-8 byte sequence like that.
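
For the underlying problem (spotting invisible characters in pasted code), here is a minimal Python sketch that is independent of uni. It assumes that "invisible" means Unicode general category Cf (Format), which is the category U+200B belongs to; `find_format_chars` is a hypothetical helper name, not something from uni.

```python
import unicodedata


def find_format_chars(text):
    """Return (index, codepoint, utf8 hex, name) for every Cf (Format) character."""
    hits = []
    for index, char in enumerate(text):
        # Assumption: only category Cf is treated as "invisible" here.
        if unicodedata.category(char) == "Cf":
            utf8_hex = char.encode("utf-8").hex(" ")
            name = unicodedata.name(char, "<unnamed>")
            hits.append((index, f"U+{ord(char):04X}", utf8_hex, name))
    return hits


if __name__ == "__main__":
    sample = "x =\u200b 1"  # contains a zero-width space right after "="
    for hit in find_format_chars(sample):
        print(hit)
```

Running this over a file's contents would list each format character with its position, which is roughly what the hex editor step above did by hand.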

arp242 commented 3 years ago

This is pretty much what uni identify is:

[~]% uni identify asd
     cpoint  dec    utf8        html       name (cat)
'a'  U+0061  97     61          &#x61;     LATIN SMALL LETTE… (Lowercase_Lett…)
's'  U+0073  115    73          &#x73;     LATIN SMALL LETTE… (Lowercase_Lett…)
'd'  U+0064  100    64          &#x64;     LATIN SMALL LETTE… (Lowercase_Lett…)

Essentially it's a "UTF-8 hexdump".
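
To illustrate what a "UTF-8 hexdump" means here, the following Python sketch produces similar columns for each character of a string. It is not how uni is implemented; `identify` is a hypothetical function name, and the names come from Python's bundled `unicodedata` tables, which may lag behind the Unicode version uni ships.

```python
import unicodedata


def identify(text):
    """Yield (char, codepoint, decimal, utf8 hex, name) for each character in text."""
    for char in text:
        yield (
            char,
            f"U+{ord(char):04X}",           # codepoint in U+XXXX notation
            ord(char),                      # codepoint as a decimal number
            char.encode("utf-8").hex(" "),  # UTF-8 bytes, space-separated
            unicodedata.name(char, "<unnamed>"),
        )


if __name__ == "__main__":
    for row in identify("asd"):
        print(row)
```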

kbd commented 3 years ago

I'm talking about something like:

$ uni identify --utf8 "e2 80 8b"
     cpoint  dec    utf8        html       name (cat)
'�'  U+200B  8203   e2 80 8b    ​ ZERO WIDTH SPACE (Format)

arp242 commented 3 years ago

You could just copy the code from Slack to uni, right? That's how I use it anyway.

I suppose some syntax could be added to print; I'm not sure if it's a common use case, and I'm not likely to work on it any time soon, but I'll happily review and merge patches, or I'll probably take it up eventually.

Personally I'd just pipe it to grep (uni p all | grep 'e2 80 8b') in the rare case I'd want it, which is a wee bit slow, but works well enough.

kbd commented 3 years ago

$ uni p all | rg 'e2 80 8b'
 '�'  U+200B  8203   e2 80 8b    ​ ZERO WIDTH SPACE (Format)

oh, that'll work most of the time, thanks.

arp242 commented 3 years ago

You can now use uni p 'utf8:e2 80 8b', and a few variants thereof:

$ uni p 'utf8:e2 80 8b' 'utf8:e2808b' 'utf8:0xe2 0x80 0x8b' 'utf8:e2-80-8b'
     cpoint  dec    utf8        html       name (cat)
'�'  U+200B  8203   e2 80 8b    ​ ZERO WIDTH SPACE (Format)
'�'  U+200B  8203   e2 80 8b    ​ ZERO WIDTH SPACE (Format)
'�'  U+200B  8203   e2 80 8b    ​ ZERO WIDTH SPACE (Format)
'�'  U+200B  8203   e2 80 8b    ​ ZERO WIDTH SPACE (Format)

I think that should cover all the common syntaxes; the utf8: prefix is needed to disambiguate from codepoints, since uni p 0x200B or just uni p 200B without a leading U+ already prints the codepoint.
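
For anyone who wants the same lookup in a script, here is a rough Python sketch that accepts the four spellings shown above and decodes them. It is only an illustration of the idea, not uni's actual parsing code; `parse_utf8_arg` is a hypothetical name, and the exact set of separators it accepts (whitespace and dashes) is an assumption based on the examples in this thread.

```python
import re
import unicodedata


def parse_utf8_arg(arg):
    """Decode a 'utf8:'-prefixed string of hex bytes into text.

    Handles space-separated, unseparated, 0x-prefixed, and
    dash-separated bytes.
    """
    if not arg.startswith("utf8:"):
        raise ValueError("expected a 'utf8:' prefix")
    body = arg[len("utf8:"):]
    # Drop 0x prefixes and separators so only hex digits remain.
    digits = re.sub(r"0[xX]|[\s-]", "", body)
    # bytes.fromhex raises ValueError on odd-length or non-hex input;
    # decode raises UnicodeDecodeError (a ValueError subclass) on invalid UTF-8.
    return bytes.fromhex(digits).decode("utf-8")


if __name__ == "__main__":
    for arg in ("utf8:e2 80 8b", "utf8:e2808b", "utf8:0xe2 0x80 0x8b", "utf8:e2-80-8b"):
        for char in parse_utf8_arg(arg):
            print(f"U+{ord(char):04X}", unicodedata.name(char, "<unnamed>"))
```

Decoding the bytes directly avoids scanning every codepoint, which is what the uni p all | grep approach does.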