arp242 / uni

Query the Unicode database from the commandline, with good support for emojis
MIT License

Exclude multi-codepoint html entities #43

Closed m-cz closed 1 year ago

m-cz commented 1 year ago

Some HTML entities, such as &fjlig;, consist of multiple codepoints (f and j). Since we're mapping single codepoints to entities, this makes uni think that the HTML entity for f (0x66) is &fjlig;, which is not correct.
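
The fix described above can be sketched roughly as follows. This is a minimal illustration, not uni's actual code: the names `sampleEntities`, `singleCodepointEntities`, and `htmlEntity` are hypothetical, and the entity table is a small hand-picked subset of the HTML named character references. It assumes the reverse lookup is built from a name-to-string table and that codepoints without a named entity fall back to a hexadecimal numeric reference.

```go
package main

import (
	"strconv"
	"unicode/utf8"
)

// sampleEntities is a tiny, hand-picked subset of HTML named character
// references, keyed by name without the surrounding "&" and ";".
// It is illustrative only and not the table uni ships with.
var sampleEntities = map[string]string{
	"fjlig":  "fj",           // U+0066 U+006A: two codepoints
	"bne":    "=\u20e5",      // U+003D U+20E5: two codepoints
	"nLl":    "\u22d8\u0338", // U+22D8 U+0338: two codepoints
	"equals": "=",            // U+003D: one codepoint
	"Ll":     "\u22d8",       // U+22D8: one codepoint
}

// singleCodepointEntities builds a codepoint -> entity name lookup and skips
// every entity whose value is longer than one codepoint, so that "fjlig"
// can never be recorded as the entity for 'f'.
func singleCodepointEntities(entities map[string]string) map[rune]string {
	out := make(map[rune]string)
	for name, value := range entities {
		if utf8.RuneCountInString(value) != 1 {
			continue // multi-codepoint entity: not usable for a single rune
		}
		r, _ := utf8.DecodeRuneInString(value)
		out[r] = name
	}
	return out
}

// htmlEntity returns the named entity for r when one exists, and a
// hexadecimal numeric character reference otherwise.
func htmlEntity(r rune, byCodepoint map[rune]string) string {
	if name, ok := byCodepoint[r]; ok {
		return "&" + name + ";"
	}
	return "&#x" + strconv.FormatInt(int64(r), 16) + ";"
}
```

With this table, `htmlEntity('f', singleCodepointEntities(sampleEntities))` yields a numeric reference for f instead of &fjlig;, while = and ⋘ still resolve to their single-codepoint named entities.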

before:

$ uni identify f=⋘̸
     CPoint  Dec    UTF8        HTML       Name (Cat)
'f'  U+0066  102    66          &fjlig;    LATIN SMALL LETTER F (Lowercase_Letter)
'='  U+003D  61     3d          &bne;      EQUALS SIGN (Math_Symbol)
'⋘'  U+22D8  8920   e2 8b 98    &nLl;      VERY MUCH LESS-THAN (Math_Symbol)
'◌̸'  U+0338  824    cc b8       &#x338;    COMBINING LONG SOLIDUS OVERLAY (Nonspacing_Mark)

after:

$ ./uni identify f=⋘̸
     CPoint  Dec    UTF8        HTML       Name (Cat)
'f'  U+0066  102    66          &#x66;     LATIN SMALL LETTER F (Lowercase_Letter)
'='  U+003D  61     3d          &equals;   EQUALS SIGN (Math_Symbol)
'⋘'  U+22D8  8920   e2 8b 98    &Ll;       VERY MUCH LESS-THAN (Math_Symbol)
'◌̸'  U+0338  824    cc b8       &#x338;    COMBINING LONG SOLIDUS OVERLAY (Nonspacing_Mark)

arp242 commented 1 year ago

Seems good; thanks!