boinkor-net / chars

cha(rs) is a commandline tool to display information about unicode characters
https://github.com/boinkor-net/chars
MIT License

Suggestion: Include HTML character entity reference names in output and in search #22

Open ctsrc opened 4 years ago

ctsrc commented 4 years ago

With your tool it is possible to look up Unicode characters by various criteria, as you've stated in your readme, including "unicode name" and "also known as".

In HTML, named character escape sequences are available for things like the less-than and greater-than signs, but also for quite a few other characters.

Back in the day, before UTF-8 encoding support was widespread, we'd use the ISO-8859-1 encoding for our HTML and we'd use named character escape sequences for characters like æ, ø, å for example.

Some of those names stuck with me, and I sometimes search Google for characters by those names when I am on a machine where inputting them directly is not possible or just too cumbersome.

Even on my MacBook Air, where I can generally long-press certain keys to access other characters, some applications implement text input that does not support long-press. In those cases I either switch to another window and long-press there, or search for the character on Google, whichever is more convenient at the time (which depends on the other windows I happen to have on screen at that moment).

I pretty much always have at least one terminal window open at any time, and if I don't then opening the terminal is fast and simple.

Prior to purchasing my MacBook Air, when I was running Linux on a ThinkPad, I made a few simple shell scripts named after the HTML character entity references for the characters I most commonly needed: æ, ø, å, Æ, Ø, Å (aelig, oslash, aring, AElig, Oslash, Aring). When executed, each would print the UTF-8 encoded byte sequence for its character.

oslash
ø
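
The per-name scripts described above could be collapsed into a single helper. Here is a minimal sketch in Python, assuming the standard library's `html.entities.name2codepoint` table, which covers the HTML 4 names such as `oslash`. The function name `entity_utf8` is made up for illustration and is not part of chars or of my original scripts.

```python
from html.entities import name2codepoint


def entity_utf8(name: str) -> bytes:
    """Return the UTF-8 bytes for an HTML 4 entity name such as 'oslash'.

    Hypothetical helper for illustration; raises KeyError for unknown names.
    """
    # name2codepoint keys are case-sensitive and carry no '&' or ';',
    # so 'oslash' and 'Oslash' resolve to different code points.
    return chr(name2codepoint[name]).encode("utf-8")


# Print the character for a name, as the old scripts did.
print(entity_utf8("oslash").decode("utf-8"))
```

Note that `name2codepoint` only knows the older HTML 4 names, so something like `midast` would not be found this way.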

A full list of all HTML character entity references can be found at https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references#Character_entity_references_in_HTML

Aside from the six mentioned above, the most notable ones for me personally are laquo, raquo, ndash, mdash, eacute and Eacute. They are all useful IMO, though, and if you agree to include the HTML character entity reference names, I think it would make the most sense to include them all.

So, to get to the point: my suggestion is that, based on the table at https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references#Character_entity_references_in_HTML, an additional field be added to the output of chars for applicable characters.

Some examples of what the output of chars would look like:

Example 1

chars U+002A
ASCII 2/a,  42, 0x2a, 0052, bits 00101010
Width: 1, prints as *
Unicode name: ASTERISK
Also known as: Star, Splat, Aster, Times, Gear, Dingle, Bug, Twinkle, Glob
HTML entity names: ast, midast

Example 2

chars U+00AE
LATIN1 ae, 174, 0xae, 0256, bits 10101110
Width: 1 (2 in CJK context), prints as ®
Quotes as \u{ae}
Unicode name: REGISTERED SIGN
HTML entity names: reg, circledR, REG

Example 3

chars U+00C6
LATIN1 c6, 198, 0xc6, 0306, bits 11000110
Width: 1 (2 in CJK context), prints as Æ
Upper case. Downcases to æ
Quotes as \u{c6}
Unicode name: LATIN CAPITAL LETTER AE
HTML entity name: AElig

In the examples above, a field named "HTML entity names" (where multiple names exist) or "HTML entity name" (where only one name exists) has been added.
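
To illustrate how such a field could be derived, here is a rough sketch in Python, assuming the standard library's `html.entities.html5` mapping as the data source. chars itself is not written in Python, and the function names `entity_names` and `entity_field` are hypothetical, so treat this only as a description of the logic: collect every name that maps to the character, then pick the singular or plural label.

```python
from html.entities import html5
from typing import List, Optional


def entity_names(char: str) -> List[str]:
    """Collect every HTML entity name that maps to exactly `char`."""
    # html5 keys look like 'reg;' and, for legacy names, also 'reg';
    # stripping the ';' and using a set merges those duplicates.
    names = {key.rstrip(";") for key, value in html5.items() if value == char}
    return sorted(names)


def entity_field(char: str) -> Optional[str]:
    """Format the proposed output line, or None if no name applies."""
    names = entity_names(char)
    if not names:
        return None
    label = "HTML entity name" if len(names) == 1 else "HTML entity names"
    return f"{label}: {', '.join(names)}"


print(entity_field("*"))
print(entity_field("\u00c6"))
```

The sketch sorts the names only to get a stable order. That order will not necessarily match the one I wrote by hand in the examples above, and which order chars should use is an open choice.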

Furthermore, I request that searches on this field, where present, be case-sensitive, so that one can search by entity name and get results like those shown in the following examples:

Example 1

chars Oslash
LATIN1 d8, 216, 0xd8, 0330, bits 11011000
Width: 1 (2 in CJK context), prints as Ø
Upper case. Downcases to ø
Quotes as \u{d8}
Unicode name: LATIN CAPITAL LETTER O WITH STROKE
HTML entity name: Oslash

Example 2

chars oslash
LATIN1 f8, 248, 0xf8, 0370, bits 11111000
Width: 1 (2 in CJK context), prints as ø
Lower case. Upcases to Ø
Quotes as \u{f8}
Unicode name: LATIN SMALL LETTER O WITH STROKE
HTML entity name: oslash
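
The reason for asking for case-sensitive matching can be shown with a small sketch, again in Python and again assuming `html.entities.html5` as a stand-in for whatever table chars would actually use. Both function names are hypothetical.

```python
from html.entities import html5
from typing import Optional, Set


def find_by_entity_name(query: str) -> Optional[str]:
    """Case-sensitive lookup; returns the character(s) or None."""
    # Every named reference has a ';'-terminated key in html5.
    return html5.get(query + ";")


def case_insensitive_matches(query: str) -> Set[str]:
    """What a case-insensitive search would return, for comparison."""
    wanted = query.lower()
    return {
        value
        for key, value in html5.items()
        if key.rstrip(";").lower() == wanted
    }


print(find_by_entity_name("Oslash"))
print(find_by_entity_name("oslash"))
print(sorted(case_insensitive_matches("oslash")))
```

With case-sensitive matching each of the two queries resolves to a single character, whereas ignoring case makes `oslash` match both Ø and ø.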

ctsrc commented 4 years ago

Rather than the Wikipedia list I initially linked to, it would be better to use the official list of character entities at https://html.spec.whatwg.org/multipage/named-characters.html