boinkor-net / chars

cha(rs) is a commandline tool to display information about unicode characters
https://github.com/boinkor-net/chars
MIT License
183 stars 13 forks

Suggestion: Unicode version codepoint was added #48

Open wezm opened 4 years ago

wezm commented 4 years ago

I deal with Unicode a fair bit and chars is a handy tool. Sometimes it would be convenient to know which Unicode version assigned a particular codepoint.

E.g., the output from chars might look something like this. The version information need not be shown by default; it could require a command-line flag if it were deemed too noisy.

$ chars party
U+0001F973, 🥳 0x0001F973, \0374563, UTF-8: f0 9f a5 b3, UTF-16BE: d83edd73
Width: 2, prints as 🥳
Quotes as \u{1f973}
Unicode name: FACE WITH PARTY HORN AND PARTY HAT
Unicode version: 11.0

U+0001F389, 🎉 0x0001F389, \0371611, UTF-8: f0 9f 8e 89, UTF-16BE: d83cdf89
Width: 2, prints as 🎉
Quotes as \u{1f389}
Unicode name: PARTY POPPER
Unicode version: 6.0

I think the information is available via the DerivedAge.txt file in the UCD.
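
For reference, data lines in DerivedAge.txt have the shape `1F973 ; 11.0 # ...` for a single codepoint or `0041..005A ; 1.1 # ...` for a range, with `#` starting a comment. A minimal sketch of a line parser follows; the names `AgeEntry` and `parse_age_line` are made up for illustration and are not part of chars:

```rust
/// One parsed data line: an inclusive codepoint range plus the Unicode
/// version that first assigned it. (Hypothetical type, not from chars.)
#[derive(Debug, PartialEq, Eq)]
struct AgeEntry {
    start: u32,
    end: u32,
    major: u8,
    minor: u8,
}

/// Parses one line of DerivedAge.txt.
/// Returns None for blank lines, comment-only lines, and malformed input.
fn parse_age_line(line: &str) -> Option<AgeEntry> {
    // Everything after '#' is a comment.
    let data = line.split('#').next()?.trim();
    if data.is_empty() {
        return None;
    }
    // The codepoint field and the version field are separated by ';'.
    let (range, version) = data.split_once(';')?;
    let range = range.trim();
    // The codepoint field is either "START..END" or a single codepoint, in hex.
    let (start, end) = match range.split_once("..") {
        Some((s, e)) => (
            u32::from_str_radix(s, 16).ok()?,
            u32::from_str_radix(e, 16).ok()?,
        ),
        None => {
            let cp = u32::from_str_radix(range, 16).ok()?;
            (cp, cp)
        }
    };
    // The version field looks like "11.0".
    let (major, minor) = version.trim().split_once('.')?;
    Some(AgeEntry {
        start,
        end,
        major: major.parse().ok()?,
        minor: minor.parse().ok()?,
    })
}
```

Calling `parse_age_line` on each line of the file and keeping the `Some` results would give the list of ranges with their versions.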

antifuchs commented 4 years ago

This is a marvelous idea! Thanks for submitting it! :D

I'm not sure I can take a look at this in the next few weeks, but would love to have this feature. If you want to take a stab at it, I can probably give you enough guidance to get you started, though (:

wezm commented 4 years ago

I might be able to take a look on the weekend. Did you have any preferences/thoughts regarding whether the version information should be output by default?

antifuchs commented 4 years ago

I think showing the version unconditionally would be just fine - chars is somewhat aggressively non-configurable and maximally informative for human users, so just adding it would work well (:

To add this feature, I think it's a two/three step process:

  1. you'd add a task to fetch the data file to the chars_data subcrate in the chars workspace here,
  2. update write_name_data in the unicode portion to emit another table giving unicode versions & the ranges added in them (ideally make it a memory-optimized data structure; I don't extremely mind searching through n*13ish unicode versions for each character, but would be worried if we added a table mapping each character to a version number... maybe there's something one could do with tries though?)
  3. Update the Codepoint Display impl's branch for Unicode here-ish to show the version number.

...and that's about it, I think! The main difficulty will probably be writing a parser for that data file (for the ones I made, I got by with regex-based parsers, but feel free to use any other reasonable method, tbqh) and finding a decently space-efficient repr for the version table. Best of luck!
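
One possible shape for the space-efficient version table is a sorted slice of non-overlapping ranges that is binary-searched per codepoint, so the cost grows with the number of ranges rather than the number of characters. This is only a sketch under assumptions: `Version`, `AgeRange`, `AGE_TABLE`, and `version_of` are hypothetical names, the three rows are a hand-picked sample rather than generated data, and nothing here reflects what `write_name_data` actually emits.

```rust
use std::cmp::Ordering;
use std::fmt;

/// A Unicode version such as 11.0. (Hypothetical type, not from chars.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Version {
    major: u8,
    minor: u8,
}

impl fmt::Display for Version {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}.{}", self.major, self.minor)
    }
}

/// An inclusive codepoint range and the version that assigned it.
struct AgeRange {
    start: u32,
    end: u32,
    version: Version,
}

/// Sample rows only, sorted by `start` and non-overlapping.
/// A real table would be generated from DerivedAge.txt.
static AGE_TABLE: &[AgeRange] = &[
    AgeRange { start: 0x0041, end: 0x005A, version: Version { major: 1, minor: 1 } },
    AgeRange { start: 0x1F389, end: 0x1F38A, version: Version { major: 6, minor: 0 } },
    AgeRange { start: 0x1F973, end: 0x1F976, version: Version { major: 11, minor: 0 } },
];

/// Binary-searches the table for the range containing `cp`.
/// Returns None when no range covers the codepoint.
fn version_of(table: &[AgeRange], cp: u32) -> Option<Version> {
    table
        .binary_search_by(|r| {
            if r.end < cp {
                Ordering::Less // range lies entirely below cp
            } else if r.start > cp {
                Ordering::Greater // range lies entirely above cp
            } else {
                Ordering::Equal // cp falls inside this range
            }
        })
        .ok()
        .map(|i| table[i].version)
}
```

With a table like this, the display step could look up `version_of(AGE_TABLE, 0x1F973)` and, when it returns a version, print a line along the lines of `Unicode version: 11.0` using the `Display` impl. A trie would likely give faster lookups, at the cost of a more involved generator.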

wezm commented 4 years ago

I made a start on this yesterday. I'm 50–75% done. Fortunately I think what you described above matches what I did/planned to do 😃

antifuchs commented 4 years ago

That's fantastic to hear - excited to see what you came up with (: