fleschutz / Base256U

C++ sample implementation of Base256U (base256 encoding using Unicode characters).
Creative Commons Zero v1.0 Universal

Use more ASCII characters for more efficient UTF-8 encoding #2

Open KeinNiemand opened 1 month ago

KeinNiemand commented 1 month ago

If you want your Base256 encoding to be as efficient as possible, you should use every available printable ASCII character. In UTF-8, an ASCII character is represented in 1 byte, while anything beyond ASCII takes at least 2 bytes. So use all 94 (without space) or 95 (with space) printable ASCII characters, and use non-ASCII Unicode characters only for the remaining alphabet positions (indices 94-255 or 95-255, counting from 0).
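To put rough numbers on this suggestion, here is a minimal sketch of the arithmetic. It assumes that input bytes are uniformly distributed and that every non-ASCII symbol in the alphabet lies in U+0080..U+07FF (2 bytes in UTF-8). The function names are made up for illustration and are not part of the Base256U code.

```cpp
#include <cassert>
#include <cstddef>

// Average number of UTF-8 bytes emitted per input byte for a 256-symbol
// alphabet in which `asciiSymbols` entries are ASCII (1 byte each).
// Assumption: all other entries are 2-byte code points (U+0080..U+07FF)
// and every input byte value is equally likely.
double averageUtf8BytesPerInputByte(int asciiSymbols)
{
    assert(asciiSymbols >= 0 && asciiSymbols <= 256);
    const int twoByteSymbols = 256 - asciiSymbols;
    return (asciiSymbols * 1.0 + twoByteSymbols * 2.0) / 256.0;
}

// Expected UTF-8 size of an encoded payload under the same assumptions.
double expectedEncodedSize(std::size_t payloadBytes, int asciiSymbols)
{
    return static_cast<double>(payloadBytes)
         * averageUtf8BytesPerInputByte(asciiSymbols);
}
```

Under these assumptions, an alphabet with 94 ASCII symbols averages about 1.63 UTF-8 bytes per input byte, so a 32-byte hash would take roughly 52 bytes on average instead of 64 with a purely non-ASCII alphabet.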

fleschutz commented 1 month ago

Yes, using more ASCII characters would be more efficient (1 byte vs 2 bytes). However, I decided against it for 2 reasons:

  1. Base256 is typically used only for small data, e.g. 8/16/32/64/128 bytes for passwords/hashes/etc. Therefore, efficiency is not the top priority.
  2. Please note that only non-terminal characters have been used for Base256, so that a double-click selects the whole string for copy & paste. With terminal characters (+-/*.,%&=?...) this would be much more complicated and error-prone. Just think of a period or comma at the end of Base256 data.
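The second point can be illustrated with a small sketch that lists which printable ASCII characters are punctuation rather than letters or digits. It assumes the default "C" locale and treats only ASCII alphanumerics as "word-like"; actual double-click selection rules vary between applications, and the helper names are hypothetical rather than taken from the Base256U code.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: treats only ASCII letters and digits as characters
// that a double-click would select as part of one word.
// Assumption: default "C" locale; real selection rules differ per application.
bool isWordLikeAscii(unsigned char c)
{
    return std::isalnum(c) != 0;
}

// Collects the printable ASCII characters (0x21..0x7E, space excluded)
// that are not word-like, i.e. the punctuation and symbol characters.
std::string punctuationInPrintableAscii()
{
    std::string result;
    for (unsigned char c = 0x21; c <= 0x7E; ++c) {
        if (!isWordLikeAscii(c)) {
            result += static_cast<char>(c);
        }
    }
    return result;
}
```

Under this assumption, 32 of the 94 printable ASCII characters are punctuation or symbols, which leaves 62 ASCII letters and digits that are safe against word-boundary and trailing-punctuation problems.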