MinisculeGirraffe / mojibake

Encode/Decode bytes as emoji base2048
https://crates.io/crates/mojibake
MIT License

Add encoding optimizing for Grapheme Clusters instead of Graphemes #3

Open esoterra opened 1 year ago

esoterra commented 1 year ago

While services often use grapheme count for character limits, the better analog for number of visual elements is grapheme clusters.

An encoding that takes advantage of the zero-width joiner (ZWJ) to encode grapheme clusters made of multiple graphemes (e.g. gender, skin tone modifiers) should improve the visual density of encoded information. As a bonus, this will also increase the diversity of generated emojis.
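To make the distinction concrete, here is a minimal sketch, assuming "grapheme" above means a Unicode code point. It counts the code points behind two emoji that each render as a single visual element. The helper names `code_points` and `has_zwj` are made up for illustration and are not part of mojibake. Note that skin tone variants are built with a modifier code point appended directly to the base emoji, while gender and profession variants typically use U+200D.

```rust
/// Number of Unicode scalar values (code points) in a string.
fn code_points(s: &str) -> usize {
    s.chars().count()
}

/// True if the string contains a zero-width joiner (U+200D).
fn has_zwj(s: &str) -> bool {
    s.contains('\u{200D}')
}

fn main() {
    // Woman astronaut: WOMAN + ZWJ + ROCKET, rendered as one emoji.
    let astronaut = "\u{1F469}\u{200D}\u{1F680}";
    // Waving hand, medium skin tone: base emoji + modifier, no ZWJ.
    let wave = "\u{1F44B}\u{1F3FD}";

    for s in [astronaut, wave] {
        println!(
            "{}: {} code points, {} UTF-8 bytes, zwj = {}",
            s,
            code_points(s),
            s.len(),
            has_zwj(s)
        );
    }
}
```

Counting grapheme clusters themselves needs Unicode segmentation rules, which the standard library does not provide, so this sketch only shows the code point side of the comparison.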

Reference

MinisculeGirraffe commented 1 year ago

I actually kind of do this already! But it could probably be cooler about the set of emojis we consider.

There's about 2500 total emoji from the emoji-sequences.txt file, and we need a minimum of 2048 + 8 for tail encoding bytes.
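The arithmetic behind that minimum can be sketched as follows. The only facts used are that a 2048-symbol alphabet carries 11 bits per symbol and that 8 extra tail symbols are needed, as stated above. The constant and function names are hypothetical, and 2500 is the approximate count from the comment, not an exact figure.

```rust
/// Bits carried by one symbol of a 2048-entry alphabet (2^11 = 2048).
const BITS_PER_SYMBOL: u32 = 11;
/// Size of the main alphabet.
const MAIN_SYMBOLS: usize = 1 << BITS_PER_SYMBOL;
/// Extra symbols reserved for tail encoding (figure taken from the comment).
const TAIL_SYMBOLS: usize = 8;

/// Minimum number of distinct emoji the encoding needs.
fn required_alphabet() -> usize {
    MAIN_SYMBOLS + TAIL_SYMBOLS
}

/// Emoji left over after filling the alphabet, or None if there are too few.
fn spare_symbols(available: usize) -> Option<usize> {
    available.checked_sub(required_alphabet())
}

fn main() {
    println!("required: {}", required_alphabet());
    println!("spare out of ~2500: {:?}", spare_symbols(2500));
}
```

With roughly 2500 candidates there is some slack, which is what leaves room to choose which emoji make it into the set.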

All of the emojis that support sequence modifiers are towards the end of the file, so I'm reading the file in reverse when generating the lookup maps at compile time. That way the character set that's used will include all the permutations of gender + skin color.

This doesn't cover every possible combination of modifiers, though, just a good chunk of them.
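The reverse-read selection described above might look roughly like the sketch below. This is not the crate's actual build code: `parse_line`, `select_alphabet`, and the `SAMPLE` text are hypothetical, and the sample is a hand-written excerpt in the `code points ; type ; description` layout of emoji-sequences.txt. The real generation happens at compile time and would take 2048 + 8 entries rather than 3.

```rust
/// Hand-written sample in the layout of emoji-sequences.txt.
const SAMPLE: &str = "\
# comment lines start with '#'
231A..231B    ; Basic_Emoji                  ; watch..hourglass done
1F600         ; Basic_Emoji                  ; grinning face
1F44B 1F3FB   ; RGI_Emoji_Modifier_Sequence  ; waving hand: light skin tone
1F44B 1F3FD   ; RGI_Emoji_Modifier_Sequence  ; waving hand: medium skin tone
";

/// Turn one line into the emoji strings it describes (empty for comments).
fn parse_line(line: &str) -> Vec<String> {
    // Drop any trailing comment, then keep the first ';'-separated field.
    let data = line.split('#').next().unwrap_or("");
    let field = data.split(';').next().unwrap_or("").trim();
    if field.is_empty() {
        return Vec::new();
    }
    // A range such as "231A..231B" expands to one emoji per code point.
    if let Some((lo, hi)) = field.split_once("..") {
        let lo = u32::from_str_radix(lo.trim(), 16);
        let hi = u32::from_str_radix(hi.trim(), 16);
        return match (lo, hi) {
            (Ok(lo), Ok(hi)) => (lo..=hi)
                .filter_map(char::from_u32)
                .map(String::from)
                .collect(),
            _ => Vec::new(),
        };
    }
    // Otherwise the field is a space-separated sequence forming one emoji.
    let emoji: Option<String> = field
        .split_whitespace()
        .map(|hex| u32::from_str_radix(hex, 16).ok().and_then(char::from_u32))
        .collect();
    emoji.into_iter().collect()
}

/// Walk the file from the last line to the first and keep the first `n` emoji,
/// so modifier sequences near the end of the file are picked up first.
fn select_alphabet(text: &str, n: usize) -> Vec<String> {
    text.lines().rev().flat_map(parse_line).take(n).collect()
}

fn main() {
    for emoji in select_alphabet(SAMPLE, 3) {
        println!("{}", emoji);
    }
}
```

Because the cut-off is applied after reversing, whichever entries sit earliest in the file are the ones dropped when there are more candidates than slots.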