Open mentalisttraceur opened 3 years ago
I thought of something similar before, because I also faced the problem of "how do I make it maximally easy for myself to manually transcribe arbitrary data", so when I saw this I thought it was really cool that someone else had gone through the same process!
I never actually coded my version, but in my idea I was using the wordlist used by the pwqgen utility (from Openwall's passwdqc). I don't know if you considered that wordlist, but I think you might find that word list more useful for your goals, because it has a few useful properties:
In my head I was initially calling the idea "base4096 encoding", just like you called yours "base256". Then to emphasize the human-friendliness and deemphasize any implication of space-efficiency, I decided to call it "human4096 encoding". Then I realized that was pretty English-centric of me, so I'm tempted to call it "english4096". That's nice because it also leaves room for many other "{{ human language name }}4096" encodings. Or maybe some other number than 4096. But 4096 seems like a good sweet spot for distinct words that humans can keep track of which also track byte boundaries with a reasonably regular period.
Biggest differences from your implementation:
Biggest relative disadvantage of the english4096 encoding is the implementation complexity: the lack of 1-1 mapping to bytes.
(12-bit steps over 8-bit bytes is regular enough that we can still do a simple loop which reads three bytes at a time and has a special case for the two possible short reads. But it's still more code and bit-twiddling than just a simple table lookup.
Also, english4096 needs one extra word to handle those two cases in the decoder - if this modifier word is present, the word it modifies only encodes the bits of the first of the two bytes it would normally encode bits of. My intuition is to put it in front of the last word rather than behind, to make the logic simpler in the decoder, since that way the decoder doesn't have to "hold back" the bytes for a word until after checking if the next word is a modifier. You could use a modifier character or capitalization to signal this instead of an extra word, but I think capitalization is worse for human-friendliness because it complicates the mental model from the stupidly simple "just type the words you see" with extra details like "you need to be mindful of capitalization/symbols". To someone with attention to detail and computer savvy, the idea of replicating text exactly is automatic, but to a lot of lay people it is not necessarily intuitive that special characters or capitalization can't be just ignored when transcribing.)
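A minimal Python sketch of the loop described above, under some assumptions of mine: it emits 12-bit word indices rather than words (so it does not need the word list), and `MODIFIER` is a hypothetical index standing in for the extra modifier word. The names `encode_indices` and `MODIFIER` are illustrative only.

```python
MODIFIER = 4096  # hypothetical index for the extra modifier word


def encode_indices(data):
    """Turn bytes into a list of 12-bit word indices, three bytes at a time."""
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        # Left-align a short final chunk in 24 bits; the missing bits become zero.
        value = int.from_bytes(chunk.ljust(3, b"\x00"), "big")
        first, second = value >> 12, value & 0xFFF
        if len(chunk) == 3:
            out += [first, second]
        elif len(chunk) == 2:
            # Short read of two bytes: the modifier goes in front of the last word.
            out += [first, MODIFIER, second]
        else:
            # Short read of one byte: a single modified word.
            out += [MODIFIER, first]
    return out
```

Turning the indices into text would then just be a table lookup into the word list plus joining with spaces.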
Anyway, you've inspired me to actually implement and document english4096 encoding! Cheers!
Thank you for the kind words and for the details about your encoding.
I will keep this ticket open, as I might decide to add an implementation of your english4096 idea to my utility as well.
Love to hear that! Would be super cool to have multiple compatible implementations!
I'll try to finally publish a precise spec and some test case examples later this week, then.
Okay, here's a fuller, precise spec:
If you need a name string for the format, use english4096 (all lowercase if possible).
The word list is the 4096 words from OpenWall's passwdqc's pwqgen's wordlist, plus "zygote".
The first 4096 words encode bit patterns from 0x000 to 0xFFF (so "aback" is 0x000, "abbey" is 0x001, ..., "will" is 0xFAB, "willow" is 0xFAC, and so on).
The canonical word list order is alphabetical. When one word is an exact prefix of another (for example "will" and "willow"), the shorter one sorts "first" (closer to A, further from Z).
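As a rough illustration of that ordering rule: for all-lowercase ASCII words, Python's default string sort already puts an exact prefix before the longer word. The helper name `build_index` is mine, and the sketch assumes it is given only the 4096 data words (without "zygote").

```python
def build_index(words):
    """Map each word to its 12-bit value, using the canonical alphabetical order."""
    ordered = sorted(words)  # code-point order: a prefix sorts before its extensions
    return {word: value for value, word in enumerate(ordered)}


# Tiny stand-in list, not the real word list.
sample = build_index(["willow", "will", "abbey", "aback"])
print(sample)  # {'aback': 0, 'abbey': 1, 'will': 2, 'willow': 3}
```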
Encoding operates on chunks of 24 bits: every three octets of input (8 bits x 3) produce two words of output (12 bits x 2).
For example, if the first three bytes of the input are 0x01 0x23 0x45, that's 0x012345, which re-splits as 0x012 0x345, so the output is accent cruise.
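The re-split in that example is just two shifts and a mask. A small Python check of the arithmetic (the function name `resplit` is only for illustration, and the word lookup is left out):

```python
def resplit(b0, b1, b2):
    """Join three octets into 24 bits, then cut them into two 12-bit values."""
    value = (b0 << 16) | (b1 << 8) | b2
    return value >> 12, value & 0xFFF


first, second = resplit(0x01, 0x23, 0x45)
print(f"{first:#05x} {second:#05x}")  # 0x012 0x345
```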
The last/4097th word, "zygote", prevents ambiguous outputs by indicating that the next word only contributes to one byte.
For example,
0x12 0x34 0x00 re-splits as 0x123 0x400, encodes to basin drag.
0x12 0x34 re-splits as 0x123 0x4__, encodes to basin zygote drag.
0x12 0x00 re-splits as 0x120 0x0__, encodes to base zygote aback.
0x12 re-splits as 0x12_ 0x___, encodes to zygote base.
Implementations should accept redundant encodings where the word "zygote" occurs in the middle of the word string, or multiple times, so long as each occurrence is individually valid.
For example, zygote base zygote crow is the catenation of separately encoding 0x12 and 0x34, and that whole string should decode to 0x12 0x34. Similarly, basin zygote drag forum mast should decode to 0x12 0x34 0x56 0x78 0x90.
Implementations should reject unused encodings where the ignored bits are non-zero (basin zygote dragon, basin zygote fig, zygote basic, zygote beacon, ...) and where the word zygote appears as the last word.
Decoding should be case-insensitive.
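Here is one way the decoding rules above could be sketched in Python. It is not a reference implementation. Assumptions of mine: `wordlist` is a hypothetical list holding the 4097 words in canonical order with "zygote" last, the decoder works on word indices, and two cases the spec does not spell out (two "zygote" words in a row, and a final unmodified word with no partner) are rejected.

```python
MODIFIER = 4096  # index of the 4097th word ("zygote")


def words_to_indices(text, wordlist):
    """Look up whitespace-separated words, ignoring case."""
    lookup = {word.lower(): index for index, word in enumerate(wordlist)}
    return [lookup[word.lower()] for word in text.split()]  # KeyError if unknown


def decode_indices(indices):
    """Decode word indices (0..4096) into bytes, validating as we go."""
    out = bytearray()
    pending = None    # first 12-bit value of a chunk, waiting for its partner
    modified = False  # True right after the modifier word
    for index in indices:
        if index == MODIFIER:
            if modified:
                raise ValueError("two modifier words in a row")  # my assumption
            modified = True
        elif not 0 <= index < MODIFIER:
            raise ValueError(f"index out of range: {index}")
        elif pending is None and not modified:
            pending = index
        elif pending is None:
            # Modified first word: the top 8 bits are one byte, low 4 are ignored.
            if index & 0x00F:
                raise ValueError("ignored bits are non-zero")
            out.append(index >> 4)
            modified = False
        else:
            value = (pending << 12) | index
            if modified:
                # Modified second word: only its top 4 bits count.
                if index & 0x0FF:
                    raise ValueError("ignored bits are non-zero")
                out += value.to_bytes(3, "big")[:2]
            else:
                out += value.to_bytes(3, "big")
            pending, modified = None, False
    if modified:
        raise ValueError("modifier word at the end")
    if pending is not None:
        raise ValueError("unpaired final word")  # my assumption
    return bytes(out)
```

With the real word list loaded, `decode_indices(words_to_indices("Basin Zygote Drag", wordlist))` should give 0x12 0x34 if the indices match the examples in the spec.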
I really like the thought process behind this. Nice idea.