Open · issuefiler opened this issue 3 years ago
…`value` right after getting out of the `for` loop. 🤔 Isn't this an encoding issue? (it says "to be encoded"). In other implementations (e.g. Python) this is also ignored:
```console
$ python3 -c 'import base64; print(base64.b32decode("MY======").hex())'
66
$ python3 -c 'import base64; print(base64.b32decode("MZ======").hex())'
66
```

Both commands print `66`, indicating that bits are dropped.
@martinheidegger I may be misunderstanding something, but I think @issuefiler is on to a fundamental issue here.
Crockford's Base 32 is unfortunately underspecified, but it's described as a textual encoding of numbers where each encoded character represents 5 bits, whereas regular Base 32 is described as a representation of arbitrary sequences of octets. Crockford Base 32 never mentions octets or bytes.
So I'd expect 14 to have a Crockford Base 32 encoding, `E` (and for `E` to decode to 14). Symbol value 14 => encode symbol `E`.
Regular base 32 seems to say that two characters followed by six `=` padding characters, as in your examples, represent an input of exactly 8 bits. So is there an input value that would encode to `MZ======` in regular base 32?
With Crockford Base 32 the zero-extension does apply to input, so 14 (`1110`) would be padded to 5 bits on input (`01110`). But buffers only deal in multiples of 8 bits. So any number that takes <= 5 bits to represent and is encoded as Crockford Base 32 should decode to a buffer with a single byte, right? In the case of 14 / `E`: `00001110`.
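(For illustration, here is a minimal TypeScript sketch of the reading argued for here; `decodeSymbolToByte` is a hypothetical helper, not part of this library, and Crockford's I/L/O/U aliasing is omitted for brevity.)

```ts
const ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"; // Crockford's symbol set

// Hypothetical number-oriented reading: a single Crockford symbol is its
// 5-bit value, zero-padded on the left to one full byte.
function decodeSymbolToByte(symbol: string): Uint8Array {
  const value = ALPHABET.indexOf(symbol.toUpperCase());
  if (value < 0) throw new Error(`invalid symbol: ${symbol}`);
  return Uint8Array.of(value); // 'E' -> 14 -> 0b00001110
}

console.log(decodeSymbolToByte("E")); // Uint8Array [ 14 ]
```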
Every character in base32 encodes 5 bits (not an octet or byte).
`0011010100100101` (length = 16, i.e. 2 bytes)

Given ↑ as the input, its length is not a multiple of 5. As I understand the spec, the encoder should assume the missing trailing bits to be 0:

`00110101001001010000` (length = 20)
| char 1 | char 2 | char 3 | char 4 |
| --- | --- | --- | --- |
| `00110` | `10100` | `10010` | `10000` |
and represents this as c-base32 digits.
If the encoder decides to do something different here, e.g. fill it with 1s: `00110101001001011111`, then the encoder did something wrong. The decoder could treat this either as a malformed string or simply drop the last bits, which is usually the friendlier and faster option.
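For concreteness, here is a minimal TypeScript sketch of the octet-oriented behaviour described above (zero-extend on encode, drop trailing sub-byte bits on decode); `encode`/`decode` are hypothetical names, not this library's actual API, and the I/L/O/U decode aliases are again omitted:

```ts
const ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"; // Crockford's symbol set

// Encode: read octets most-significant-bit first and emit one symbol per
// 5-bit group, zero-extending the final group as described above.
function encode(bytes: Uint8Array): string {
  let out = "";
  let acc = 0;  // bit accumulator
  let bits = 0; // number of valid bits in acc
  for (const byte of bytes) {
    acc = (acc << 8) | byte;
    bits += 8;
    while (bits >= 5) {
      out += ALPHABET[(acc >> (bits - 5)) & 31];
      bits -= 5;
    }
    acc &= (1 << bits) - 1; // keep only the leftover bits
  }
  if (bits > 0) out += ALPHABET[(acc << (5 - bits)) & 31]; // zero-extension
  return out;
}

// Decode: accumulate 5 bits per symbol, emit full bytes, and silently drop
// any trailing bits that do not form a full byte (the "friendlier" option).
function decode(text: string): Uint8Array {
  const out: number[] = [];
  let acc = 0;
  let bits = 0;
  for (const char of text.toUpperCase()) {
    const value = ALPHABET.indexOf(char);
    if (value < 0) throw new Error(`invalid symbol: ${char}`);
    acc = (acc << 5) | value;
    bits += 5;
    if (bits >= 8) {
      out.push((acc >> (bits - 8)) & 255);
      bits -= 8;
      acc &= (1 << bits) - 1;
    }
  }
  return Uint8Array.from(out); // leftover bits (< 8) are discarded here
}

console.log(encode(Uint8Array.of(0b00110101, 0b00100101))); // "6MJG"
console.log(decode("6MJG")); // Uint8Array [ 53, 37 ] = 00110101 00100101
```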
Thanks for the reply and illustration.
> Every character in base32 encodes 5 bits (not an octet or byte).

Agreed:

> each encoded character represents 5 bits
This is the problem with a "spec" that's vague, informal, and doesn't contain even a single example :/
So your interpretation is that decoding has to be done left to right, and should discard bits that are less than a full byte? And that the shortest possible encoded value is 2 characters?
I can see how that's easier from the perspective of working with buffers or streaming. It's less helpful from the perspective of maximizing the amount of data encoded in a given string length. E.g. say you want to encode 70 bits of data: in principle that fits in 14 characters. But the way you interpret Crockford Base 32, you'd need a 72-bit buffer as input, which would get extended to 75 bits and encoded as 15 characters, right?
The same discussion unfolds in this article and the comments, and some of the comments suggest that both interpretations are useful and could co-exist, but then there's not just one Crockford Base 32 encoding.
I think it would be good to document the approach that's implemented here.
Another thing is that with that approach you have limited control over the output size, right? Like there's no way to get an encoded string that's 14 characters long, because you'd either have 64 bits / 8 bytes of input that gets extended to 65 bits and encoded as 13 characters, or 72 bits / 9 bytes of input that gets extended to 75 bits and encoded as 15 characters.
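(To make the length arithmetic in the last two paragraphs concrete, here is a small sketch; the `encodedLength` helper is hypothetical.)

```ts
// Output length under the octet-oriented interpretation: n input bytes
// yield ceil(8n / 5) symbols.
const encodedLength = (inputBytes: number) => Math.ceil((inputBytes * 8) / 5);

console.log(encodedLength(8)); // 13 (64 bits, zero-extended to 65)
console.log(encodedLength(9)); // 15 (72 bits, zero-extended to 75)
// 14 symbols would correspond to 70 bits, which no whole number of
// octets produces, hence the gap described above.
```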
It is my strong assumption that Crockford built on rfc4648.
"decoding has to be done left to right"
Yes. Following the rfc:
presumed to be ordered with the most-significant-bit first
... should discard bits that are less than a full byte?
Yes. As base32 encoding is made for octets/bytes in the rfc:
The Base 32 encoding is designed to represent arbitrary
sequences of octets 👈 in a form that needs to be
case insensitive but that need not be human readable.
This means binary data that does not fall into whole octets cannot be accurately represented by it. In Node.js we don't have a binary structure that specifies the number of used bits (only bytes).
> ... shortest possible encoded value is 2 characters?

Exactly.
> I think it would be good to document the approach that's implemented here.
I will review a PR. :wink:
> It is my strong assumption that Crockford built on rfc4648.
I imagine it was influenced by other stuff out there, but I'm not sure it has the same focus. E.g. it's described as an encoding of numbers (not octets) and emphasizes being human readable, compact, and pronounceable, talking about reading the symbols over the phone. So I'm not sure it's as focused on the use case of encoding arbitrarily long sequences of bytes, though of course it can.
In any case, I'm not sure the octet-oriented approach, and therefore this library, is a good fit for my use case, where maximizing density while having control over output length is beneficial and where I don't have to worry about encoding arbitrarily long sequences of bytes.
Thanks for discussing!
Just one thing to amend: Crockford mentioning Base 10/64/32 in the spec is the basis for my assumption.
Using Base32/64 character sets to encode/decode numbers is certainly an interesting concept, and `BigInt` would be a good base for this. However, you will likely need to go with the "least-significant-bit first" approach for it. I'd also call it something different to distinguish it from traditional base32 encoding.
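(A minimal sketch of what such a number-oriented encoder could look like, assuming a non-negative `bigint` input and extracting the least-significant 5 bits each round; `encodeNumber` is a hypothetical name, not part of this library.)

```ts
const ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"; // Crockford's symbol set

// Hypothetical number-oriented encoder: treats the input as an integer,
// not as a sequence of octets, so no zero-extension is ever needed.
function encodeNumber(value: bigint): string {
  if (value < 0n) throw new RangeError("only non-negative values supported");
  if (value === 0n) return "0";
  let out = "";
  while (value > 0n) {
    out = ALPHABET[Number(value & 31n)] + out; // least-significant 5 bits first
    value >>= 5n;
  }
  return out;
}

console.log(encodeNumber(14n)); // "E", the single-character case from above
```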