Open SimonLab opened 5 years ago
@SimonLab That is super interesting and very helpful to understand. Thanks 👍
@SimonLab good write-up on the length of CIDs. 👍 Have you taken a look at how other JS/Ex implementations are doing the Binary/Base16 to Base58BTC encoding/decoding?
@nelsonic this is my next step. I had a look at https://github.com/nocursor/b58 (which use multiple type of set characters, we need just one for now) and I'm going to create a simplified version now that we have manage to create a CIDv1 (in base16) matching the one from the js-cid and ex_cid packages
The first step of the
cid decode
algorithm (https://github.com/multiformats/cid#decoding-algorithm) is:and it was not obvious to me why the character length of the CIDv0 which is a multihash is 46.
Here is my understanding of the format of the multihash:
Multihash Format
Multihashes are bytes where:
ex:
00010001 00000100 101101100 11111000 01011100 10110101
00010001
represent the hash function used00000100
represent the length of the hash in bytes101101100 11111000 01011100 10110101
represents the hashThe first and second part are defined as varint (VAriable Integer). See https://github.com/multiformats/unsigned-varint and https://developers.google.com/protocol-buffers/docs/encoding#varints to understand how varints are created. In our case because there is only 1 byte for each part we can simply do a conversion from binary to decimal:
Then with the following table we can check the representation: https://github.com/multiformats/multicodec/blob/master/table.csv
So we can deduce that the multihash
00010001 00000100 101101100 11111000 01011100 10110101
is a sha1 and the hash value length is 4 bytes or 4x8 = 32 bits sha1 normally returns a hash of 20 bytes (160 bits) but for this example we reduced the length to 4 bytesWhy CIDv0 is 46 characters length?
Now that we understand the format of a multihash we can calculate the total number of bytes in a CIDv0. a CIDv0 is a multihash where the hash function is
sha2-256
(see https://github.com/multiformats/cid#versions).From the table above we know that
sha2-256
is represented with 0x12 which is 1x16^1 + 2x16^0 = 18. From the definition of varint (see links above) we have 18 < 127 so the hash function used (first part) can be encoded with 1 byteFrom the definition of sha2-256 (https://en.wikipedia.org/wiki/SHA-2) we know that the hash value will be 256 bits long, ie 256/8 = 32 bytes. Again 32 < 127 so the length of the hash (part 2) is represented with 1 byte
We now have:
The the CIDv0 length will always be 34 bytes length.
The default representation of these bytes is done using base58btc. This base is build from base64 (some characters have been removed, see #2). So the multihash is represented in these based with a set of 58 characters (123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz) see https://en.wikipedia.org/wiki/Base58. So we need numbers from 0 to 57 to be able to represent this set in binary and 57 is 111001 which is 6 bits. So each digit in base58 is represented with 6 bits.
To conclude we have 34 bytes for a multihash which is *348= 272 bits and 272/6 = 45.33 so we need 46 characters to represent the multihash**
@RobStallion @nelsonic I think this is the reason why CIDv0 are 46 length, but if you think have done a mistake especially with the encoding to base58 and spliting the multihash in group of 6bits let me know :+1: