dwyl / cid

❄️cid ("content id") is a human-friendly (readable/typeable) unique ID function built for distributed/decentralised systems.
GNU General Public License v2.0
34 stars 3 forks source link

Why sha256 - base58 multihash is 46 characters length? #18

Open SimonLab opened 5 years ago

SimonLab commented 5 years ago

The first step of the cid decode algorithm (https://github.com/multiformats/cid#decoding-algorithm) is:

If it is 46 characters long and starts with Qm..., it's a CIDv0. Decode it as base58btc and continue to step 2

and it was not obvious to me why the character length of the CIDv0 which is a multihash is 46.

Here is my understanding of the format of the multihash:

Multihash Format

Multihashes are bytes where:

ex: 00010001 00000100 101101100 11111000 01011100 10110101

The first and second part are defined as varint (VAriable Integer). See https://github.com/multiformats/unsigned-varint and https://developers.google.com/protocol-buffers/docs/encoding#varints to understand how varints are created. In our case because there is only 1 byte for each part we can simply do a conversion from binary to decimal:

Then with the following table we can check the representation: https://github.com/multiformats/multicodec/blob/master/table.csv image

So we can deduce that the multihash 00010001 00000100 101101100 11111000 01011100 10110101 is a sha1 and the hash value length is 4 bytes or 4x8 = 32 bits sha1 normally returns a hash of 20 bytes (160 bits) but for this example we reduced the length to 4 bytes

Why CIDv0 is 46 characters length?

Now that we understand the format of a multihash we can calculate the total number of bytes in a CIDv0. a CIDv0 is a multihash where the hash function is sha2-256 (see https://github.com/multiformats/cid#versions).

From the table above we know that sha2-256 is represented with 0x12 which is 1x16^1 + 2x16^0 = 18. From the definition of varint (see links above) we have 18 < 127 so the hash function used (first part) can be encoded with 1 byte

From the definition of sha2-256 (https://en.wikipedia.org/wiki/SHA-2) we know that the hash value will be 256 bits long, ie 256/8 = 32 bytes. Again 32 < 127 so the length of the hash (part 2) is represented with 1 byte

We now have:

The the CIDv0 length will always be 34 bytes length.

The default representation of these bytes is done using base58btc. This base is build from base64 (some characters have been removed, see #2). So the multihash is represented in these based with a set of 58 characters (123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz) see https://en.wikipedia.org/wiki/Base58. So we need numbers from 0 to 57 to be able to represent this set in binary and 57 is 111001 which is 6 bits. So each digit in base58 is represented with 6 bits.

To conclude we have 34 bytes for a multihash which is *348= 272 bits and 272/6 = 45.33 so we need 46 characters to represent the multihash**

@RobStallion @nelsonic I think this is the reason why CIDv0 are 46 length, but if you think have done a mistake especially with the encoding to base58 and spliting the multihash in group of 6bits let me know :+1:

RobStallion commented 5 years ago

@SimonLab That is super interesting and very helpful to understand. Thanks 👍

nelsonic commented 5 years ago

@SimonLab good write-up on the length of CIDs. 👍 Have you taken a look at how other JS/Ex implementations are doing the Binary/Base16 to Base58BTC encoding/decoding?

SimonLab commented 5 years ago

@nelsonic this is my next step. I had a look at https://github.com/nocursor/b58 (which use multiple type of set characters, we need just one for now) and I'm going to create a simplified version now that we have manage to create a CIDv1 (in base16) matching the one from the js-cid and ex_cid packages