ethereum / solidity

Solidity, the Smart Contract Programming Language
https://soliditylang.org
GNU General Public License v3.0

IPFS hash feature uses a non-specified algorithm which is not widely compatible in the ecosystem #14389

Open Jorropo opened 1 year ago

Jorropo commented 1 year ago

It looks like you are implementing what appear to be the Kubo defaults. Those defaults are nearing 10 years old and lack the newest features we support, so I want to change them, and I am therefore poking around the ecosystem to find the places that rely on them.

https://github.com/ethereum/solidity/blob/develop/libsolutil/IpfsHash.cpp
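
For reference, the behaviour in IpfsHash.cpp can be reproduced from the command line by spelling out those defaults explicitly rather than relying on them implicitly. A minimal sketch, assuming a local Kubo node and that metadata.json stands in for the compiler's metadata output:

```sh
# Print the CID Kubo would assign, without actually adding the file (--only-hash, --quiet).
# The flags spell out the historical Kubo defaults that IpfsHash.cpp appears to mirror:
# CIDv0, dag-pb/unixfs, sha2-256, fixed 256 KiB chunks.
ipfs add --only-hash --quiet --cid-version=0 --hash=sha2-256 --chunker=size-262144 metadata.json
```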

Unixfs is an open format that allows multiple writer implementations to implement their own linking logic, such as append logs, content-aware chunking (cutting around logical boundaries in the content, such as iframes in video files, entries in archive formats, ...), more packed representations, and so on, while all of those remain automatically compatible with all reader implementations. By design, this leads to inconsistent hashes across the ecosystem; examples of implementations that produce different CIDs:

Hopefully this serves as a demonstration that unixfs is good at tailoring to use cases, not at repeatable hashing of data.
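
For anyone who wants to see the effect locally, a minimal sketch, assuming a Kubo node (metadata.json is just a placeholder for any small file):

```sh
# The same bytes, three equally "valid" unixfs encodings, three different CIDs:
ipfs add -n -q metadata.json                                     # CIDv0, dag-pb leaves (Qm...)
ipfs add -n -q --cid-version=1 metadata.json                     # CIDv1, raw leaves by default (baf...)
ipfs add -n -q --cid-version=1 --hash=blake2b-256 metadata.json  # another multihash, another CID
```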

I see 3 potential fixes:

  1. Add an option to the compiler to output a .car file (see the sketch after this list). Instead of relying on ipfs add magically producing the exact same CID, you do not run two chunkers: the solc chunker would output its blocks into an archive, and the user would then run ipfs dag import (which reads the blocks as-is instead of chunking). This is how chunkers are meant to work (this, or using some transport other than CAR).
  2. Write a proposal for a new spec for repeatable unixfs chunkers inside ipfs/specs and implement it; you could then use a single-link inline CID with metadata to embed the chunking parameters into the CID itself. The CIDs would then encode something like unixfs-balanced-chunksize-256KiB-dag-pb-leaves-... and could be fed into another implementation to reproduce the same result.
  3. Replace all the multiblock and dag-pb logic with a raw-blake3 CID. The reason we use the unixfs merkle-dag format is that, unlike plain sha256, it supports easy incremental verification and seeking (downloading random parts of a file without having to download the full file), and it has a very high exponential fanout (which allows parallel multipeer downloads). All of those features are built into some well-specified hash functions, blake3 being one of them. This removes support for the most esoteric features such as custom chunking, but in exchange adding the same file multiple times always yields the same hash. Blake3 is also used by default by the new github.com/n0-computer/iroh implementation.
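
For option 1, the handoff could look roughly like this; a sketch only, the --metadata-car flag is hypothetical and merely stands in for whatever output mechanism solc would grow:

```sh
# Hypothetical: the compiler writes the blocks it hashed into a CAR archive...
solc --metadata-car metadata.car Contract.sol   # --metadata-car is NOT an existing solc flag
# ...and the user imports those blocks verbatim, so no second chunker ever runs:
ipfs dag import metadata.car
```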

TL;DR:

You implement unixfs, which is not a specified, repeatable hash function (the same input can hash to different hashes depending on how the internal merkle data structure is built, which is use-case dependent). Given that your use case is simple, usually small, text files, I believe you should switch to plain blake3 instead, which is a well-defined, fixed merkle tree (unlike the loose merkle dag that unixfs is).
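
For small files like the metadata JSON (anything that fits into a single block), such a raw-blake3 CID is already obtainable today. A sketch, assuming a Kubo node with blake3 support as mentioned in Note 0 below:

```sh
# The plain blake3 digest of the file bytes (hex):
b3sum metadata.json | cut -d' ' -f1
# A single-block, raw-leaves add with blake3 should yield a CID wrapping exactly that
# digest (this only holds while the file fits into one chunk, so no unixfs dag is built):
ipfs add -n -q --cid-version=1 --raw-leaves --hash=blake3 metadata.json
```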

Note 0

Out of all the IPFS implementations I know, only iroh can handle blake3 incremental verification yet. Others, Kubo & friends, support blake3 but only as a dumb hash, so they still use unixfs + blake3 to handle files above the block limit (1~4MiB). We are interested in adding this capability in the future.

Note 1

Even though unixfs gives a one-to-many relationship from file bytes → CID, assuming cryptographically secure hash functions there is always a unique CID → bytes relationship.

Note 2

Blake3 might not be the best solution; what I am sure of is that relying on the random, unspecified behaviour of some old piece of software is definitely wrong. :)

kuzdogan commented 1 year ago

Thanks for raising the issue. We at Sourcify make full use of this feature of Solidity to verify contracts, including the contract metadata IPFS hash, and we make the metadata and source code available on IPFS (see the playground).

I can't fully grasp the technical details, but we ran into a similar reproducibility issue for some time when we switched our IPFS client to add with --nocopy. This changed the CID because it switched the leaves to raw leaves instead of dag-pb, IIRC.
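
That matches the documented behaviour: --nocopy implies --raw-leaves, and the raw-leaves setting alone is enough to change the CID. A quick way to confirm on any file, assuming a Kubo node:

```sh
ipfs add -n -q --cid-version=1 --raw-leaves=false metadata.json  # dag-pb leaves
ipfs add -n -q --cid-version=1 --raw-leaves=true  metadata.json  # raw leaves: different CID, same bytes
```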

At the time, we also wondered whether it would make sense to use dag-json to encode the contract metadata, which is a JSON object. From what I understand, this is a better way to encode a JSON object and would remove the potential indeterminism caused by formatting, key ordering, etc.
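
If it helps the discussion, the dag API sketch below shows roughly what that would look like; whether dag-json's canonical form removes all of the formatting and key-ordering concerns is an assumption worth verifying:

```sh
# Two differently formatted serializations of the same JSON object...
echo '{"a":1,"b":2}'        | ipfs dag put --store-codec dag-json
echo '{ "a": 1,  "b": 2 }'  | ipfs dag put --store-codec dag-json
# ...should print the same CID, since the codec re-serializes to a compact canonical form.
```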

We are also serving all the files in our repo (/ipns/repo.sourcify.dev), which is basically a filesystem of millions of small files (metadata.json + Solidity contracts). This is at times painful to manage when moving, sharing, or having others pin the repo, and we have been wondering whether we could have a more optimal structure with IPLD or something similar to a database. The repo being only a filesystem also limits us in many ways compared to a DB, where we would be able to run queries and easily get stats/analytics for the repository. While discussing how the Solidity compiler does the CID encoding, it might make sense to keep this use case in mind too.

Looking forward to your input and discussion.

bumblefudge commented 9 months ago

oh wow, just saw this thread. If Sourcify wants to design a more DB-like interface, research moving to CAR files as Jorropo suggested (the w3up CLI could be useful for testing out this approach), or anything else, feel free to ping me for help!