Open Jorropo opened 1 year ago
Thanks for raising the issue. We at Sourcify make full use of this feature of Solidity to verify contracts including the contract metadata IPFS hash and make the metadata and source codes available on IPFS (see the playground)
I can't fully grasp the technical details, but we ran into a similar reproducibility issue for some time when we switched our IPFS client to add with `--nocopy`. This changed the CID as it changed the chunking algorithm to use raw leaves instead of `dag-pb`, IIRC.
At the time, we also wondered whether it would make sense to use `dag-json` to encode the contract metadata, which is a JSON object. From what I understand, this is a better way to encode a JSON object and would remove the potential nondeterminism caused by formatting, key ordering, etc.
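To illustrate the formatting and key-ordering concern, here is a minimal Python sketch. It is not a dag-json implementation: `canonical_json` is a hypothetical helper that only approximates a canonical encoding (sorted keys, no insignificant whitespace), and the two metadata snippets are made-up examples.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def canonical_json(obj) -> bytes:
    # Assumption: sorted keys and no insignificant whitespace stand in
    # for a canonical encoding; real dag-json has additional rules.
    return json.dumps(
        obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


# Two serializations of the same JSON object (illustrative content only).
pretty = b'{\n  "language": "Solidity",\n  "version": 1\n}'
compact = b'{"version":1,"language":"Solidity"}'

# Hashing the serialized bytes directly is sensitive to formatting.
raw_hashes_differ = sha256_hex(pretty) != sha256_hex(compact)

# Hashing a canonical re-encoding is not.
canonical_hashes_match = sha256_hex(canonical_json(json.loads(pretty))) == sha256_hex(
    canonical_json(json.loads(compact))
)

print(raw_hashes_differ, canonical_hashes_match)
```

The point is only that the hash should be taken over a single agreed encoding of the data model rather than over whatever bytes a given serializer happens to emit.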
We are also serving all the files in our repo (/ipns/repo.sourcify.dev), which is basically a filesystem of millions of small files (metadata.json + Solidity contracts). This is at times painful to manage when moving or sharing it, when others pin the repo, etc., and we were wondering whether we could have a more optimal structure with IPLD or something similar to a database. The repo being only a filesystem also limits us in many ways: with a DB we would be able to run queries and easily get stats/analytics of the repository. While discussing how the Solidity compiler does the CID encoding, it might make sense to keep this use case in mind too.
Looking forward to your input and discussion.
oh wow just saw this thread. if sourcify wants to design a more DB-like interface, research moving to CAR files as Jorropo is suggesting (the w3up cli could be useful for testing out this approach), or anything else, feel free to ping me for help!
It looks like you are implementing what looks like the Kubo defaults. They are nearing 10 years old and lack the newest features we support, so I want to change those defaults, and I am poking around to see where people in the ecosystem rely on them.
https://github.com/ethereum/solidity/blob/develop/libsolutil/IpfsHash.cpp
Unixfs is an open format which allows multiple writer implementations to implement their own linking logic, such as append logs, content-aware chunking (cutting around logical boundaries in the content, such as I-frames in video files, content in archive formats, ...), more packed representations, ..., while all of those remain automatically compatible with all reader implementations. This, as designed, leads to inconsistent hashes in the ecosystem. Examples of implementations that produce different CIDs:
- github.com/Jorropo/linux2ipfs uses 2MiB raw leaves with 2MiB roots (instead of 174 links).
- github.com/ipld/go-car/cmd/car uses a different TSize logic.
- github.com/ipfs/boxo/mfs (which is available in Kubo with `ipfs files ...`) has different defaults and can produce identical files with different CIDs if you use a different list of copy, write, append, ... operations.
- github.com/filecoin-project/lotus (I believe) uses raw leaves with 1MiB chunks and 1024 links with some variant of blake2.
- web3.storage & nft.storage use raw leaves with 1MiB chunks.
- github.com/bmwiedemann/ipfs-iso-jigsaw chunks each file in an ISO separately and then concatenates the resulting files with the ISO metadata in a unixfs root, allowing different versions of similar ISOs to share the blocks for the unchanged files (incremental file updates).

Hopefully this serves as a demonstration that unixfs is good at tailoring for use cases, not at repeatable hashing of data.
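A toy sketch of why this happens, in Python. This is not unixfs: `toy_root` is a hypothetical two-level hash (hash each chunk, then hash the concatenated chunk hashes) with none of the dag-pb framing. It only shows that the chunk size, a writer-side choice, changes the root hash while the reassembled bytes stay identical.

```python
import hashlib


def chunk(data: bytes, size: int) -> list[bytes]:
    """Split data into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def toy_root(data: bytes, chunk_size: int) -> str:
    # Toy stand-in for a merkle DAG root: hash every leaf, then hash the
    # concatenation of the leaf hashes. Real unixfs adds dag-pb framing,
    # link limits, sizes, etc.
    leaf_hashes = [hashlib.sha256(leaf).digest() for leaf in chunk(data, chunk_size)]
    return hashlib.sha256(b"".join(leaf_hashes)).hexdigest()


data = bytes(range(256)) * 4096  # 1 MiB of sample data

root_256k = toy_root(data, 256 * 1024)   # four leaves
root_1m = toy_root(data, 1024 * 1024)    # a single leaf

# Same bytes, different layouts, different roots.
print(root_256k != root_1m)

# Either layout still reassembles to exactly the same file.
print(b"".join(chunk(data, 256 * 1024)) == b"".join(chunk(data, 1024 * 1024)) == data)
```

A reader can follow either layout and recover the same file, which is why the format stays interoperable even though the roots differ.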
I see 3 potential fixes:

1. [Output a] `.car` file. Basically, instead of relying on `ipfs add` magically outputting exactly the same CID, you do not run 2 chunkers: the solc chunker would output the blocks in an archive, and the user could then `ipfs dag import` it (which reads the blocks one by one instead of chunking). This is how chunkers are meant to work (this, or using some other transport than car).
2. [Write a spec in] `ipfs/specs` and implement it. You could then use a single-link inline CID with metadata to embed that into the CID. So the CIDs would encode `unixfs-balanced-chunksize-256KiB-dag-pb-leaves-...` and could be fed into another implementation to get the same result.
3. [Use a] `raw-blake3` CID. The reason we use the unixfs merkle dag format is that, unlike plain sha256, it supports easy incremental verification and seeking (downloading random parts of the file without having to download the full file), and has a very high exponential fanout (which allows parallel multipeer downloads). All of those features are built into well-specified hash functions, blake3 being one of them. This removes support for the most esoteric features like custom chunking, but in exchange adding the same file multiple times always gives the same hash. Blake3 is also used by default by the new github.com/n0-computer/iroh implementation.

TL;DR:
You implement unixfs, which is not a specified repeatable hash function (the same input can hash to different hashes depending on how the internal merkle data structure is built, which is use-case dependent). Given that your use case is simple, usually small text files, I believe you should switch to plain blake3 instead, which is a well-defined, fixed merkle tree (instead of the loose merkle DAG that unixfs is).
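For comparison, here is a minimal Python sketch of a CID that hashes the raw bytes directly, with no chunker involved. Assumptions: the Python standard library has no blake3, so sha2-256 (multihash code `0x12`) stands in for it here, and `raw_cid_v1` is a hypothetical helper name. With sha2-256 this only makes sense for data that fits in a single block.

```python
import base64
import hashlib

RAW_CODEC = 0x55  # multicodec code for "raw" bytes
SHA2_256 = 0x12   # multihash code for sha2-256 (stand-in for blake3 here)


def raw_cid_v1(data: bytes) -> str:
    """Build a CIDv1 string (base32, raw codec, sha2-256) for the given bytes."""
    digest = hashlib.sha256(data).digest()
    # multihash = <hash code><digest length><digest>; all values here are
    # below 0x80, so each varint is a single byte.
    multihash = bytes([SHA2_256, len(digest)]) + digest
    cid_bytes = bytes([0x01, RAW_CODEC]) + multihash  # version 1, raw codec
    b32 = base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
    return "b" + b32  # "b" is the multibase prefix for lowercase base32


metadata = b'{"language":"Solidity","version":1}'  # illustrative content only
cid = raw_cid_v1(metadata)
print(cid)
```

There is no chunk size, link count, or leaf type to agree on in this construction, so any implementation that follows the CID layout derives the same identifier from the same bytes.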
Note 0
Out of all the IPFS implementations I know, only iroh knows how to handle blake3 incremental verification so far. Kubo & friends support blake3, but only as a dumb hash, so they still use unixfs + blake3 to handle files above the block limit (1~4MiB). We are interested in adding this capability in the future.
Note 1
Even though there is a one-to-many `file bytes → CID` unixfs relationship, assuming cryptographically secure hash functions there is always a unique `CID → bytes` relationship.
Note 2
Blake3 might not be the best solution; what I am sure of is that relying on random unspecified behaviours of some old piece of software is definitely wrong. :)