Closed liamsi closed 3 years ago
I'll look into adding a custom multihasher to the plugin. If that does not work, we might need to fork bitswap/ipfs (I hope we can avoid that though).
Unfortunately, adding a custom multihash isn't just doable via a plugin it seems. Several places check for particular good / valid hashes: e.g. Encode validates against a list of known codes: https://github.com/multiformats/go-multihash/blob/6f1ea18f1da5f7735ea31b5e2011da61c409e37f/multihash.go#L269-L270 https://github.com/multiformats/go-multihash/blob/6f1ea18f1da5f7735ea31b5e2011da61c409e37f/multihash.go#L38
This small library has a stricter list of allowed hashes: https://github.com/ipfs/go-verifcid/blob/a9d170f0d9246da3a1512396520b0d4eb7602c67/validate.go#L15-L31 and is called on pinning, garbage collection, inside the blockservice, and quite a few other places.
Also, I'm wondering if we need to add two custom multihashes then. The leaf and the inner nodes are handled differently in terms of the namespaces that get in front of the sha256 hash in the end. It feels a bit odd as other hashers are just this, hash functions that do know nothing about what they are hashing, but here we do need to make a difference for leaves and inner nodes. Additionally, we'd need to "ignore" the max namespace as this is used for the parity shares.
Basically, we'd want these two "hashers": https://github.com/lazyledger/nmt/blob/6e8a6a5ec3fcae4015ca5dd1923ce033a0458679/internal/hasher.go#L48-L107
Can't you assign different prefixes to inner node and leaf data? Then the hash function will know what it's hashing. That's what NebulousLabs Merkle tree does.
Can't you assign different prefixes to inner node and leaf data?
Yeah, we currently do this anyway (in the NMT as well as in the plugin when returning RawData). Yes, we could read that prefix byte and hash accordingly adding only one hasher. I think it's a bit cleaner to have two different hashers instead of one that parses the input and decides how to hash it according to the embedded prefix. Also, @Wondertan suggested introducing two different Codecs for inner and leaf nodes and not make that byte part of that RawData here; in that case we'd definitely need two hashers).
Here is a quick and dirty implementation of only using one hash function: https://github.com/lazyledger/go-multihash/blob/a31dec8c92fa0f526d20fbc54572953d5a08e03d/sum.go#L184
IMO, we should move most of the logic of that code to another repo and the hash function should be a few lines like the other ones in that file. That would be another good argument for two (instead of one) mulihashers, it would make these two functions even a bit cleaner.
For now, I'll keep the above version and see if everything works as expected (over the network). We can iterate on it if it does.
Experimented with using one mh in https://github.com/lazyledger/lazyledger-core/pull/155 which additionally required these changes: https://github.com/lazyledger/go-multihash/pull/1 https://github.com/lazyledger/go-ipfs/pull/1 https://github.com/lazyledger/go-verifcid/pull/1
I can confirm that this (almost) works: querying for namespaced CID works, retrieving the leaf data doesn't.
For example:
ipfs dag get bagao4amb5yataazahlu5shzrnd5enehjoa6mefxqvg4nav537bp6xje5ynoa3q2yjojhnumb4afjcqvs7kv6ufho3m
works as expected but actually going down the path to the leaf via
ipfs dag get bagao4amb5yataazahlu5shzrnd5enehjoa6mefxqvg4nav537bp6xje5ynoa3q2yjojhnumb4afjcqvs7kv6ufho3m/0/0/0/0/0
times out (or doesn't return).
Would expect sth like below as output instead (output from node that has the data locally and doesn't need to request it)::
The latter might be bc how we store the data in the leaf directly (instead of just the hashes and having one additional link), or, bc I've introduced some subtle bug when adding the mh.
Way fork MH if you could have just registered new hash func?
https://github.com/multiformats/go-multihash/blob/master/sum.go#229
Probably for no good reason. It just seemed the most consistent approach, as we definitely need to fork go-verifcid to add the custom multihash to the "good set": https://github.com/lazyledger/go-verifcid/blob/master/validate.go (note that this also means we need to fork go-ipfs as it uses go-verifcid internally). And go-verifcid uses go-multihash to add the other hashes. So if we ever want to upstream this all (surely in a cleaner version), we'd likely want to fork go-multihash first. If we can avoid forking go-multihash for now, that would certainly be preferable as each fork comes with a non-negligible maintenance overhead... I'll spend the 2nd half of this week on this. Though the priority will be to understand what is going on with the leaves / retrieving the leaf data first.
@liamsi, can you pls specify all the use cases for IPFS you are willing to have for lazyledger? I need that to understand the picture widely and to propose other possible workarounds not to introduce, IMO, redundant forks.
Yes, happy to summarize it here shortly. Will try do so before end of the week. In the meantime, a slightly out of date and relatively vague (and probably confusing) description can be found here: https://github.com/lazyledger/lazyledger-core/issues/35
@liamsi, the issue you’ve mentioned is probably enough, and I should have read it before, as I did some wrong assumptions regarding your use case and, therefore, gave slightly wrong suggestions. Going to expand on this soon.
To very briefly summarize, we will use IPFS in the following cases:
CID/0/1/1/0/0
of a row or column tree and validate that it checks out with with the roots in the DataAvailabilityHeaderIdeally, we want the hashes in above data availability root to match with the CIDs (it's OK that IPFS adds a few bytes of overhead to identify the multihash etc.). That is why we want to include the namespaces in the CIDs. Also, that's why I think your suggestion to add one or multiple custom multihashers was exactly right. Actually, I do not see a way around this now. While I could get around adding a custom multihash when initially using a simple merkle tree for some very first experiments, we can't do the same if we flag the resulting sha256 hash with the min/max namespace (which we want to do as explained above).
Looking forward to your input!
On a side note, we will also use a DHT to find nodes that stored data for a particular namespace. But I have not yet thought this through but we will likely use a separate protocol on top (using libp2p).
Regarding retrieving the leaves: the bug was indeed in here (cc @musalbas): https://github.com/lazyledger/go-multihash/blob/a31dec8c92fa0f526d20fbc54572953d5a08e03d/sum.go#L202-L207 These two commits (in combination) fixed it: https://github.com/lazyledger/go-multihash/commit/1743be8d8a5bf1b66b1910780599cef5313b61d8 and https://github.com/lazyledger/go-multihash/commit/88bad1265973e3d986433b60817d2e49a72812f3
Summary: when hashing the leaf of the form nid || rawData
inside the NMT, we return nid || nid || sha256(leafPrefix || rawData)
but as an IPFS block for a leaf we return leafPrefix || nid || rawData
here. And the custom multihasher did not reflect that exact behaviour of the NMT previously but instead returned nid || nid || sha256(leafPrefix || nid || rawData)
(need to add more tests).
IMO, two custom hasher would be a bit cleaner. Also moving the hashing code to the NMT repo and just registering the hasher from ll-core (instead of forking go-multihash) would indeed be preferable too (as @Wondertan suggested above).
On a side note, we will also use a DHT to find nodes that stored data for a particular namespace. But I have not yet thought this through but we will likely use a separate protocol on top (using libp2p).
As @musalbas mentioned on discord, for a start we can simply use the same mechanism as for DA proofs:
The client does a "search " of tree on IPFS to find the leafs for a namespace, which should be O(log(n)) As each node has its min and max namespace id
I can confirm that the current implementation of the ipld plugin does not work properly with bitswap as mentioned here: https://github.com/lazyledger/lazyledger-core/pull/152#discussion_r565689152
I've tested this by spinning up two droplets on DO that otherwise can retrieve data from each other (e.g. via dag get) if the Cid is a regular sha256 one but not if the namespaces are included in the cid.
It seems like defining one encoding for the nodes of the NMT: https://github.com/lazyledger/lazyledger-core/blob/b83e6766973c314991ec8ba9f6f246fc422fb605/p2p/ipld/plugin/plugin.go#L26
and then simply squeezing the namespaces into the CIDs directly: https://github.com/lazyledger/lazyledger-core/blob/b83e6766973c314991ec8ba9f6f246fc422fb605/p2p/ipld/plugin/plugin.go#L341-L345
does not work with bitswap. @Wondertan mentioned this a source for the problem (when receiving a block the cid gets recomputed locally): https://github.com/ipfs/go-bitswap/blob/47b99b1ce34a8add8e5f38cf2eec6bea1559b035/message/message.go#L217-L222
Notice that pref.Sum seems to only accept 32 bytes:https://github.com/ipfs/go-cid/blob/e530276a7008f5973e7da6640ed305ecc5825d27/cid.go#L563-L577