celestiaorg / celestia-core

A fork of CometBFT
Apache License 2.0
489 stars 269 forks source link

bug: ipld plugin only works locally #153

Closed liamsi closed 3 years ago

liamsi commented 3 years ago

I can confirm that the current implementation of the ipld plugin does not work properly with bitswap as mentioned here: https://github.com/lazyledger/lazyledger-core/pull/152#discussion_r565689152

I've tested this by spinning up two droplets on DO that otherwise can retrieve data from each other (e.g. via dag get) if the Cid is a regular sha256 one but not if the namespaces are included in the cid.

It seems like defining one encoding for the nodes of the NMT: https://github.com/lazyledger/lazyledger-core/blob/b83e6766973c314991ec8ba9f6f246fc422fb605/p2p/ipld/plugin/plugin.go#L26

and then simply squeezing the namespaces into the CIDs directly: https://github.com/lazyledger/lazyledger-core/blob/b83e6766973c314991ec8ba9f6f246fc422fb605/p2p/ipld/plugin/plugin.go#L341-L345

does not work with bitswap. @Wondertan mentioned this a source for the problem (when receiving a block the cid gets recomputed locally): https://github.com/ipfs/go-bitswap/blob/47b99b1ce34a8add8e5f38cf2eec6bea1559b035/message/message.go#L217-L222 Notice that pref.Sum seems to only accept 32 bytes: https://github.com/ipfs/go-cid/blob/e530276a7008f5973e7da6640ed305ecc5825d27/cid.go#L563-L577

liamsi commented 3 years ago

I'll look into adding a custom multihasher to the plugin. If that does not work, we might need to fork bitswap/ipfs (I hope we can avoid that though).

liamsi commented 3 years ago

Unfortunately, adding a custom multihash isn't just doable via a plugin it seems. Several places check for particular good / valid hashes: e.g. Encode validates against a list of known codes: https://github.com/multiformats/go-multihash/blob/6f1ea18f1da5f7735ea31b5e2011da61c409e37f/multihash.go#L269-L270 https://github.com/multiformats/go-multihash/blob/6f1ea18f1da5f7735ea31b5e2011da61c409e37f/multihash.go#L38

This small library has a stricter list of allowed hashes: https://github.com/ipfs/go-verifcid/blob/a9d170f0d9246da3a1512396520b0d4eb7602c67/validate.go#L15-L31 and is called on pinning, garbage collection, inside the blockservice, and quite a few other places.

liamsi commented 3 years ago

Also, I'm wondering if we need to add two custom multihashes then. The leaf and the inner nodes are handled differently in terms of the namespaces that get in front of the sha256 hash in the end. It feels a bit odd as other hashers are just this, hash functions that do know nothing about what they are hashing, but here we do need to make a difference for leaves and inner nodes. Additionally, we'd need to "ignore" the max namespace as this is used for the parity shares.

Basically, we'd want these two "hashers": https://github.com/lazyledger/nmt/blob/6e8a6a5ec3fcae4015ca5dd1923ce033a0458679/internal/hasher.go#L48-L107

musalbas commented 3 years ago

Can't you assign different prefixes to inner node and leaf data? Then the hash function will know what it's hashing. That's what NebulousLabs Merkle tree does.

liamsi commented 3 years ago

Can't you assign different prefixes to inner node and leaf data?

Yeah, we currently do this anyway (in the NMT as well as in the plugin when returning RawData). Yes, we could read that prefix byte and hash accordingly adding only one hasher. I think it's a bit cleaner to have two different hashers instead of one that parses the input and decides how to hash it according to the embedded prefix. Also, @Wondertan suggested introducing two different Codecs for inner and leaf nodes and not make that byte part of that RawData here; in that case we'd definitely need two hashers).

liamsi commented 3 years ago

Here is a quick and dirty implementation of only using one hash function: https://github.com/lazyledger/go-multihash/blob/a31dec8c92fa0f526d20fbc54572953d5a08e03d/sum.go#L184

IMO, we should move most of the logic of that code to another repo and the hash function should be a few lines like the other ones in that file. That would be another good argument for two (instead of one) mulihashers, it would make these two functions even a bit cleaner.

For now, I'll keep the above version and see if everything works as expected (over the network). We can iterate on it if it does.

liamsi commented 3 years ago

Experimented with using one mh in https://github.com/lazyledger/lazyledger-core/pull/155 which additionally required these changes: https://github.com/lazyledger/go-multihash/pull/1 https://github.com/lazyledger/go-ipfs/pull/1 https://github.com/lazyledger/go-verifcid/pull/1

I can confirm that this (almost) works: querying for namespaced CID works, retrieving the leaf data doesn't. For example: ipfs dag get bagao4amb5yataazahlu5shzrnd5enehjoa6mefxqvg4nav537bp6xje5ynoa3q2yjojhnumb4afjcqvs7kv6ufho3m works as expected but actually going down the path to the leaf via ipfs dag get bagao4amb5yataazahlu5shzrnd5enehjoa6mefxqvg4nav537bp6xje5ynoa3q2yjojhnumb4afjcqvs7kv6ufho3m/0/0/0/0/0 times out (or doesn't return).

Would expect sth like below as output instead (output from node that has the data locally and doesn't need to request it)::

Click here for output: ``` $ipfs dag get bagao4amb5yataazahlu5shzrnd5enehjoa6mefxqvg4nav537bp6xje5ynoa3q2yjojhnumb4afjcqvs7kv6ufho3m/0/0/0/0/0 {"Data":"AyA66dkfMWg7d7ZrUXE1YYwJYDrYfwnJkWC9GOrQD7/Ou/gGQS8ld/nIM8O64ZK+jcvS9TXLIVuxlR47VeCgBJRxwpkU2XOMIiWbySpcwYl2mEKqyfAsj7Qa/xTyNEKlNfXVMRkWJGvO2NfzwlnJk9DH7G82MHitMMZUQM1vjw1NvTVn25adRqgaivg58BsCuAkw31EDiv8mcwfLBVyYEIc9UIxXDXpaBB5v5wcCBJYLvkRGBOrVbzer/vUdGIsj9DHSNxhkuUZGBoaiNgIaanrYpnCWF5iOsb5kCU4hl2QufjqrMexz9aVcIO/shcu32i3a4faivf0ATAxvcY1KVBPYkurwTkjL"} ```

The latter might be bc how we store the data in the leaf directly (instead of just the hashes and having one additional link), or, bc I've introduced some subtle bug when adding the mh.

Wondertan commented 3 years ago

Way fork MH if you could have just registered new hash func?

https://github.com/multiformats/go-multihash/blob/master/sum.go#229

liamsi commented 3 years ago

Probably for no good reason. It just seemed the most consistent approach, as we definitely need to fork go-verifcid to add the custom multihash to the "good set": https://github.com/lazyledger/go-verifcid/blob/master/validate.go (note that this also means we need to fork go-ipfs as it uses go-verifcid internally). And go-verifcid uses go-multihash to add the other hashes. So if we ever want to upstream this all (surely in a cleaner version), we'd likely want to fork go-multihash first. If we can avoid forking go-multihash for now, that would certainly be preferable as each fork comes with a non-negligible maintenance overhead... I'll spend the 2nd half of this week on this. Though the priority will be to understand what is going on with the leaves / retrieving the leaf data first.

Wondertan commented 3 years ago

@liamsi, can you pls specify all the use cases for IPFS you are willing to have for lazyledger? I need that to understand the picture widely and to propose other possible workarounds not to introduce, IMO, redundant forks.

liamsi commented 3 years ago

Yes, happy to summarize it here shortly. Will try do so before end of the week. In the meantime, a slightly out of date and relatively vague (and probably confusing) description can be found here: https://github.com/lazyledger/lazyledger-core/issues/35

Wondertan commented 3 years ago

@liamsi, the issue you’ve mentioned is probably enough, and I should have read it before, as I did some wrong assumptions regarding your use case and, therefore, gave slightly wrong suggestions. Going to expand on this soon.

liamsi commented 3 years ago

To very briefly summarize, we will use IPFS in the following cases:

  1. as a (LL) block store, which means storing the erasure coded version of blocks (see the image below)
  2. for Data Availability proofs: sample cells from the erasure coded version of a block (see image below), e.g. retrieve the data of CID/0/1/1/0/0 of a row or column tree and validate that it checks out with with the roots in the DataAvailabilityHeader

Ideally, we want the hashes in above data availability root to match with the CIDs (it's OK that IPFS adds a few bytes of overhead to identify the multihash etc.). That is why we want to include the namespaces in the CIDs. Also, that's why I think your suggestion to add one or multiple custom multihashers was exactly right. Actually, I do not see a way around this now. While I could get around adding a custom multihash when initially using a simple merkle tree for some very first experiments, we can't do the same if we flag the resulting sha256 hash with the min/max namespace (which we want to do as explained above).

Looking forward to your input!

On a side note, we will also use a DHT to find nodes that stored data for a particular namespace. But I have not yet thought this through but we will likely use a separate protocol on top (using libp2p).

image

liamsi commented 3 years ago

Regarding retrieving the leaves: the bug was indeed in here (cc @musalbas): https://github.com/lazyledger/go-multihash/blob/a31dec8c92fa0f526d20fbc54572953d5a08e03d/sum.go#L202-L207 These two commits (in combination) fixed it: https://github.com/lazyledger/go-multihash/commit/1743be8d8a5bf1b66b1910780599cef5313b61d8 and https://github.com/lazyledger/go-multihash/commit/88bad1265973e3d986433b60817d2e49a72812f3

Summary: when hashing the leaf of the form nid || rawData inside the NMT, we return nid || nid || sha256(leafPrefix || rawData)but as an IPFS block for a leaf we return leafPrefix || nid || rawData here. And the custom multihasher did not reflect that exact behaviour of the NMT previously but instead returned nid || nid || sha256(leafPrefix || nid || rawData) (need to add more tests).

IMO, two custom hasher would be a bit cleaner. Also moving the hashing code to the NMT repo and just registering the hasher from ll-core (instead of forking go-multihash) would indeed be preferable too (as @Wondertan suggested above).

liamsi commented 3 years ago

On a side note, we will also use a DHT to find nodes that stored data for a particular namespace. But I have not yet thought this through but we will likely use a separate protocol on top (using libp2p).

As @musalbas mentioned on discord, for a start we can simply use the same mechanism as for DA proofs:

The client does a "search " of tree on IPFS to find the leafs for a namespace, which should be O(log(n)) As each node has its min and max namespace id