bacalhau-project / bacalhau

Compute over Data framework for public, transparent, and optionally verifiable computation
https://docs.bacalhau.org
Apache License 2.0

Issues with IPFS routing #2948

Closed simonwo closed 1 day ago

simonwo commented 8 months ago

We have observed a problem with a private IPFS cluster that makes results impossible to download. Nodes that are in the same swarm do not appear to discover content held by each other.

An example setup:

Expected: results are downloaded successfully, as the requester IPFS node is able to get the data from the compute IPFS node.
Observed: download times out, as the requester is seemingly unable to find the data.
Observed: the problem disappears if ipfs pin is used to explicitly and permanently cache the data on the node.

Expected: results are downloaded successfully.
Observed: download times out even though this should be directly where the data is stored (??)

Hints: the IPFS node bootstrap config might be an issue; by default it points at public nodes. It may be worth leaving it empty or setting it to the same as the peering list.
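As a rough illustration of the second option in the hint above, the following Python sketch rewrites a kubo config so that Bootstrap mirrors Peering.Peers. It assumes the usual kubo config layout (Bootstrap as a list of multiaddrs ending in /p2p/<peer ID>, Peering.Peers as a list of objects with ID and Addrs). The peer ID and addresses in the example are hypothetical, and whether this actually resolves the routing problem has not been verified here.

```python
import json


def align_bootstrap_with_peering(config: dict) -> dict:
    """Return a copy of a kubo config whose Bootstrap list mirrors Peering.Peers."""
    # Peering.Peers may be absent or null in a default config.
    peers = (config.get("Peering") or {}).get("Peers") or []
    bootstrap = []
    for peer in peers:
        for addr in peer.get("Addrs", []):
            # Bootstrap entries are full multiaddrs ending in /p2p/<peer ID>.
            bootstrap.append(f"{addr}/p2p/{peer['ID']}")
    updated = dict(config)  # shallow copy; the input config is left untouched
    updated["Bootstrap"] = bootstrap
    return updated


# Hypothetical config fragment: one public bootstrap entry, one private peer.
example = {
    "Bootstrap": ["/dnsaddr/bootstrap.libp2p.io/p2p/QmPublicExample"],
    "Peering": {
        "Peers": [
            {"ID": "12D3KooWExampleCompute", "Addrs": ["/ip4/10.0.0.2/tcp/4001"]}
        ]
    },
}

print(json.dumps(align_bootstrap_with_peering(example)["Bootstrap"], indent=2))
```

A config with no peering entries yields an empty Bootstrap list, which corresponds to the "leave it empty" option.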

Context:

alaric-rd commented 7 months ago

Observations so far:

I've not tried fixing a partial-connectivity problem by pinning an object on another node. If that worked, I presume it was because connectivity was not transitive: pinning an object on an intermediate node that can see both the source node and a requesting node made the intermediate node fetch it from the source, so the requesting node could then see it. However, the long-term fix to such issues is clearly not to pin everything everywhere, as pinning objects reserves space for them on the pinning node, unless we also have a mechanism to unpin them at some point!

We had some IPFS complaints that were reportedly fixed by opening firewall ports (I gather port 4001 had been opened for ingress, but not for egress). If such IPFS problems are found again, I recommend that we record the output of ipfs swarm peers on every node, along with the other particulars of the problem experienced, and look deeper into it.
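Once those ipfs swarm peers outputs have been recorded, a small script can point out which pairs of cluster nodes are not directly connected. The Python sketch below assumes each node's output was saved as text and keyed by that node's own peer ID; the peer IDs and addresses shown are hypothetical, and the example reproduces the non-transitive case described earlier (source and requester both see an intermediate node but not each other).

```python
from itertools import combinations


def peer_ids(swarm_peers_output: str) -> set:
    """Extract peer IDs from the text printed by `ipfs swarm peers`."""
    ids = set()
    for line in swarm_peers_output.splitlines():
        parts = line.strip().split("/")
        # Lines look like /ip4/10.0.0.2/tcp/4001/p2p/<peer ID>;
        # older versions print /ipfs/<peer ID> instead of /p2p/<peer ID>.
        if len(parts) >= 2 and parts[-2] in ("p2p", "ipfs"):
            ids.add(parts[-1])
    return ids


def missing_links(peers_by_node: dict) -> list:
    """Return pairs of cluster nodes where neither side lists the other as a peer."""
    missing = []
    for a, b in combinations(sorted(peers_by_node), 2):
        if b not in peers_by_node[a] and a not in peers_by_node[b]:
            missing.append((a, b))
    return missing


# Hypothetical recorded outputs, keyed by each node's own peer ID.
recorded = {
    "12D3KooWSource": "/ip4/10.0.0.2/tcp/4001/p2p/12D3KooWIntermediate\n",
    "12D3KooWIntermediate": (
        "/ip4/10.0.0.1/tcp/4001/p2p/12D3KooWSource\n"
        "/ip4/10.0.0.3/tcp/4001/p2p/12D3KooWRequester\n"
    ),
    "12D3KooWRequester": "/ip4/10.0.0.2/tcp/4001/p2p/12D3KooWIntermediate\n",
}

peers_by_node = {node: peer_ids(out) for node, out in recorded.items()}
print(missing_links(peers_by_node))
```

A non-empty result does not prove a fault on its own, but it narrows down which firewall rules or peering entries to inspect first.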

We may, in the longer run, consider tweaking our installation of kubo in packaged installations to better fit a private-cluster setup, improving kubo's error reporting and submitting PRs upstream, or giving bacalhau diagnostic capabilities to test whether IPFS is presenting a single-system image (e.g., try to ipfs get the same CID from every node in the cluster, and return the result of ipfs swarm peers for all nodes) so we can more rapidly progress from "bacalhau get hangs" to a diagnosis.
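A minimal Python sketch of such a single-system-image check might look like the following. It is not an existing bacalhau feature. It assumes the ipfs CLI is installed where the script runs and that each node's RPC API address is reachable from there. It also swaps ipfs get for the lighter ipfs block stat, which only confirms that the root block of the CID can be resolved, not the whole object. The function names and the `run` parameter (used to substitute the command runner) are hypothetical.

```python
import subprocess


def can_fetch(api_addr: str, cid: str, timeout_s: float = 30.0, run=subprocess.run) -> bool:
    """Return True if the node behind api_addr resolves the root block of cid in time."""
    # --api points the ipfs CLI at a specific node's RPC API multiaddr.
    cmd = ["ipfs", "--api", api_addr, "block", "stat", cid]
    try:
        result = run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # A hang is the symptom under investigation, so treat it as a failure.
        return False
    return result.returncode == 0


def single_system_image(api_addrs: list, cid: str, **kwargs) -> dict:
    """Map each node's API address to whether it could resolve cid."""
    return {addr: can_fetch(addr, cid, **kwargs) for addr in api_addrs}
```

For example, calling single_system_image(["/ip4/10.0.0.1/tcp/5001", "/ip4/10.0.0.3/tcp/5001"], cid) with a result CID and placeholder addresses would show which nodes cannot see the data; any False entry is a candidate for a closer look at that node's ipfs swarm peers output.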

wdbaruni commented 1 day ago

The embedded IPFS node has been deprecated in favour of connecting to your own node: https://github.com/bacalhau-project/bacalhau/pull/4061