Issues with IPFS routing

simonwo commented 1 year ago

We have observed some unfortunate behaviour with a private IPFS cluster that means results are not downloadable. The issue appears to be that nodes in the same swarm do not discover content held on each other.

An example setup:

Two separate machines – one running Bacalhau in requester mode, one in compute mode, both with kubo IPFS nodes
The IPFS nodes have each other listed in their peering config
Run a Bacalhau job which will generate some result on the compute node
Run bacalhau get with the IPFS swarm address of the requester node

Expected: results are downloaded successfully, as the requester IPFS node is able to get the data from the compute IPFS node
Observed: download times out, as the requester is seemingly unable to find the data
Observed: problem disappears if ipfs pin is used to explicitly and permanently cache the data on the node

And/or: run bacalhau get with the IPFS swarm address of the compute node

Expected: results are downloaded successfully
Observed: download times out even though this should be directly where the data is stored (??)

Hints: the IPFS node bootstrap config might be an issue; by default it points at public nodes. It may be worth leaving it empty or setting it to the same peers as the peering list.
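A minimal sketch of the second hint (making the bootstrap list mirror the peering list), working on a kubo config that has already been loaded as a dict. It assumes the usual config layout, where Bootstrap is a list of multiaddrs ending in /p2p/<peer ID> and Peering.Peers is a list of entries with ID and Addrs. The function names are hypothetical and the peer ID in the usage note is a placeholder.

```python
from typing import Any, Dict, List


def bootstrap_from_peering(config: Dict[str, Any]) -> List[str]:
    """Build bootstrap multiaddrs from the Peering.Peers entries."""
    # Peering or Peering.Peers may be absent or null in a default config.
    peers = (config.get("Peering") or {}).get("Peers") or []
    addrs: List[str] = []
    for peer in peers:
        for addr in peer.get("Addrs", []):
            # Assumption: Addrs entries do not already carry a /p2p/ suffix.
            addrs.append(f"{addr.rstrip('/')}/p2p/{peer['ID']}")
    return addrs


def with_private_bootstrap(config: Dict[str, Any],
                           mirror_peering: bool = True) -> Dict[str, Any]:
    """Return a copy of the config whose Bootstrap list has no public nodes.

    mirror_peering=True  -> Bootstrap is set to the peering list
    mirror_peering=False -> Bootstrap is left empty
    """
    updated = dict(config)  # shallow copy; the input dict is not modified
    updated["Bootstrap"] = bootstrap_from_peering(config) if mirror_peering else []
    return updated
```

For example, a config whose Peering.Peers contains {"ID": "12D3KooWExamplePeer", "Addrs": ["/ip4/10.0.0.2/tcp/4001"]} would get a single bootstrap entry, /ip4/10.0.0.2/tcp/4001/p2p/12D3KooWExamplePeer. Whether this actually resolves the routing problem described above has not been confirmed.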

Context:

https://bacalhauproject.slack.com/archives/C055U60AJKE/p1697211889274979

alaric-rd commented 1 year ago

Observations so far:

kubo likes to silently fail, or at best give unhelpful error messages. Tinkering with my own cluster, I've seen situations where one node accepts TCP connections from another node, then immediately closes them. The node logs nothing to indicate why, and the connecting peer logs a confusing message about a problem with a security nonce (because the first thing it tries to read from the connection is a nonce string, which fails as it's been closed).
If there's a connectivity problem within IPFS, it will tend to hang and time out looking for content rather than giving up quickly. This is appropriate for public IPFS, where DHT nodes may be flaky, but on a private IPFS cluster it's a bit of a pain (and, combined with the previous point, a hang is often the first sign that your cluster isn't peering properly). It would be nice to tune kubo for a private cluster, with smaller timeouts for a start, and probably many other tunables. For instance, a private cluster may well involve one (or two or three, for redundancy) well-known central IPFS nodes that everyone else peers with, acting as a central transfer point.
Sometimes, deleting my IPFS data dir and running init again fixes a node that won't peer nicely. Some kind of persistent state lurks in the data dir.
To prevent a node from connecting to the public IPFS network, ipfs bootstrap rm --all must be run on every node after ipfs init. If the node has already started up before that was done, it will remember the peers it has already found. So it's easy to end up peering with the public IPFS network by accident, and then it's hard to break that peering again.
Using ipfs swarm connect to tell a node about a peer after startup seems to work better than adding the peer as a bootstrap peer, but I've not yet investigated why.
bacalhau get will merrily try to talk to public IPFS unless you pass --ipfs-connect, even if the bacalhau cluster has a private IPFS of its own. It would be helpful if bacalhau get was more aware of where bacalhau had actually put the results of a job and went looking in the right places on its own.
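The ordering in the bootstrap and swarm connect observations above could be scripted roughly as follows. This is only a sketch: the helper names are hypothetical, it assumes the ipfs binary is on PATH, and it assumes the daemon is started by some other means between the two phases (ipfs swarm connect needs a running daemon).

```python
import subprocess
from typing import Callable, List, Sequence

Command = List[str]


def pre_start_commands(fresh_repo: bool) -> List[Command]:
    """Commands to run before the daemon is started for the first time."""
    commands: List[Command] = []
    if fresh_repo:
        commands.append(["ipfs", "init"])
    # Drop the default (public) bootstrap peers before the node can learn
    # about, and remember, any public peers.
    commands.append(["ipfs", "bootstrap", "rm", "--all"])
    return commands


def post_start_commands(peer_addrs: Sequence[str]) -> List[Command]:
    """Commands to run once the daemon is up: dial each private peer."""
    return [["ipfs", "swarm", "connect", addr] for addr in peer_addrs]


def run_all(commands: Sequence[Command],
            runner: Callable[..., object] = subprocess.run) -> None:
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        runner(command, check=True)
```

Calling run_all(pre_start_commands(fresh_repo=True)) on a new machine, starting the daemon, and then calling run_all(post_start_commands([...])) with the multiaddrs of the other cluster nodes follows the order described above. The runner argument exists so the plan can be inspected or dry-run without touching a real node.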

I've not tried fixing a partial-connectivity problem by pinning an object on another node. If that worked, I presume it was due to a situation where connectivity was not transitive, so pinning an object on an intermediate node that can see both the source node and a requesting node made the intermediate node fetch it from the source, so the requesting node could then see it. However, the long-term fix to such issues is clearly not to pin everything everywhere, as pinning objects reserves space for them on the pinning node, unless we also have a mechanism to unpin them at some point!

We had some IPFS complaints that were reportedly fixed by opening firewall ports (I gather port 4001 had been opened for ingress, but not for egress). I recommend that, if such IPFS problems are found again, we record the output of ipfs swarm peers on every node, along with the other particulars of the problem experienced, and look deeper into it.

In the longer run, we may consider: tweaking the kubo installation in our packaged installations to better fit a private-cluster setup; improving kubo's error reporting and submitting PRs upstream; or giving bacalhau diagnostic capabilities to test whether IPFS is presenting a single-system image (e.g. try to ipfs get the same CID from every node in the cluster, and return the result of ipfs swarm peers for all nodes), so we can progress more rapidly from "bacalhau get hangs" to a diagnosis.
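As a rough illustration of the swarm-peers part of such a diagnostic, the sketch below takes the ipfs swarm peers output collected from each node and lists the directed links that are missing from a full mesh. It assumes each output line is a multiaddr ending in /p2p/<peer ID> (older releases may print /ipfs/ instead, which this does not handle), and that the outputs have already been gathered and keyed by each node's own peer ID. The function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def peer_ids_from_swarm_peers(output: str) -> Set[str]:
    """Extract the remote peer IDs from `ipfs swarm peers` output."""
    ids: Set[str] = set()
    for line in output.splitlines():
        line = line.strip()
        if "/p2p/" in line:
            # The peer ID is the component after the last /p2p/ segment.
            ids.add(line.rsplit("/p2p/", 1)[1].split("/")[0])
    return ids


def missing_links(swarm_outputs: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (node, peer) pairs where node does not list peer.

    swarm_outputs maps each node's own peer ID to the text that
    `ipfs swarm peers` printed on that node.
    """
    missing: List[Tuple[str, str]] = []
    for node_id in sorted(swarm_outputs):
        seen = peer_ids_from_swarm_peers(swarm_outputs[node_id])
        for other_id in sorted(swarm_outputs):
            if other_id != node_id and other_id not in seen:
                missing.append((node_id, other_id))
    return missing
```

An empty result means every node lists every other node. A non-empty result points at the non-transitive situation described earlier, for example two nodes that each see a central node but not each other. It says nothing about why the link is missing, so firewall rules and bootstrap state would still need checking by hand.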

wdbaruni commented 5 months ago

The embedded IPFS node has been deprecated in favour of connecting to your own node: https://github.com/bacalhau-project/bacalhau/pull/4061

bacalhau-project / bacalhau

Issues with IPFS routing #2948