Closed simonwo closed 5 months ago
Observations so far:
ipfs bootstrap rm --all must be run on every node after ipfs init to prevent connecting to the public IPFS network, but if a node has already started up without that having happened, it will remember the peers it has already found. So it's easy to end up peering with the public IPFS network by accident, and then it's hard to break that peering again.
Using ipfs swarm connect to tell a node about a peer after startup seems to work better than adding the peer as a bootstrap peer, but I've not investigated why yet.
bacalhau get will merrily try to talk to public IPFS unless you pass --ipfs-connect, even if the bacalhau cluster has a private IPFS of its own. It would be helpful if bacalhau get were more aware of where bacalhau had actually put the results of a job and went looking in the right places on its own.
I've not tried fixing a partial-connectivity problem by pinning an object on another node. If that worked, I presume it was due to a situation where connectivity was not transitive, so pinning an object on an intermediate node that can see both the source node and a requesting node made the intermediate node fetch it from the source, so the requesting node could then see it. However, the long-term fix to such issues is clearly not to pin everything everywhere, as pinning objects reserves space for them on the pinning node, unless we also have a mechanism to unpin them at some point!
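For reference, the commands from the observations above could be wrapped up roughly like this. It's a minimal sketch, assuming the kubo ipfs CLI and bacalhau are on the PATH. The function names are made up for illustration, and the value passed to --ipfs-connect is assumed to be the multiaddr of the private node's API.

```sh
# Sketch only: function names and argument values are illustrative.

# Initialise a node and drop the default public bootstrap peers
# before the daemon is ever started.
init_private_node() {
    ipfs init && ipfs bootstrap rm --all
}

# With the daemon running, introduce a private peer directly.
# $1: peer multiaddr, e.g. /ip4/10.0.0.2/tcp/4001/p2p/<peer-id>
connect_private_peer() {
    ipfs swarm connect "$1"
}

# Fetch job results via the private IPFS node instead of public IPFS.
# $1: job ID; $2: value for --ipfs-connect (assumed here to be the
# multiaddr of the private node's API).
get_results_privately() {
    bacalhau get "$1" --ipfs-connect "$2"
}
```

Note that init_private_node only helps on a fresh node; as described above, a node that has already talked to the public network will remember the peers it found.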
We had some IPFS complaints that were reportedly fixed by opening firewall ports (port 4001 had been opened for ingress but not for egress, I gather). I recommend that, if such IPFS problems are found again, we record the output of ipfs swarm peers on every node along with the other particulars of the problem experienced, and look deeper into it.
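Capturing the peer list could be as simple as the following sketch, run on each node when a problem is seen. The function name and the file naming scheme are assumptions for illustration only.

```sh
# Sketch: snapshot the local node's peer list for a problem report.
# record_swarm_peers and the file layout are illustrative, not an existing tool.
# $1: directory to write the snapshot into.
record_swarm_peers() {
    out_dir="$1"
    mkdir -p "$out_dir" || return 1
    # Name the file after the host and a UTC timestamp so that
    # snapshots from several nodes can be collected side by side.
    out_file="$out_dir/swarm-peers-$(uname -n)-$(date -u +%Y%m%dT%H%M%SZ).txt"
    ipfs swarm peers > "$out_file" || return 1
    echo "$out_file"
}
```

An empty file is itself a useful data point, since it means the node had no peers at all at the time.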
We may, in the longer run, consider tweaks to our installation of kubo in packaged installations to better fit a private-cluster setup, improving kubo's error reporting and submitting PRs upstream, or giving bacalhau diagnostic capabilities to test whether IPFS is presenting a single-system image (e.g., try to ipfs get the same CID from every node in the cluster, and return the output of ipfs swarm peers for all nodes) so we can progress more rapidly from "bacalhau get hangs" to a diagnosis.
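Until bacalhau has such a diagnostic built in, the check described above could be approximated from a shell. This is a rough sketch, assuming each node's kubo API is reachable from wherever it runs; the function name, the 30s timeout, and the output format are all illustrative choices.

```sh
# Sketch: try to fetch one CID through every node's API and show each
# node's peer list. check_cid_everywhere is a hypothetical helper.
# Usage: check_cid_everywhere CID API_MULTIADDR...
check_cid_everywhere() {
    cid="$1"; shift
    status=0
    tmp=$(mktemp -d) || return 1
    for api in "$@"; do
        # --api selects which node's API to talk to; --timeout stops a
        # node that cannot find the content from hanging the whole check.
        if ipfs --api "$api" --timeout 30s get -o "$tmp/out" "$cid" > /dev/null 2>&1; then
            echo "$api: OK"
        else
            echo "$api: FAILED to fetch $cid"
            status=1
        fi
        rm -rf "$tmp/out"
        ipfs --api "$api" swarm peers | sed 's/^/    peer: /'
    done
    rm -rf "$tmp"
    return "$status"
}
```

A non-zero exit status means at least one node could not retrieve the CID, and the peer lists printed under each node show who that node could actually see.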
The embedded IPFS node has been deprecated in favour of connecting to your own node: https://github.com/bacalhau-project/bacalhau/pull/4061
We have observed some unfortunate behaviour with a private IPFS cluster that means results are not downloadable. The issue appears to be that nodes that are in the same swarm as each other don't appear to discover content on each other.
An example setup:
Expected: results are downloaded successfully as requester IPFS node is able to get the data from the compute IPFS node
Observed: download times out as requester is seemingly unable to find the data
Observed: problem disappears if ipfs pin is used to explicitly and permanently cache the data on the node

Expected: results are downloaded successfully
Observed: download times out even though this should be directly where the data is stored (??)
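The ipfs pin workaround mentioned above might look like the sketch below. The function names are hypothetical, and the CID has to be supplied by the caller. The unpin step is included because a pin holds the data on the node until it is removed.

```sh
# Sketch of the pinning workaround; function names are illustrative.

# Pin a result CID on this node so the node fetches and keeps a copy.
# $1: CID of the job result.
pin_result() {
    ipfs pin add "$1"
}

# Remove the pin once the result has been downloaded, so the node's
# garbage collection is free to reclaim the space.
# $1: CID of the job result.
unpin_result() {
    ipfs pin rm "$1"
}
```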
Hints: the IPFS node bootstrap config might be an issue, since by default it points at public nodes. It may be worth leaving it empty or setting it to the same as the peering list.
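One way to try this hint is sketched below, assuming the peering list is available as a set of peer multiaddrs. The function name is hypothetical; calling it with no arguments leaves the bootstrap list empty.

```sh
# Sketch: replace the default bootstrap list with the private peers.
# reset_bootstrap is an illustrative name, not an existing command.
# Usage: reset_bootstrap [PEER_MULTIADDR...]
reset_bootstrap() {
    # Drop every existing entry, including the public defaults.
    ipfs bootstrap rm --all || return 1
    # Add back only the peers we were given (possibly none).
    for peer in "$@"; do
        ipfs bootstrap add "$peer" || return 1
    done
    # Print the resulting list so it can be checked by eye.
    ipfs bootstrap list
}
```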
Context: