Closed v-stickykeys closed 3 years ago
@zachferland Tried this again today and confirmed: we need steps 2 and 3 as mentioned above, and then it begins to work
Also noticed that when I made a change to infra pubsub stopped working.
In this case I changed the healthcheck endpoint for the ceramic ALBs; the ipfs nodes were not restarted, but the ceramic instances were when the change was applied.
Then I had to restart the ceramic tasks again and it started working.
hmm yeah we should definitely not need to delete config again (or every time), and it's weird that there would be any issue making changes/updating ceramic alone. I suspect there are still other compounding issues. But the issue described in discord definitely holds.
The ceramic node only calls subscribe on start; if the ipfs node restarts after that, the ipfs node will no longer be subscribed to that topic and the ceramic node will no longer have a handler/connection to receive messages. Confirmed that behavior today (as expected). Even worse, in the ceramic node, subscribe will not visibly fail or emit any events/errors if that connection is closed (i.e. ipfs down); it will just stop 'receiving messages'.
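The silent failure described above can be sketched with a toy in-memory stand-in (`FakeIpfsPubsub` is illustrative, not the real ipfs-http-client API): subscribe is only called once, a restart wipes the subscription, and later publishes are dropped with no error.

```js
// Toy model of the failure mode: a restart clears subscriptions,
// and publishes to an unsubscribed topic vanish silently.
class FakeIpfsPubsub {
  constructor() { this.handlers = new Map() }
  subscribe(topic, handler) { this.handlers.set(topic, handler) }
  publish(topic, msg) {
    const h = this.handlers.get(topic)
    if (h) h(msg) // no handler -> message silently dropped, no error thrown
  }
  restart() { this.handlers.clear() } // restart loses all subscriptions
  ls() { return [...this.handlers.keys()] }
}

const ipfs = new FakeIpfsPubsub()
const received = []
ipfs.subscribe('/ceramic', (msg) => received.push(msg))

ipfs.publish('/ceramic', 'a') // delivered
ipfs.restart()                // subscription lost, no event emitted
ipfs.publish('/ceramic', 'b') // silently dropped

console.log(received) // [ 'a' ]
```

The second publish is lost without any signal to the subscriber, which is exactly why the ceramic node just "stops receiving messages".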
Looked into it more, and unfortunately there doesn't seem to be any great solution. There is no available event or error. We can't even poll for topics to see if we're still subscribed, because there can also be cases where the ipfs node is subscribed to the topic again (from another client), but the subscribe handler in the ceramic node is no longer receiving messages, and there is no easy way to determine that. I think an error handler/event or something should likely be added in the ipfs http api lib so that the client knows it should start attempting reconnects.
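A sketch of why polling for topics is not enough, again with a hypothetical in-memory node (`FakeIpfsNode` is illustrative): after a restart, another client's resubscribe makes `ls()` look healthy even though the ceramic-side handler is gone.

```js
// Toy model: ls() reports the topic as subscribed, but the original
// client's handler was dropped by the restart and never fires again.
class FakeIpfsNode {
  constructor() { this.subs = new Map() } // topic -> Set of handlers
  subscribe(topic, handler) {
    if (!this.subs.has(topic)) this.subs.set(topic, new Set())
    this.subs.get(topic).add(handler)
  }
  publish(topic, msg) {
    for (const h of this.subs.get(topic) || []) h(msg)
  }
  restart() { this.subs.clear() }
  ls() { return [...this.subs.keys()] }
}

const node = new FakeIpfsNode()
const ceramicInbox = []
node.subscribe('/ceramic', (m) => ceramicInbox.push(m)) // ceramic's handler

node.restart() // ceramic's subscription silently dropped

const otherInbox = []
node.subscribe('/ceramic', (m) => otherInbox.push(m)) // another client resubscribes

node.publish('/ceramic', 'update')
console.log(node.ls().includes('/ceramic')) // true -> polling says "subscribed"
console.log(ceramicInbox.length)            // 0    -> but ceramic got nothing
```

So a topic showing up in `ls()` tells you the node is subscribed, not that *your* handler/connection is still alive.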
Should this be raised to the js-ipfs team?
From testing today I was still able to receive pubsub messages in js-ceramic after stopping the ipfs node, IF I continuously sent messages to the topic it was subscribed to. When I stopped sending messages and then started again after a ~30s delay, it stopped receiving them. Also, the node that I ran locally to send the messages continued to list the infra node as a peer, with its peer ID, even after the infra node was down.
This seems to indicate that the connection only stays alive while messages are flowing, though I don't think this is a full picture of what is happening still.
@zachferland yeah it seems you are right actually. Deleting config is not a necessary step; I think it was more coincidental. Tried today, and restarting the instance got me reconnected--didn't have to remove the s3 files.
I think this narrows down the problem to just IPFS restarting..
what we can do when this happens right now is something like

```js
async needsResub(topic) {
  let resub = false
  try {
    // ls() resolves to the array of topics this node is subscribed to
    const topics = await ipfs.pubsub.ls()
    if (!topics.includes(topic)) resub = true
  } catch (error) {
    // http api unreachable, e.g. the ipfs node is down
    console.error('ipfs api connection failing')
    resub = true
  }
  return resub
}
```
and then call this before every `pubsub.publish`. This would still result in some messages not being received, though, until we actually try to send and then successfully resubscribe.
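A hedged sketch of wiring that check into publish; `safePublish` and the in-memory fake are illustrative names (the real ipfs-http-client is promise-based, but this toy keeps everything synchronous so the flow is easy to follow):

```js
// Check subscription state before each publish and resubscribe if needed.
function safePublish(ipfs, topic, handler, msg) {
  let resub = false
  try {
    const topics = ipfs.pubsub.ls()
    if (!topics.includes(topic)) resub = true
  } catch (err) {
    resub = true // api unreachable: assume the subscription is gone
  }
  if (resub) ipfs.pubsub.subscribe(topic, handler)
  ipfs.pubsub.publish(topic, msg)
}

// Tiny in-memory stand-in for the ipfs client, for demonstration only.
const handlers = new Map()
const fake = {
  pubsub: {
    ls: () => [...handlers.keys()],
    subscribe: (t, h) => handlers.set(t, h),
    publish: (t, m) => { const h = handlers.get(t); if (h) h(m) },
  },
}

const received = []
safePublish(fake, '/ceramic', (m) => received.push(m), 'hello')
handlers.clear() // simulate an ipfs restart between publishes
safePublish(fake, '/ceramic', (m) => received.push(m), 'world')
console.log(received) // [ 'hello', 'world' ]
```

Note the second publish only succeeds because the check runs first; any messages published to us by *other* peers between the restart and our next send would still be lost, as noted above.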
Pasting what I wrote in discord... This is what I'm seeing now:
So far we are only seeing pubsub messages received on our infra's remote version of this node when we:
1 - have fixed IPFS versions (as in https://github.com/ceramicnetwork/js-ipfs-ceramic/commit/59df199150894067d5dd27059a10669c3b41f867)
~2 - delete the `config` file created by ipfs~ (this isn't strictly necessary)
3 - restart the js-ceramic instance using this ipfs node
It's not exactly clear what the issue is, but it seems to be some inconsistent state/config between the ipfs node and the ipfs client in the ceramic node.