Is it possible to continue monitoring nodes after one unsuccessful dial?

wcgcyx commented 3 years ago

Hello, it seems that after one unsuccessful dial attempt, the monitor marks a node as unreachable and does not attempt any further dials. Is it possible to continue monitoring disconnected nodes for a period of time before considering them permanently unreachable? That way, we could gather data on the disconnected nodes and analyze the offline patterns of some server nodes (for example, some nodes may have regular offline times for various reasons).

dennis-tra commented 3 years ago

Hi @wcgcyx,

thanks for your interest in the crawler!

There is actually some semi-sophisticated logic around retries when dialing nodes in the monitoring mode of the crawler. See this monster: https://github.com/dennis-tra/nebula-crawler/blob/4c88f591f3ea2fe17f45380b69be901709e2fa47/pkg/monitor/worker.go#L88-L141

There are some error cases where retries don't really make sense. However, you're right that some nodes are definitely following patterns, otherwise there wouldn't be this obvious periodicity. These are the crawl results of IPFS nodes from the last 7 days.
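The idea of retrying only when a retry can help can be sketched roughly as follows. This is not the code from the linked worker.go; it is a minimal illustration in which ErrPermanent, errRefused, dialFunc, and dialWithRetry are hypothetical names, and the doubling backoff is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrPermanent marks failures where another dial is assumed to be pointless.
// Hypothetical sentinel, not taken from nebula.
var ErrPermanent = errors.New("permanent dial failure")

// errRefused stands in for a transient failure that is worth retrying (hypothetical).
var errRefused = errors.New("connection refused")

// dialFunc is a placeholder for whatever performs the actual dial.
type dialFunc func() error

// dialWithRetry calls dial up to maxAttempts times and doubles the wait
// between attempts. It stops early on success or on ErrPermanent and
// returns the number of attempts made together with the last error.
func dialWithRetry(dial dialFunc, maxAttempts int, baseDelay time.Duration) (int, error) {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err = dial()
		if err == nil {
			return attempt, nil
		}
		if errors.Is(err, ErrPermanent) {
			return attempt, err // retrying would not change the outcome
		}
		if attempt < maxAttempts {
			time.Sleep(baseDelay << (attempt - 1)) // 1x, 2x, 4x, ...
		}
	}
	return maxAttempts, err
}

func main() {
	calls := 0
	// Simulated peer that refuses the first two dials and then answers.
	flaky := func() error {
		calls++
		if calls < 3 {
			return errRefused
		}
		return nil
	}
	attempts, err := dialWithRetry(flaky, 5, 10*time.Millisecond)
	fmt.Println("attempts:", attempts, "error:", err)
}
```

Which errors count as permanent is the real design question; the sketch leaves that classification to the caller.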

The current state of the crawler would allow the analysis of online/offline patterns if the PeerID stays the same. It would be a little bit more complicated if the PeerID changed and the MultiAddr stayed constant - but it would be possible as well.

Every time the crawler considers a node as offline it "closes" a session in the database. As soon as the node is found again (I'm crawling every 30m) a new session row is inserted into the database for the same PeerID. So, basically, the information is there.
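The session idea described above could look roughly like this in memory. It is only a sketch under assumptions: the real crawler stores sessions as database rows, and Session, SessionLog, Observe, OfflineGaps, and the at helper are hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// Session is one continuous online period of a peer.
// A nil ClosedAt means the session is still open.
type Session struct {
	PeerID   string
	OpenedAt time.Time
	ClosedAt *time.Time
}

// SessionLog keeps sessions in memory (a stand-in for database rows).
type SessionLog struct {
	sessions []*Session
	open     map[string]*Session // currently open session per peer ID
}

func NewSessionLog() *SessionLog {
	return &SessionLog{open: make(map[string]*Session)}
}

// Observe records one visit: it opens a session when an offline peer is
// seen online, and closes the open session when an online peer is unreachable.
func (l *SessionLog) Observe(peerID string, online bool, at time.Time) {
	s, isOpen := l.open[peerID]
	switch {
	case online && !isOpen:
		s = &Session{PeerID: peerID, OpenedAt: at}
		l.sessions = append(l.sessions, s)
		l.open[peerID] = s
	case !online && isOpen:
		closedAt := at
		s.ClosedAt = &closedAt
		delete(l.open, peerID)
	}
}

// OfflineGaps returns the time between the close of one session and the
// start of the next session for the same peer.
func (l *SessionLog) OfflineGaps(peerID string) []time.Duration {
	var gaps []time.Duration
	var prev *Session
	for _, s := range l.sessions {
		if s.PeerID != peerID {
			continue
		}
		if prev != nil && prev.ClosedAt != nil {
			gaps = append(gaps, s.OpenedAt.Sub(*prev.ClosedAt))
		}
		prev = s
	}
	return gaps
}

// base and at are hypothetical helpers that build example timestamps.
var base = time.Date(2021, 7, 1, 0, 0, 0, 0, time.UTC)

func at(minutes int) time.Time {
	return base.Add(time.Duration(minutes) * time.Minute)
}

func main() {
	sl := NewSessionLog()
	sl.Observe("peerId1", true, at(0))   // found in a crawl: session opens
	sl.Observe("peerId1", false, at(30)) // dial fails: session closes
	sl.Observe("peerId1", true, at(90))  // found again: new session
	fmt.Println(sl.OfflineGaps("peerId1"))
}
```

The gaps between consecutive sessions of the same PeerID are the raw material for an online/offline pattern analysis.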

Do you have an idea of a preferred or more straightforward way of tracking such nodes?

wcgcyx commented 3 years ago

Hello, @dennis-tra

Thanks for your reply. Yes, you are right. What I was trying to say is if a node is considered offline then the monitor will close the session as you said.

The graph you showed is impressive. When you say you are crawling every 30m, do you mean that the monitor process is always running and meanwhile you run the crawling process every 30m?

I was thinking of continuing to monitor the offline nodes and figure out if they will come back online later, and if so, maybe also find a way to figure out why. Maybe it is related to geolocation? Or maybe it is related to something else.

By the way, how do you deal with a changing PeerID but the same MultiAddr? Is it something in the code now?

dennis-tra commented 3 years ago

> Hello, @dennis-tra
>
> Thanks for your reply. Yes, you are right. What I was trying to say is if a node is considered offline then the monitor will close the session as you said.

👍 :)

> The graph you showed is impressive. When you say you are crawling every 30m, do you mean that the monitor process is always running and meanwhile you run the crawling process every 30m?

That's correct! I have configured a cron job to invoke nebula crawl every 30m and a service definition for the monitoring task, so that the process is restarted if it exits unexpectedly (hasn't happened yet). You can find that one here.

> I was thinking of continuing to monitor the offline nodes and figure out if they will come back online later, and if so, maybe also find a way to figure out why. Maybe it is related to geolocation? Or maybe it is related to something else.

That would be really interesting to find out! The monitoring process can certainly be extended to grab all peers that have recently gone offline (or that match other criteria).
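One way such an extension might pick its candidates is sketched below. It assumes the closed sessions of currently offline peers are available as a slice; ClosedSession, recentlyOffline, retryWindow, and refTime are hypothetical names, and the two hour window is an arbitrary example value.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// ClosedSession is a finished online period of a peer (hypothetical type).
type ClosedSession struct {
	PeerID   string
	ClosedAt time.Time
}

// retryWindow is an assumed example value: how long a peer stays a re-dial
// candidate after going offline.
const retryWindow = 2 * time.Hour

// refTime is a fixed reference time used only for the example data.
var refTime = time.Date(2021, 7, 1, 12, 0, 0, 0, time.UTC)

// recentlyOffline returns the sorted, de-duplicated IDs of peers that have
// at least one session closed within the last window before now.
func recentlyOffline(closed []ClosedSession, now time.Time, window time.Duration) []string {
	cutoff := now.Add(-window)
	seen := make(map[string]struct{})
	var peers []string
	for _, s := range closed {
		if s.ClosedAt.Before(cutoff) || s.ClosedAt.After(now) {
			continue // closed too long ago, or timestamp in the future
		}
		if _, dup := seen[s.PeerID]; dup {
			continue
		}
		seen[s.PeerID] = struct{}{}
		peers = append(peers, s.PeerID)
	}
	sort.Strings(peers)
	return peers
}

func main() {
	closed := []ClosedSession{
		{PeerID: "peerA", ClosedAt: refTime.Add(-30 * time.Minute)},
		{PeerID: "peerB", ClosedAt: refTime.Add(-5 * time.Hour)},
		{PeerID: "peerA", ClosedAt: refTime.Add(-90 * time.Minute)},
	}
	// Only peerA went offline within the window, so only it is re-dialed.
	fmt.Println(recentlyOffline(closed, refTime, retryWindow))
}
```

Other criteria (for example, the number of past sessions or the length of the last session) could replace or complement the time window.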

> By the way, how do you deal with a changing PeerID but the same MultiAddr? Is it something in the code now?

If a peer is found with a different MultiAddr, the old one is persisted in the Peers table. Have a look at this migration for more information. Basically, if the crawler finds a peer with ID peerId1 at multiAddr1 in the DHT and, in the subsequent crawl, finds the same peer with ID peerId1 at multiAddr2, we know that the peer must have been offline for a short period of time. This logic is captured here :)
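The comparison between two consecutive crawls can be illustrated with a small sketch. It is not the linked implementation: Snapshot, toSet, and changedPeers are hypothetical names, multiaddresses are treated as plain strings, and persisting the old addresses is left out.

```go
package main

import (
	"fmt"
	"sort"
)

// Snapshot maps a peer ID to the multiaddresses seen in one crawl.
// Addresses are plain strings here to keep the sketch dependency free.
type Snapshot map[string][]string

// toSet converts a slice of addresses into a set, ignoring duplicates.
func toSet(addrs []string) map[string]struct{} {
	set := make(map[string]struct{}, len(addrs))
	for _, a := range addrs {
		set[a] = struct{}{}
	}
	return set
}

// changedPeers returns the sorted IDs of peers that appear in both crawls
// but with a different set of addresses.
func changedPeers(prev, curr Snapshot) []string {
	var changed []string
	for id, currAddrs := range curr {
		prevAddrs, known := prev[id]
		if !known {
			continue // new peer, nothing to compare against
		}
		a, b := toSet(prevAddrs), toSet(currAddrs)
		same := len(a) == len(b)
		for addr := range a {
			if _, ok := b[addr]; !ok {
				same = false
				break
			}
		}
		if !same {
			changed = append(changed, id)
		}
	}
	sort.Strings(changed)
	return changed
}

func main() {
	prev := Snapshot{
		"peerId1": {"/ip4/192.0.2.1/tcp/4001"},
		"peerId2": {"/ip4/192.0.2.2/tcp/4001"},
	}
	curr := Snapshot{
		"peerId1": {"/ip4/198.51.100.7/tcp/4001"}, // address changed
		"peerId2": {"/ip4/192.0.2.2/tcp/4001"},    // unchanged
	}
	fmt.Println(changedPeers(prev, curr))
}
```

A peer flagged by this comparison is one whose old addresses would be kept before the new ones are stored.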

dennis-tra / nebula

Is it possible to continue monitoring nodes after one unsuccessful dial? #3