filecoin-station / spark

💥 Storage Provider Retrieval Checker as a Filecoin Station Module 🛰️
https://filspark.com

Recover faster after network outage #49

Open bajtos opened 10 months ago

bajtos commented 10 months ago

https://github.com/filecoin-station/spark/pull/47 increased the delay between retrievals to ~60 seconds. We are now waiting for 60 seconds before we try to connect to spark-api after being offline.

Workaround: restart the Station after coming online.

Proposed fix:

bajtos commented 10 months ago

Possibly related:

juliangruber commented 10 months ago

> https://github.com/filecoin-station/spark/pull/47 increased the delay between retrievals to ~60 seconds. We are now waiting for 60 seconds before we try to connect to spark-api after being offline.
>
> Workaround: restart the Station after coming online.

I want to make sure I understand the problem statement. Why is it a problem to wait 60 seconds after having been offline? Isn't it ok to be offline, then wait 60 seconds, then try again? And why does restarting Station fix this?

bajtos commented 9 months ago

Here is what I observed:

This behaviour creates an impression that the Station cannot correctly detect the transition of the computer from offline to online. (Personally, I perceive such behaviour as the app developers' sloppiness, and I don't want to perceive myself as a sloppy person.)

> why does restarting Station fix this?

IIUC, the Station decides whether we are offline or online based on the outcome of a SPARK iteration. The Station goes offline when SPARK cannot fetch round details or submit the measurement. When we are offline, and SPARK reports that it was able to fetch round details, we go back online.
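To make that description concrete, here is a rough sketch of the status transitions as I understand them. The event names and the `nextStatus` helper are hypothetical, made up for illustration; this is not the Station's actual code.

```javascript
// Hypothetical model of the online/offline logic described above.
// Event names are assumptions, not identifiers from Station or SPARK.
function nextStatus (status, event) {
  switch (event) {
    case 'fetch-round-failed':
    case 'submit-failed':
      // SPARK could not reach spark-api: treat the Station as offline
      return 'offline'
    case 'fetch-round-ok':
      // SPARK fetched round details: we are back online
      return 'online'
    default:
      // Any other event leaves the status unchanged
      return status
  }
}
```

The point is that the status only changes when a SPARK iteration runs, so the time it takes to notice we are back online is bounded by the delay between iterations.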

This worked well when the delay between jobs was ~10 seconds. It no longer works with the current ~60-second delay because it can take up to 60 seconds before Station/SPARK can detect that we are back online.

When I restart the Station, SPARK starts the next job immediately and therefore the Station quickly transitions to the online status.

Here is the main SPARK loop:

https://github.com/filecoin-station/spark/blob/fc756cf9720a31af11148df77ce2d716569a84ff/lib/spark.js#L165-L187

I propose modifying the following line to calculate different delays based on whether we are in a healthy (online) state.

https://github.com/filecoin-station/spark/blob/fc756cf9720a31af11148df77ce2d716569a84ff/lib/spark.js#L181
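A minimal sketch of what I have in mind, not a tested patch. The constant names, the 10-second retry value, the `getDelayBeforeNextJob` helper and the idea of subtracting the time already spent in the iteration are all assumptions for illustration; the actual code in `lib/spark.js` may be structured differently.

```javascript
// Hypothetical values: ~60s between jobs when healthy (as introduced by #47),
// and a shorter retry interval while we believe we are offline.
const HEALTHY_DELAY_MS = 60_000
const OFFLINE_RETRY_DELAY_MS = 10_000

/**
 * Calculate how long to wait before starting the next job.
 * @param {object} args
 * @param {boolean} args.healthy Whether the last iteration reached spark-api
 * @param {number} args.elapsedMs Time already spent in the last iteration
 * @returns {number} Delay in milliseconds, never negative
 */
function getDelayBeforeNextJob ({ healthy, elapsedMs }) {
  const target = healthy ? HEALTHY_DELAY_MS : OFFLINE_RETRY_DELAY_MS
  return Math.max(0, target - elapsedMs)
}
```

With something like this, the loop keeps the ~60-second cadence while healthy but retries sooner after a failed iteration, so the Station should notice the network is back without requiring a restart.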