application-research / autoretrieve

A server to make GraphSync data accessible on IPFS
22 stars 7 forks source link

Monitor miner failures to avoid failure spam #90

Closed elijaharita closed 2 years ago

elijaharita commented 2 years ago

I have a theory that some memory spikes could be caused by sudden bursts of failed requests to miners. This introduces a struct that tracks miner failures and temporarily suspends new retrievals from being started with miners that have generated too many errors in a short amount of time.

This'll also help reduce log spam a bit.

Currently the config struct is hardcoded at maximum 5 failures allowed within 15 seconds, with suspension duration of 1 minute. YAML options can be added in a subsequent PR.

I think @kylehuntsman might want to look into adding a metric for how many miners are currently suspended later (and maybe how many are properly working? I can share some info about how to access that info)

elijaharita commented 2 years ago

All my comments are non-blocking. LGTM.

Just to confirm my understanding of your code logic, based on the configuration you have:

* if I get 5 failures in 15 seconds, I get suspended for 1 minute

* during the minute I'm suspended, there can potentially be a re-upp of my suspension if more failures come in (presumably this should drop off quickly cause code in the retriever stops going to the miner once its suspended, but maybe there are already things in flight)

yes, exactly right : )