Add --max-retries parameter for API connection in boost client

parkan commented 2 years ago

Our production environment sees a fairly high rate of Lotus API connection errors (unexpected EOF, broken pipe, or context deadline exceeded) when dialing api.chain.love or glif. These errors are usually easy to recover from, and the API session itself rarely shows errors once the connection succeeds, but because the boost client's exit code does not distinguish this error type, it's hard to manage retries with external tooling.

I propose adding a --max-retries option to manage e.g. exponential backoff retries for the initial API wss connection. We would be happy with ~5 retries taking place over a minute or two to avoid stampeding the endpoint.

dirkmc commented 2 years ago

I'd suggest we name the parameter --max-gateway-retries so it's clear it refers specifically to connecting to the gateway. It should have a default value of 0 (meaning, don't retry).

We can use the same backoff library that we use in the http transport. I'd suggest we hard-code the backoff parameters to:

    minBackOff: 1 second
    maxBackOff: 1 minute
    factor: 2

dirkmc commented 2 years ago

@parkan we looked at the code and it seems like this is not really straightforward to implement. We make multiple calls to the API in different places, so we'd need to add retry logic to each one, which is a little complicated and may not give consistent results. Can you describe your use case a little more? Maybe we can work out something simpler.

parkan commented 2 years ago

this is what I typically observe:

2022-09-28 14:25:57,177 INFO     initializing boost
Error: cant setup gateway connection: cannot dial address wss://api.chain.love/rpc/v1 for websocket: bad handshake: websocket: bad handshake
FATAL:  Command '['./boost', 'init']' returned non-zero exit status 1.
/app/docker_run.sh --rm --name maker-EOT-2016-20161219155331174-04681-04690-wbgrp-crawl005 --net=host --tmpfs /var/tmp/fast:rw,size=1500m,mode=1777 -v /t/derive/EOT-2016-20161219155331174-04681-04690-wbgrp-crawl005:/item -v /opt/.petabox/petabox-prod.xml:/opt/.petabox/petabox-prod.xml   -v /app/docker_run.sh:/app/docker_run.sh  -v /t/task/EOT-2016-20161219155331174-04681-04690-wbgrp-crawl005:/task   registry.archive.org/www/transmit-car-offline/production failed with exit code: 1

so the failure is always (or nearly always) during boost init inside the docker job (we init every time since the task runs in an ephemeral container on an arbitrary machine that has the data in question); I have yet to see a subsequent call fail if init succeeds

as an alternative, I can try to guard that call specifically and/or regex the stderr for "cannot dial" to retry in my script (currently any non-0 exit code from boost aborts the entire task)

or, do you think the init call specifically is more susceptible to connection errors for some reason? I could also try to serialize the boost repo and mount it to the workers from hashicorp vault but so far I have seen no upside to that approach as initializing and importing keys is easy and cheap

dirkmc commented 2 years ago

I see, so it's just when it's setting up the WebSocket connection. It should be more straightforward for us to set up retries around that particular piece of code. First I'd like to confirm that this is being caused by gateway flakiness. Could you please try implementing the regex + retry and confirm that it fixes the problem? If so, we'll implement the retry option.

parkan commented 2 years ago

hmm, so I am actually also seeing "failed to get account key" errors (https://github.com/filecoin-project/boost/blob/6e0ac5c56fcecac9938bdf3bea2444c6bffaf20e/storagemarket/lp2pimpl/net.go#L288) when getting deal status from miners, at a rate of approx 3-5%; this seems to correlate with the connection errors above, but not 1:1

filecoin-project/boost#844