can also be 503:
aiohttp.client_exceptions.ClientResponseError: 503, message='Service Unavailable', url=URL('https://dandiarchive.s3.amazonaws.com/blobs/2c6/d85/2c6d85f8-b6ba-4481-bb1d-fd001519de59?response-content-disposition=attachment%3B%20filename%3D%22sub-803390283_ses-831882777_probe-832810573_ecephys.nwb%22&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUBRWC5GAEKH3223E/20230207/us-east-2/s3/aws4_request&X-Amz-Date=20230207T112003Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=56f985b4d855fc455bd727826e2c1dfef57a6b44d57b795d0f04682c38ccfd48')
so we might want to retry on any 5xx, maybe.
@yarikoptic I've added basic retrying using aiohttp-retry, but there's a wrinkle in getting it to retry the desired number of times: before fetching any data, fsspec tries to get the file size via a HEAD request, falling back to a GET request if that fails (and if either succeeds, it makes yet another request for the actual data). Hence, if the server is having serious problems, we may end up making twice the expected number of retries. Given that, exactly how many retries (and with what delays) should there be?
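For reference, a minimal sketch of what this wiring could look like, assuming aiohttp-retry's `RetryClient`/`ExponentialRetry` API and fsspec's `get_client` hook (the actual code in the PR may differ):

```python
import aiohttp
from aiohttp_retry import ExponentialRetry, RetryClient
from fsspec.implementations.http import HTTPFileSystem

async def get_retry_client(**kwargs):
    # RetryClient wraps aiohttp.ClientSession and transparently retries
    # responses whose status code is listed in `statuses`.
    return RetryClient(
        retry_options=ExponentialRetry(
            attempts=5,                     # retries per individual request
            statuses={500, 502, 503, 504},  # retry on transient 5xx errors
        ),
        **kwargs,
    )

# fsspec uses this session both for the HEAD/GET size probe and for the
# actual data request, so a consistently failing server can consume up to
# twice `attempts` retries overall.
fs = HTTPFileSystem(get_client=get_retry_client)
```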
Thank you for digging into it! If it were up to me, I would retry on known server issues for quite extended periods, as long as that does not cause timeouts "up the stack". So here I think it is totally fine to retry twice the expected number of times: we reach a balance of count and duration so that we do not stall indefinitely, and our jobs such as dandisets-healthcheck proceed without false positives caused by IO issues at the FUSE level.
it seems that in the course of using it for dandisets we are too frequently triggering 500s from AWS, which look like:
IMHO we should make datalad-fuse more robust to intermittent issues and either do our own retries or instruct aiohttp to retry in some cases, in particular upon HTTP 500: retry, e.g., up to 5 times, spreading the delays over up to a minute or so.
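For concreteness, one schedule that fits "up to 5 times over about a minute" is exponential backoff starting at 2 s and doubling each attempt; the numbers below are illustrative, not a committed design:

```python
# Illustrative backoff schedule: 5 retries, delay doubling from 2 s.
# Worst-case total wait: 2 + 4 + 8 + 16 + 32 = 62 s, i.e. about a minute.
def backoff_delays(attempts=5, start=2.0, factor=2.0):
    delay = start
    for _ in range(attempts):
        yield delay
        delay *= factor

print(list(backoff_delays()))  # [2.0, 4.0, 8.0, 16.0, 32.0]
```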