Add retries on aiohttp errors

yarikoptic commented 1 year ago

It seems that in the course of using datalad-fuse for dandisets we are frequently getting HTTP 500 responses from AWS:

(base) dandi@drogon:~/cronlib/dandisets-healthstatus$ grep Error: fuse.log
aiohttp.client_exceptions.ClientResponseError: 500, message='Internal Server Error', url=URL('https://dandiarchive.s3.amazonaws.com/blobs/c86/cdf/c86cdfba-e1af-45a7-8dfd-d243adc20ced?response-content-disposition=attachment%3B%20filename%3D%22sub-YutaMouse33_ses-YutaMouse33-150222_behavior%2Becephys.nwb%22&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUBRWC5GAEKH3223E/20230203/us-east-2/s3/aws4_request&X-Amz-Date=20230203T095518Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=d9f95b0377b232c60b0e296a10c418d2e7d43b9078103d46b88cc302f3043c9d')
aiohttp.client_exceptions.ClientResponseError: 500, message='Internal Server Error', url=URL('https://dandiarchive.s3.amazonaws.com/blobs/4bb/039/4bb039ef-9500-4124-b927-eedf1b1d53dd?response-content-disposition=attachment%3B%20filename%3D%22sub-YutaMouse33_ses-YutaMouse33-150225_behavior%2Becephys.nwb%22&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUBRWC5GAEKH3223E/20230203/us-east-2/s3/aws4_request&X-Amz-Date=20230203T124214Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=b688128a142bf75f095d2bae4fb5d0ac70768a61242b8b46d69261cbda24b8e7')

The full traceback looks like this:

Uncaught exception from FUSE operation read, returning errno.EINVAL.
Traceback (most recent call last):
  File "/home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/fuse.py", line 734, in _wrapper
    return func(*args, **kwargs) or 0
  File "/home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/fuse.py", line 845, in read
    ret = self.operations('read', self._decode_optional_path(path), size,
  File "/home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/datalad_fuse/fuse_.py", line 75, in __call__
    return super(DataLadFUSE, self).__call__(op, self.root + path, *args)
  File "/home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/fuse.py", line 1076, in __call__
    return getattr(self, op)(*args)
  File "/home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/datalad_fuse/fuse_.py", line 235, in read
    return f.read(size)
  File "/home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/fsspec/implementations/http.py", line 590, in read
    return super().read(length)
  File "/home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/fsspec/spec.py", line 1659, in read
    out = self.cache._fetch(self.loc, self.loc + length)
  File "/home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/fsspec/caching.py", line 101, in _fetch
    self.cache[sstart:send] = self.fetcher(sstart, send)
  File "/home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/fsspec/asyn.py", line 113, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/fsspec/asyn.py", line 98, in sync
    raise return_result
  File "/home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/fsspec/asyn.py", line 53, in _runner
    result[0] = await coro
  File "/home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/fsspec/implementations/http.py", line 631, in async_fetch_range
    r.raise_for_status()
  File "/home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/aiohttp/client_reqrep.py", line 1005, in raise_for_status
    raise ClientResponseError(
aiohttp.client_exceptions.ClientResponseError: 500, message='Internal Server Error', url=URL('https://dandiarchive.s3.amazonaws.com/blobs/4bb/039/4bb039ef-9500-4124-b927-eedf1b1d53dd?response-content-disposition=attachment%3B%20filename%3D%22sub-YutaMouse33_ses-YutaMouse33-150225_behavior%2Becephys.nwb%22&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUBRWC5GAEKH3223E/20230203/us-east-2/s3/aws4_request&X-Amz-Date=20230203T124214Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=b688128a142bf75f095d2bae4fb5d0ac70768a61242b8b46d69261cbda24b8e7')

IMHO we should make datalad-fuse more robust to intermittent issues, either by doing our own retries or by instructing aiohttp to retry in some cases. In particular, upon an HTTP 500 we should retry, e.g., up to 5 times, spreading the attempts over up to a minute or so.
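To make the proposal concrete, here is a minimal sketch of "retry up to 5 times over about a minute" using only the standard library. It is not the datalad-fuse implementation: `TransientHTTPError`, `backoff_delays` and `with_retries` are hypothetical names, the exception class is a stand-in for `aiohttp.ClientResponseError`, and the 2 s base delay with doubling is just one assumed schedule that adds up to roughly a minute.

```python
import asyncio
from typing import Awaitable, Callable, FrozenSet, List, TypeVar

T = TypeVar("T")


class TransientHTTPError(Exception):
    """Hypothetical stand-in for aiohttp.ClientResponseError."""

    def __init__(self, status: int) -> None:
        super().__init__(f"HTTP {status}")
        self.status = status


def backoff_delays(retries: int = 5, base: float = 2.0, factor: float = 2.0) -> List[float]:
    """Delays before each retry: 2, 4, 8, 16, 32 seconds by default (62 s total)."""
    return [base * factor**i for i in range(retries)]


async def with_retries(
    op: Callable[[], Awaitable[T]],
    retries: int = 5,
    base: float = 2.0,
    factor: float = 2.0,
    retry_statuses: FrozenSet[int] = frozenset({500}),
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Await op(), retrying with exponential backoff on selected HTTP statuses."""
    delays = backoff_delays(retries, base, factor)
    for attempt in range(retries + 1):
        try:
            return await op()
        except TransientHTTPError as exc:
            # Re-raise errors we do not consider transient, and give up
            # once the last allowed retry has also failed.
            if exc.status not in retry_statuses or attempt == retries:
                raise
            await sleep(delays[attempt])
    raise AssertionError("unreachable")  # the loop always returns or raises
```

The `retry_statuses` parameter is there so the set of retried status codes can be widened later without touching the loop, and `sleep` is injectable so the schedule can be exercised without actually waiting.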

yarikoptic commented 1 year ago

It can also be a 503:

aiohttp.client_exceptions.ClientResponseError: 503, message='Service Unavailable', url=URL('https://dandiarchive.s3.amazonaws.com/blobs/2c6/d85/2c6d85f8-b6ba-4481-bb1d-fd001519de59?response-content-disposition=attachment%3B%20filename%3D%22sub-803390283_ses-831882777_probe-832810573_ecephys.nwb%22&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUBRWC5GAEKH3223E/20230207/us-east-2/s3/aws4_request&X-Amz-Date=20230207T112003Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=56f985b4d855fc455bd727826e2c1dfef57a6b44d57b795d0f04682c38ccfd48')

So we might want to retry on any 5xx, maybe.

jwodder commented 1 year ago

@yarikoptic I've added basic retrying using aiohttp-retry, but there's a wrinkle in getting it to retry the desired number of times: before fetching any data, fsspec tries to get the file size via a HEAD request, falling back to a GET request if that fails (and if either succeeds, it then makes another request to get the actual data). Hence, we may end up making twice the expected number of retries if the server is having serious problems. Given that, exactly how many retries (and with what delays) should there be?
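The doubling can be illustrated with a small simulation. This is only a sketch under stated assumptions: it does not call fsspec or aiohttp-retry, the server is assumed to fail every request, and `ServerError`, `request_with_retries`, `fetch_size` and `count_requests` are hypothetical names that mimic the HEAD-then-GET size lookup described above.

```python
from typing import List


class ServerError(Exception):
    """Hypothetical stand-in for an HTTP 5xx response."""


def request_with_retries(method: str, attempts: int, log: List[str]) -> None:
    """Simulate one retried request against a server that always fails."""
    for _ in range(attempts):
        log.append(method)  # every attempt reaches the failing server
    raise ServerError(f"{method} failed after {attempts} attempts")


def fetch_size(attempts: int, log: List[str]) -> None:
    """Mimic the size lookup: HEAD first, then GET as a fallback."""
    try:
        request_with_retries("HEAD", attempts, log)
    except ServerError:
        request_with_retries("GET", attempts, log)


def count_requests(attempts: int) -> int:
    """Total requests sent before the size lookup finally gives up."""
    log: List[str] = []
    try:
        fetch_size(attempts, log)
    except ServerError:
        pass
    return len(log)
```

With 5 attempts per request, `count_requests(5)` returns 10: the HEAD request exhausts its attempts, then the GET fallback exhausts its own. The total waiting time doubles in the same way.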

yarikoptic commented 1 year ago

Thank you for digging into it! If it were up to me, I would retry on errors known to be server-side issues for quite extended periods of time, as long as that does not cause timeouts "up the stack". Hence I think it is totally fine here to retry twice the expected number of times. What matters to me is that we reach some balance in the number and duration of retries, so that we do not "stall indefinitely" and our jobs such as dandisets-healthcheck proceed without false positives due to IO issues at the FUSE level.

datalad / datalad-fuse

Add retries on aiohttp errors #93