The broken-pipe error certainly sounds like it should be retried, and it's odd that in this case we don't know more. Did you not even get an HTTP status code? There's probably a longer list of specific errors that should be considered retriable.
Note that dask allows you to retry whole failed tasks, and that might be a good safeguard against weird intermittent problems on a small number of tasks in a large graph. Of course, we'd like to get it fixed anyway.
Did you not even get an HTTP status code?
I didn't, and here's the full log for reference, where snakemake is running a script that uses local dask:
Building DAG of jobs...
Creating conda environment envs/gwas.yaml...
Downloading and installing remote packages.
Environment for envs/gwas.yaml created (location: .snakemake/conda/0a479a2e)
Using shell: /bin/bash
Provided cores: 64
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 bgen_to_zarr
1
Select jobs to execute...
[Fri Dec 4 12:37:26 2020]
rule bgen_to_zarr:
    input: rs-ukb/raw/gt-imputation/ukb_imp_chr15_v3.bgen, rs-ukb/raw/gt-imputation/ukb_mfi_chr15_v3.txt, rs-ukb/raw/gt-imputation/ukb59384_imp_chr15_v3_s487296.sample
    output: rs-ukb/prep/gt-imputation/ukb_chr15.ckpt
    jobid: 0
    wildcards: bgen_contig=15
    threads: 63
    resources: mem_mb=222822
Downloading from remote: rs-ukb/raw/gt-imputation/ukb59384_imp_chr15_v3_s487296.sample
Finished download.
Downloading from remote: rs-ukb/raw/gt-imputation/ukb_imp_chr15_v3.bgen
Finished download.
Downloading from remote: rs-ukb/raw/gt-imputation/ukb_mfi_chr15_v3.txt
Finished download.
Activating conda environment: /workdir/.snakemake/conda/0a479a2e
2020-12-04 12:51:13,013 | main | INFO | Loading BGEN dataset for contig Contig(name=15, index=14) from rs-ukb/raw/gt-imputation/ukb_imp_chr15_v3.bgen (chunks = (250, -1))
2020-12-04 12:52:04,036 | main | INFO | Rechunking dataset for contig Contig(name=15, index=14) to gs://rs-ukb/prep/gt-imputation/ukb_chr15.zarr (chunks = (5216, 5792)):
Is there a way to do that when working through Xarray?
I'm not sure - it's an optional argument to .compute().
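For what it's worth, here is a minimal sketch of one way to reach that argument from Xarray, assuming a dask.distributed Client is attached (retries= is honoured by the distributed scheduler); the bucket paths are placeholders:

```python
import xarray as xr
from dask.distributed import Client

client = Client()  # retries= below is handled by the distributed scheduler

ds = xr.open_zarr("gs://my-bucket/input.zarr")  # placeholder path

# compute=False returns a dask Delayed for the store step,
# which exposes the usual .compute() keyword arguments.
delayed = ds.to_zarr("gs://my-bucket/output.zarr", compute=False)
delayed.compute(retries=3)
```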
I saw this twice today outside of the context of Dask, once in Xarray as logged in https://github.com/pydata/xarray/issues/4704 and again when called from Pandas:
Traceback (most recent call last):
File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/gcsfs/core.py", line 507, in _call
self.validate_response(status, contents, json, path, headers)
File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/gcsfs/core.py", line 1228, in validate_response
raise HttpError(error)
gcsfs.utils.HttpError: Required
Traceback (most recent call last):
File "scripts/gwas.py", line 438, in <module>
fire.Fire()
File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/fire/core.py", line 138, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/fire/core.py", line 463, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/fire/core.py", line 672, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "scripts/gwas.py", line 401, in run_gwas
df.to_parquet(path)
File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/pandas/util/_decorators.py", line 214, in wrapper
return func(*args, **kwargs)
File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/pandas/core/frame.py", line 2109, in to_parquet
to_parquet(
File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/pandas/io/parquet.py", line 258, in to_parquet
return impl.write(
File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/pandas/io/parquet.py", line 112, in write
self.api.parquet.write_table(
File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/pyarrow/parquet.py", line 1733, in write_table
writer.write_table(table, row_group_size=row_group_size)
File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/pyarrow/parquet.py", line 591, in write_table
self.writer.write_table(table, row_group_size=row_group_size)
File "pyarrow/_parquet.pyx", line 1433, in pyarrow._parquet.ParquetWriter.write_table
File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/fsspec/spec.py", line 1352, in write
self.flush()
File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/fsspec/spec.py", line 1387, in flush
self._initiate_upload()
File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/gcsfs/core.py", line 1402, in _initiate_upload
self.location = sync(
File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/fsspec/asyn.py", line 71, in sync
raise exc.with_traceback(tb)
File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/fsspec/asyn.py", line 55, in f
result[0] = await future
File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/gcsfs/core.py", line 1486, in initiate_upload
headers, _ = await fs._call(
File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/gcsfs/core.py", line 525, in _call
raise e
File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/gcsfs/core.py", line 507, in _call
self.validate_response(status, contents, json, path, headers)
File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/gcsfs/core.py", line 1228, in validate_response
raise HttpError(error)
gcsfs.utils.HttpError: Required
I mention it because of the 3 errors I linked to in https://github.com/pydata/xarray/issues/4704, this one appears to be the most prevalent (and difficult to work around).
Can you please apply the following and see if you get extra information?
--- a/gcsfs/core.py
+++ b/gcsfs/core.py
@@ -1299,7 +1299,7 @@ class GCSFileSystem(AsyncFileSystem):
elif "invalid" in str(msg):
raise ValueError("Bad Request: %s\n%s" % (path, msg))
elif error:
- raise HttpError(error)
+ raise HttpError({"code": status, "message": error})
elif status:
raise HttpError({"code": status})
(I can push this to a branch, if that helps with installation)
The only mention I can find of something similar is this, where
required: The API request is missing required information. The required information could be a parameter or resource property.
which is not very descriptive.
Indeed, why not add the following for completeness
--- a/gcsfs/core.py
+++ b/gcsfs/core.py
@@ -1290,6 +1290,7 @@ class GCSFileSystem(AsyncFileSystem):
# TODO: limit to appropriate exceptions
msg = content
+ logger.debug("Error condition: %s" % ((status, content, json, path, headers), ))
if status == 404:
raise FileNotFoundError
elif status == 403:
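For anyone else applying the patch, a small sketch of how to surface that logger.debug line (this assumes the gcsfs module logger is registered under the "gcsfs" name; adjust if your version differs):

```python
import logging

# Attach a basic stderr handler and turn on debug output for gcsfs only,
# so the new "Error condition: ..." message shows up in the job logs.
logging.basicConfig()
logging.getLogger("gcsfs").setLevel(logging.DEBUG)
```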
Thanks @martindurant, I patched my client environment and will post anything that gets caught here.
I'm getting this same gcsfs.utils.HttpError: Required error with long-running zarr writes to GCS. The errors are common enough that jobs with ~100k chunks usually fail. I'll add these debug lines tomorrow and try running again to see if there is any more detail on the HttpError.
For Dask retries you may want to try the dask.annotate function with the retries= keyword. This will require the latest release, I think.
Actually, it may still be in a PR.
https://github.com/dask/distributed/pull/4347/
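For reference, the annotation approach would look roughly like the sketch below. This is illustrative only: it assumes a dask/distributed recent enough to carry annotations through to the scheduler (per the PR above), the paths are placeholders, and annotations can be dropped by graph optimisation depending on settings.

```python
import dask
import xarray as xr
from dask.distributed import Client

client = Client()

ds = xr.open_zarr("gs://my-bucket/input.zarr")  # placeholder path

# Tasks built inside the annotate block carry retries=3 as metadata;
# the distributed scheduler re-runs those tasks on failure.
with dask.annotate(retries=3):
    ds.to_zarr("gs://my-bucket/output.zarr")
```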
Here's the traceback:
File "/root/miniconda3/lib/python3.8/site-packages/xarray/core/dataset.py", line 1745, in to_zarr
return to_zarr(
File "/root/miniconda3/lib/python3.8/site-packages/xarray/backends/api.py", line 1482, in to_zarr
writes = writer.sync(compute=compute)
File "/root/miniconda3/lib/python3.8/site-packages/xarray/backends/common.py", line 155, in sync
delayed_store = da.store(
File "/root/miniconda3/lib/python3.8/site-packages/dask/array/core.py", line 945, in store
result.compute(**kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/dask/base.py", line 167, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/dask/base.py", line 452, in compute
results = schedule(dsk, keys, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/dask/threaded.py", line 76, in get
results = get_async(
File "/root/miniconda3/lib/python3.8/site-packages/dask/local.py", line 486, in get_async
raise_exception(exc, tb)
File "/root/miniconda3/lib/python3.8/site-packages/dask/local.py", line 316, in reraise
raise exc
File "/root/miniconda3/lib/python3.8/site-packages/dask/local.py", line 222, in execute_task
result = _execute_task(task, data)
File "/root/miniconda3/lib/python3.8/site-packages/dask/core.py", line 121, in _execute_task
return func(*(_execute_task(a, cache) for a in args))
File "/root/miniconda3/lib/python3.8/site-packages/dask/array/core.py", line 3632, in store_chunk
return load_store_chunk(x, out, index, lock, return_stored, False)
File "/root/miniconda3/lib/python3.8/site-packages/dask/array/core.py", line 3621, in load_store_chunk
out[index] = np.asanyarray(x)
File "/root/miniconda3/lib/python3.8/site-packages/zarr/core.py", line 1122, in __setitem__
self.set_basic_selection(selection, value, fields=fields)
File "/root/miniconda3/lib/python3.8/site-packages/zarr/core.py", line 1217, in set_basic_selection
return self._set_basic_selection_nd(selection, value, fields=fields)
File "/root/miniconda3/lib/python3.8/site-packages/zarr/core.py", line 1508, in _set_basic_selection_nd
self._set_selection(indexer, value, fields=fields)
File "/root/miniconda3/lib/python3.8/site-packages/zarr/core.py", line 1580, in _set_selection
self._chunk_setitems(lchunk_coords, lchunk_selection, chunk_values,
File "/root/miniconda3/lib/python3.8/site-packages/zarr/core.py", line 1709, in _chunk_setitems
self.chunk_store.setitems({k: v for k, v in zip(ckeys, cdatas)})
File "/root/miniconda3/lib/python3.8/site-packages/fsspec/mapping.py", line 110, in setitems
self.fs.pipe(values)
File "/root/miniconda3/lib/python3.8/site-packages/fsspec/asyn.py", line 121, in wrapper
return maybe_sync(func, self, *args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/fsspec/asyn.py", line 100, in maybe_sync
return sync(loop, func, *args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/fsspec/asyn.py", line 71, in sync
raise exc.with_traceback(tb)
File "/root/miniconda3/lib/python3.8/site-packages/fsspec/asyn.py", line 55, in f
result[0] = await future
File "/root/miniconda3/lib/python3.8/site-packages/fsspec/asyn.py", line 222, in _pipe
await asyncio.gather(
File "/root/miniconda3/lib/python3.8/site-packages/gcsfs/core.py", line 1007, in _pipe_file
return await simple_upload(
File "/root/miniconda3/lib/python3.8/site-packages/gcsfs/core.py", line 1524, in simple_upload
j = await fs._call(
File "/root/miniconda3/lib/python3.8/site-packages/gcsfs/core.py", line 525, in _call
raise e
File "/root/miniconda3/lib/python3.8/site-packages/gcsfs/core.py", line 507, in _call
self.validate_response(status, contents, json, path, headers)
File "/root/miniconda3/lib/python3.8/site-packages/gcsfs/core.py", line 1229, in validate_response
raise HttpError({"code": status, "message": error})
gcsfs.utils.HttpError: {'code': 400, 'message': 'Required', 'errors': [{'message': 'Required', 'domain': 'global', 'reason': 'required'}]}
The docs mentioned above are pretty cryptic so it's hard to know what is going wrong, but for now could this 400 be added to is_retriable in some form?
could this 400 be added to is_retriable
Yes, I suppose we can allow 400 in general or specifically look for this weird response. I hope it is indeed intermittent.
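For concreteness, the kind of change being discussed might look like the sketch below. This is illustrative only, not gcsfs's actual is_retriable code; the attribute name and status-code set are assumptions.

```python
RETRIABLE_CODES = {408, 429, 500, 502, 503, 504}

def is_retriable(exc):
    """Illustrative sketch: treat the usual transient HTTP codes, plus the
    odd 400/"Required" response reported in this issue, as retriable."""
    code = getattr(exc, "code", None)
    if code in RETRIABLE_CODES:
        return True
    # The intermittent failure seen here surfaces as a 400 with message "Required".
    return code == 400 and "Required" in str(exc)
```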
We have been running into somewhat the same error (#323). We have been running our script for the last few hours and were able to extract some more information about the error we were seeing. This code ran on a GCE VM.
The following errors happened consecutively:
2021-01-07 13:05:17,955 [validate_response] ERROR - Error condition: (504, b'', None, '<< REDACTED LINK TO FILE >>', <CIMultiDictProxy('X-GUploader-UploadID': '<< REDACTED >>', 'Content-Type': 'text/html; charset=UTF-8', 'Date': 'Thu, 07 Jan 2021 13:05:17 GMT', 'Vary': 'Origin', 'Vary': 'X-Origin', 'Expires': 'Thu, 07 Jan 2021 13:05:17 GMT', 'Cache-Control': 'private, max-age=0', 'Content-Length': '0', 'Server': 'UploadServer')>)
2021-01-07 13:05:19,237 [validate_response] ERROR - Error condition: (401, b'Anonymous caller does not have storage.objects.get access to the Google Cloud Storage object.', None, '<< REDACTED LINK TO SAME FILE >>', <CIMultiDictProxy('X-GUploader-UploadID': '<< REDACTED >>', 'Content-Type': 'text/html; charset=UTF-8', 'Date': 'Thu, 07 Jan 2021 13:05:19 GMT', 'Vary': 'Origin', 'Vary': 'X-Origin', 'WWW-Authenticate': 'Bearer realm="https://accounts.google.com/"', 'Expires': 'Thu, 07 Jan 2021 13:05:19 GMT', 'Cache-Control': 'private, max-age=0', 'Content-Length': '93', 'Server': 'UploadServer')>)
First, a gateway timeout happened (status: 504), after which the _call method retried and a 401 status code was received.
Interestingly, the VM does have access to the data, so the 401 status seems to be invalid in our case.
A possible solution would be to make the number of retries configurable (in order to mitigate the risk of reaching API call limits) and to retry on 401 as well. An exponential backoff would also be preferable.
How about: if the initial error (504) seems retriable, then we continue retrying whatever the subsequent errors are? Could you give that a try to see if it does the business - or maybe the first error changes something fundamental and a deeper kind of reset is needed.
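A rough sketch of a configurable exponential-backoff retry in that spirit (here simply retrying every exception up to a cap); the name, defaults, and structure are illustrative rather than gcsfs's implementation:

```python
import asyncio
import random

async def call_with_retries(func, *args, retries=6, base_delay=1.0, **kwargs):
    """Illustrative wrapper: keep retrying any exception with exponential
    backoff and jitter, up to a configurable number of attempts."""
    for attempt in range(retries):
        try:
            return await func(*args, **kwargs)
        except Exception:
            if attempt == retries - 1:
                raise
            await asyncio.sleep(base_delay * 2 ** attempt + random.random())
```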
I am seeing similar issues, but also see a KeyError for a missing DataArray (which I verified is on disk as part of a dataset) bubbling up through xarray/zarr. I'm leaving this info here in case it helps or is related. I grabbed HEAD, which is formatting the HttpErrors a little better and shows the status code in the traceback. Some of my jobs hit the 401, while others hit the KeyError. I don't see a 504. Both can be remedied by enough retries of a workload.
@chrisroat , was there again no message with the HTTP errors? The KeyError is not a surprise, that is a valid outcome from zarr when it seems like the file in question can't be reached.
Could one of the people on this thread please aggregate what we have learned into a PR, which either recognises the errors encountered here so that they become retriable, or else retries all errors by default except for a specific list that we know are not retriable?
@martindurant The HttpErrors only have the status code passed in.
My errors are 401 or KeyError. In general, I don't think a 401 should be retried if there is a chance it's real and a token has expired. In the @DPGrev case, it seems like the gateway error has triggered something, and it's possible the retries will all be 401s. @DPGrev, is it possible for you to check by using HEAD and adding 401 to the list in this function? Also, note that the back-off is exponential already.
For the KeyError case, it means there was no underlying error that was caught -- it seems tricky to remedy that if it is at the GCS layer.
I am getting the same error that was posted on Dec 31 in this thread (https://github.com/dask/gcsfs/issues/316#issuecomment-752994837), with a similar traceback and the 'Required: 400'. I understand this might have been solved by https://github.com/dask/gcsfs/pull/335, so I'll try using gcsfs HEAD now instead of 0.7.2. Though not getting the error is no proof that it is solved (it is quite intermittent and random), I'll post back here with my results.
Just noticed the fix in #380 so I tried out a test run with the latest changes. The good news is that did seem to fix the 400 errors. I made it through 1.7TB of a 2+TB dataset write (much further than before), but the bad news is I ran into another failure:
File "/root/miniconda3/lib/python3.8/site-packages/xarray/core/dataset.py", line 1790, in to_zarr
return to_zarr(
File "/root/miniconda3/lib/python3.8/site-packages/xarray/backends/api.py", line 1476, in to_zarr
writes = writer.sync(compute=compute)
File "/root/miniconda3/lib/python3.8/site-packages/xarray/backends/common.py", line 156, in sync
delayed_store = da.store(
File "/root/miniconda3/lib/python3.8/site-packages/dask/array/core.py", line 1041, in store
result.compute(**kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/dask/base.py", line 283, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/dask/base.py", line 565, in compute
results = schedule(dsk, keys, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/dask/threaded.py", line 76, in get
results = get_async(
File "/root/miniconda3/lib/python3.8/site-packages/dask/local.py", line 487, in get_async
raise_exception(exc, tb)
File "/root/miniconda3/lib/python3.8/site-packages/dask/local.py", line 317, in reraise
raise exc
File "/root/miniconda3/lib/python3.8/site-packages/dask/local.py", line 222, in execute_task
result = _execute_task(task, data)
File "/root/miniconda3/lib/python3.8/site-packages/dask/core.py", line 121, in _execute_task
return func(*(_execute_task(a, cache) for a in args))
File "/root/miniconda3/lib/python3.8/site-packages/dask/array/core.py", line 3975, in store_chunk
return load_store_chunk(x, out, index, lock, return_stored, False)
File "/root/miniconda3/lib/python3.8/site-packages/dask/array/core.py", line 3962, in load_store_chunk
out[index] = x
File "/root/miniconda3/lib/python3.8/site-packages/zarr/core.py", line 1211, in __setitem__
self.set_basic_selection(selection, value, fields=fields)
File "/root/miniconda3/lib/python3.8/site-packages/zarr/core.py", line 1306, in set_basic_selection
return self._set_basic_selection_nd(selection, value, fields=fields)
File "/root/miniconda3/lib/python3.8/site-packages/zarr/core.py", line 1597, in _set_basic_selection_nd
self._set_selection(indexer, value, fields=fields)
File "/root/miniconda3/lib/python3.8/site-packages/zarr/core.py", line 1669, in _set_selection
self._chunk_setitems(lchunk_coords, lchunk_selection, chunk_values,
File "/root/miniconda3/lib/python3.8/site-packages/zarr/core.py", line 1861, in _chunk_setitems
self.chunk_store.setitems(values)
File "/root/miniconda3/lib/python3.8/site-packages/zarr/storage.py", line 1068, in setitems
self.map.setitems(values)
File "/root/miniconda3/lib/python3.8/site-packages/fsspec/mapping.py", line 111, in setitems
self.fs.pipe(values)
File "/root/miniconda3/lib/python3.8/site-packages/fsspec/asyn.py", line 72, in wrapper
return sync(self.loop, func, *args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/fsspec/asyn.py", line 53, in sync
raise result[0]
File "/root/miniconda3/lib/python3.8/site-packages/fsspec/asyn.py", line 20, in _runner
result[0] = await coro
File "/root/miniconda3/lib/python3.8/site-packages/fsspec/asyn.py", line 178, in _pipe
await asyncio.gather(
File "/root/miniconda3/lib/python3.8/site-packages/gcsfs/core.py", line 1076, in _pipe_file
location = await initiate_upload(self, bucket, key, content_type, metadata)
File "/root/miniconda3/lib/python3.8/site-packages/gcsfs/core.py", line 1586, in initiate_upload
headers, _ = await fs._call(
File "/root/miniconda3/lib/python3.8/site-packages/gcsfs/core.py", line 531, in _call
status, headers, info, contents = await self._request(
File "/root/miniconda3/lib/python3.8/site-packages/gcsfs/core.py", line 505, in _request
async with self.session.request(
File "/root/miniconda3/lib/python3.8/site-packages/aiohttp/client.py", line 1117, in __aenter__
self._resp = await self._coro
File "/root/miniconda3/lib/python3.8/site-packages/aiohttp/client.py", line 544, in _request
await resp.start(conn)
File "/root/miniconda3/lib/python3.8/site-packages/aiohttp/client_reqrep.py", line 890, in start
message, payload = await self._protocol.read() # type: ignore
File "/root/miniconda3/lib/python3.8/site-packages/aiohttp/streams.py", line 604, in read
await self._waiter
aiohttp.client_exceptions.ServerDisconnectedError: Server disconnected
Would you say that this is another error that we should retry for? There are quite a few exceptions in aiohttp.client_exceptions:
aiohttp.client_exceptions.ClientConnectionError
aiohttp.client_exceptions.ClientConnectorCertificateError
aiohttp.client_exceptions.ClientConnectorError
aiohttp.client_exceptions.ClientConnectorSSLError
aiohttp.client_exceptions.ClientError
aiohttp.client_exceptions.ClientHttpProxyError
aiohttp.client_exceptions.ClientOSError
aiohttp.client_exceptions.ClientPayloadError
aiohttp.client_exceptions.ClientProxyConnectionError
aiohttp.client_exceptions.ClientResponse
aiohttp.client_exceptions.ClientResponseError
aiohttp.client_exceptions.ClientSSLError
aiohttp.client_exceptions.ContentTypeError
aiohttp.client_exceptions.ServerConnectionError
aiohttp.client_exceptions.ServerDisconnectedError
aiohttp.client_exceptions.ServerTimeoutError
aiohttp.client_exceptions.WSServerHandshakeError
This was definitely intermittent since the job had already been running a couple of hours, so I think retrying is appropriate. It seems hard to pick through these and figure out which actually should be retried, so another approach would be to just retry all with a reasonable limit on total retries?
Yes, I tend to agree with you, after having previously resisted the idea. There are just too many fail cases! The list is here, so this could be edited to add aiohttp.client_exceptions.ClientError, which seems to be the superclass of all of the others.
Please feel free to put that in a PR :)
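For anyone following along, the superclass point can be checked directly; most of the exceptions listed above derive from aiohttp.ClientError, so a single entry covers them:

```python
import aiohttp

# ClientError is the common base class of the client exceptions listed above,
# so adding it to the retriable list covers ServerDisconnectedError and friends.
assert issubclass(aiohttp.ServerDisconnectedError, aiohttp.ClientError)
assert issubclass(aiohttp.ClientOSError, aiohttp.ClientError)
assert issubclass(aiohttp.ClientPayloadError, aiohttp.ClientError)
```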
Can do. I'll test it out again with this workload and then make a PR.
Awesome, with #380 and #385 I'm now able to smoothly write multi-terabyte zarr arrays direct to GCS.
Similar to https://github.com/dask/gcsfs/issues/315, I saw this today after 10s of GB of data had already been written to a Zarr archive (i.e. this isn't a problem with initial writes, it appears to be something spurious in long-running jobs):
Any ideas what this could be or if it should be caught/retried somewhere?
gcsfs version: 0.7.1