The broken-pipe error certainly sounds like it should be retried, and it's odd that in this case we don't know more. Did you not even get an HTTP status code? There's probably a longer list of specific errors that should be considered retriable.
Note that dask allows you to retry whole failed tasks, and that might be a good safeguard against weird intermittent problems on a small number of tasks in a large graph. Of course, we'd like to get it fixed anyway.
Did you not even get an HTTP status code?
I didn't, and here's the full log for reference, where snakemake is running a script that uses local dask:
Building DAG of jobs...
Creating conda environment envs/gwas.yaml...
Downloading and installing remote packages.
Environment for envs/gwas.yaml created (location: .snakemake/conda/0a479a2e)
Using shell: /bin/bash
Provided cores: 64
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 bgen_to_zarr
1
Select jobs to execute...
[Fri Dec 4 12:37:26 2020]
rule bgen_to_zarr:
    input: rs-ukb/raw/gt-imputation/ukb_imp_chr15_v3.bgen, rs-ukb/raw/gt-imputation/ukb_mfi_chr15_v3.txt, rs-ukb/raw/gt-imputation/ukb59384_imp_chr15_v3_s487296.sample
    output: rs-ukb/prep/gt-imputation/ukb_chr15.ckpt
    jobid: 0
    wildcards: bgen_contig=15
    threads: 63
    resources: mem_mb=222822
Downloading from remote: rs-ukb/raw/gt-imputation/ukb59384_imp_chr15_v3_s487296.sample
Finished download.
Downloading from remote: rs-ukb/raw/gt-imputation/ukb_imp_chr15_v3.bgen
Finished download.
Downloading from remote: rs-ukb/raw/gt-imputation/ukb_mfi_chr15_v3.txt
Finished download.
Activating conda environment: /workdir/.snakemake/conda/0a479a2e
2020-12-04 12:51:13,013 | main | INFO | Loading BGEN dataset for contig Contig(name=15, index=14) from rs-ukb/raw/gt-imputation/ukb_imp_chr15_v3.bgen (chunks = (250, -1))
2020-12-04 12:52:04,036 | main | INFO | Rechunking dataset for contig Contig(name=15, index=14) to gs://rs-ukb/prep/gt-imputation/ukb_chr15.zarr (chunks = (5216, 5792)):
Is there a way to do that when working through Xarray?
I'm not sure - it's an optional argument to .compute().
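For what it's worth, here is a minimal sketch of one way to reach that argument from Xarray, assuming a dask.distributed Client is attached (retries= is honoured by the distributed scheduler); the bucket paths are placeholders:

```python
import xarray as xr
from dask.distributed import Client

client = Client()  # retries= below is handled by the distributed scheduler

ds = xr.open_zarr("gs://my-bucket/input.zarr")  # placeholder path

# compute=False returns a dask Delayed for the store step,
# which exposes the usual .compute() keyword arguments.
delayed = ds.to_zarr("gs://my-bucket/output.zarr", compute=False)
delayed.compute(retries=3)
```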
I saw this twice today outside of the context of Dask, once in Xarray as logged in https://github.com/pydata/xarray/issues/4704 and again when called from Pandas:
Traceback (most recent call last):
File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/gcsfs/core.py", line 507, in _call
self.validate_response(status, contents, json, path, headers)
File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/gcsfs/core.py", line 1228, in validate_response
raise HttpError(error)
gcsfs.utils.HttpError: Required
Traceback (most recent call last):
File "scripts/gwas.py", line 438, in <module>
fire.Fire()
File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/fire/core.py", line 138, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/fire/core.py", line 463, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/fire/core.py", line 672, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "scripts/gwas.py", line 401, in run_gwas
df.to_parquet(path)
File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/pandas/util/_decorators.py", line 214, in wrapper
return func(*args, **kwargs)
File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/pandas/core/frame.py", line 2109, in to_parquet
to_parquet(
File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/pandas/io/parquet.py", line 258, in to_parquet
return impl.write(
File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/pandas/io/parquet.py", line 112, in write
self.api.parquet.write_table(
File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/pyarrow/parquet.py", line 1733, in write_table
writer.write_table(table, row_group_size=row_group_size)
File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/pyarrow/parquet.py", line 591, in write_table
self.writer.write_table(table, row_group_size=row_group_size)
File "pyarrow/_parquet.pyx", line 1433, in pyarrow._parquet.ParquetWriter.write_table
File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/fsspec/spec.py", line 1352, in write
self.flush()
File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/fsspec/spec.py", line 1387, in flush
self._initiate_upload()
File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/gcsfs/core.py", line 1402, in _initiate_upload
self.location = sync(
File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/fsspec/asyn.py", line 71, in sync
raise exc.with_traceback(tb)
File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/fsspec/asyn.py", line 55, in f
result[0] = await future
File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/gcsfs/core.py", line 1486, in initiate_upload
headers, _ = await fs._call(
File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/gcsfs/core.py", line 525, in _call
raise e
File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/gcsfs/core.py", line 507, in _call
self.validate_response(status, contents, json, path, headers)
File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/gcsfs/core.py", line 1228, in validate_response
raise HttpError(error)
gcsfs.utils.HttpError: Required
I mention it because of the 3 errors I linked to in https://github.com/pydata/xarray/issues/4704, this one appears to be the most prevalent (and difficult to work around).
Can you please apply the following and see if you get extra information?
--- a/gcsfs/core.py
+++ b/gcsfs/core.py
@@ -1299,7 +1299,7 @@ class GCSFileSystem(AsyncFileSystem):
elif "invalid" in str(msg):
raise ValueError("Bad Request: %s\n%s" % (path, msg))
elif error:
- raise HttpError(error)
+ raise HttpError({"code": status, "message": error})
elif status:
raise HttpError({"code": status})
(I can push this to a branch, if that helps with installation)
The only mention I can find of something similar is this, where
required: The API request is missing required information. The required information could be a parameter or resource property.
which is not very descriptive.
Indeed, why not add the following for completeness
--- a/gcsfs/core.py
+++ b/gcsfs/core.py
@@ -1290,6 +1290,7 @@ class GCSFileSystem(AsyncFileSystem):
# TODO: limit to appropriate exceptions
msg = content
+ logger.debug("Error condition: %s" % ((status, content, json, path, headers), ))
if status == 404:
raise FileNotFoundError
elif status == 403:
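For anyone else applying the patch, a small sketch of how to surface that logger.debug line (this assumes the gcsfs module logger is registered under the "gcsfs" name; adjust if your version differs):

```python
import logging

# Attach a basic stderr handler and turn on debug output for gcsfs only,
# so the new "Error condition: ..." message shows up in the job logs.
logging.basicConfig()
logging.getLogger("gcsfs").setLevel(logging.DEBUG)
```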
Thanks @martindurant, I patched my client environment and will post anything that gets caught here.
I'm getting this same gcsfs.utils.HttpError: Required error with long-running zarr writes to GCS. The errors are common enough that jobs with ~100k chunks usually fail. I'll add these debug lines tomorrow and try running again to see if there is any more detail on the HttpError.
For Dask retries you may want to try the dask.annotate function with the retries= keyword. This will require the latest release, I think.
Actually, it may still be in a PR.
https://github.com/dask/distributed/pull/4347/
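For reference, the annotation approach would look roughly like the sketch below. This is illustrative only: it assumes a dask/distributed recent enough to carry annotations through to the scheduler (per the PR above), the paths are placeholders, and annotations can be dropped by graph optimisation depending on settings.

```python
import dask
import xarray as xr
from dask.distributed import Client

client = Client()

ds = xr.open_zarr("gs://my-bucket/input.zarr")  # placeholder path

# Tasks built inside the annotate block carry retries=3 as metadata;
# the distributed scheduler re-runs those tasks on failure.
with dask.annotate(retries=3):
    ds.to_zarr("gs://my-bucket/output.zarr")
```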
Here's the traceback:
File "/root/miniconda3/lib/python3.8/site-packages/xarray/core/dataset.py", line 1745, in to_zarr
return to_zarr(
File "/root/miniconda3/lib/python3.8/site-packages/xarray/backends/api.py", line 1482, in to_zarr
writes = writer.sync(compute=compute)
File "/root/miniconda3/lib/python3.8/site-packages/xarray/backends/common.py", line 155, in sync
delayed_store = da.store(
File "/root/miniconda3/lib/python3.8/site-packages/dask/array/core.py", line 945, in store
result.compute(**kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/dask/base.py", line 167, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/dask/base.py", line 452, in compute
results = schedule(dsk, keys, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/dask/threaded.py", line 76, in get
results = get_async(
File "/root/miniconda3/lib/python3.8/site-packages/dask/local.py", line 486, in get_async
raise_exception(exc, tb)
File "/root/miniconda3/lib/python3.8/site-packages/dask/local.py", line 316, in reraise
raise exc
File "/root/miniconda3/lib/python3.8/site-packages/dask/local.py", line 222, in execute_task
result = _execute_task(task, data)
File "/root/miniconda3/lib/python3.8/site-packages/dask/core.py", line 121, in _execute_task
return func(*(_execute_task(a, cache) for a in args))
File "/root/miniconda3/lib/python3.8/site-packages/dask/array/core.py", line 3632, in store_chunk
return load_store_chunk(x, out, index, lock, return_stored, False)
File "/root/miniconda3/lib/python3.8/site-packages/dask/array/core.py", line 3621, in load_store_chunk
out[index] = np.asanyarray(x)
File "/root/miniconda3/lib/python3.8/site-packages/zarr/core.py", line 1122, in __setitem__
self.set_basic_selection(selection, value, fields=fields)
File "/root/miniconda3/lib/python3.8/site-packages/zarr/core.py", line 1217, in set_basic_selection
return self._set_basic_selection_nd(selection, value, fields=fields)
File "/root/miniconda3/lib/python3.8/site-packages/zarr/core.py", line 1508, in _set_basic_selection_nd
self._set_selection(indexer, value, fields=fields)
File "/root/miniconda3/lib/python3.8/site-packages/zarr/core.py", line 1580, in _set_selection
self._chunk_setitems(lchunk_coords, lchunk_selection, chunk_values,
File "/root/miniconda3/lib/python3.8/site-packages/zarr/core.py", line 1709, in _chunk_setitems
self.chunk_store.setitems({k: v for k, v in zip(ckeys, cdatas)})
File "/root/miniconda3/lib/python3.8/site-packages/fsspec/mapping.py", line 110, in setitems
self.fs.pipe(values)
File "/root/miniconda3/lib/python3.8/site-packages/fsspec/asyn.py", line 121, in wrapper
return maybe_sync(func, self, *args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/fsspec/asyn.py", line 100, in maybe_sync
return sync(loop, func, *args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/fsspec/asyn.py", line 71, in sync
raise exc.with_traceback(tb)
File "/root/miniconda3/lib/python3.8/site-packages/fsspec/asyn.py", line 55, in f
result[0] = await future
File "/root/miniconda3/lib/python3.8/site-packages/fsspec/asyn.py", line 222, in _pipe
await asyncio.gather(
File "/root/miniconda3/lib/python3.8/site-packages/gcsfs/core.py", line 1007, in _pipe_file
return await simple_upload(
File "/root/miniconda3/lib/python3.8/site-packages/gcsfs/core.py", line 1524, in simple_upload
j = await fs._call(
File "/root/miniconda3/lib/python3.8/site-packages/gcsfs/core.py", line 525, in _call
raise e
File "/root/miniconda3/lib/python3.8/site-packages/gcsfs/core.py", line 507, in _call
self.validate_response(status, contents, json, path, headers)
File "/root/miniconda3/lib/python3.8/site-packages/gcsfs/core.py", line 1229, in validate_response
raise HttpError({"code": status, "message": error})
gcsfs.utils.HttpError: {'code': 400, 'message': 'Required', 'errors': [{'message': 'Required', 'domain': 'global', 'reason': 'required'}]}
The docs mentioned above are pretty cryptic so it's hard to know what is going wrong, but for now could this 400 be added to is_retriable in some form?
could this 400 be added to is_retriable
Yes, I suppose we can allow 400 in general or specifically look for this weird response. I hope it is indeed intermittent.
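For concreteness, the kind of change being discussed might look like the sketch below. This is illustrative only, not gcsfs's actual is_retriable code; the attribute name and status-code set are assumptions.

```python
RETRIABLE_CODES = {408, 429, 500, 502, 503, 504}

def is_retriable(exc):
    """Illustrative sketch: treat the usual transient HTTP codes, plus the
    odd 400/"Required" response reported in this issue, as retriable."""
    code = getattr(exc, "code", None)
    if code in RETRIABLE_CODES:
        return True
    # The intermittent failure seen here surfaces as a 400 with message "Required".
    return code == 400 and "Required" in str(exc)
```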
We have been running into somewhat the same error (#323). We have been running our script for the last few hours and were able to extract some more information about the error we were seeing. This code ran on a GCE VM.
The following errors happened consecutively:
2021-01-07 13:05:17,955 [validate_response] ERROR - Error condition: (504, b'', None, '<< REDACTED LINK TO FILE >>', <CIMultiDictProxy('X-GUploader-UploadID': '<< REDACTED >>', 'Content-Type': 'text/html; charset=UTF-8', 'Date': 'Thu, 07 Jan 2021 13:05:17 GMT', 'Vary': 'Origin', 'Vary': 'X-Origin', 'Expires': 'Thu, 07 Jan 2021 13:05:17 GMT', 'Cache-Control': 'private, max-age=0', 'Content-Length': '0', 'Server': 'UploadServer')>)
2021-01-07 13:05:19,237 [validate_response] ERROR - Error condition: (401, b'Anonymous caller does not have storage.objects.get access to the Google Cloud Storage object.', None, '<< REDACTED LINK TO SAME FILE >>', <CIMultiDictProxy('X-GUploader-UploadID': '<< REDACTED >>', 'Content-Type': 'text/html; charset=UTF-8', 'Date': 'Thu, 07 Jan 2021 13:05:19 GMT', 'Vary': 'Origin', 'Vary': 'X-Origin', 'WWW-Authenticate': 'Bearer realm="https://accounts.google.com/"', 'Expires': 'Thu, 07 Jan 2021 13:05:19 GMT', 'Cache-Control': 'private, max-age=0', 'Content-Length': '93', 'Server': 'UploadServer')>)
First, a gateway timeout happened (status: 504), after which the _call method retried and a 401 status code was received.
Interestingly, the VM does have access to the data, so the 401 status seems to be invalid in our case.
A possible solution would be to make the number of retries configurable (in order to mitigate the risk of reaching API call limits) and to retry on 401 as well. An exponential backoff would also be preferable.
How about: if the initial error (504) seems retriable, then we continue retrying whatever the subsequent errors are? Could you give that a try to see if it does the business - or maybe the first error changes something fundamental and a deeper kind of reset is needed.
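A rough sketch of a configurable exponential-backoff retry in that spirit (here simply retrying every exception up to a cap); the name, defaults, and structure are illustrative rather than gcsfs's implementation:

```python
import asyncio
import random

async def call_with_retries(func, *args, retries=6, base_delay=1.0, **kwargs):
    """Illustrative wrapper: keep retrying any exception with exponential
    backoff and jitter, up to a configurable number of attempts."""
    for attempt in range(retries):
        try:
            return await func(*args, **kwargs)
        except Exception:
            if attempt == retries - 1:
                raise
            await asyncio.sleep(base_delay * 2 ** attempt + random.random())
```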
I am seeing similar issues, but also see a KeyError for a missing DataArray (which I verified is on disk as part of a dataset) bubbling up through xarray/zarr. I'm leaving this info here in case it helps or is related. I grabbed HEAD, which is formatting the HttpErrors a little better and shows the status code in the traceback. Some of my jobs hit the 401, while others hit the KeyError. I don't see a 504. Both can be remedied by enough retries of a workload.
@chrisroat , was there again no message with the HTTP errors? The KeyError is not a surprise, that is a valid outcome from zarr when it seems like the file in question can't be reached.
Could one of the people on this thread please aggregate what we have learned into a PR, which either recognises the errors encountered here so that they become retriable, or else retries all errors by default except for a specific list that we know are not retriable?
@martindurant The HttpErrors only have the status code passed in.
My errors are 401 or KeyError. In general, I don't think a 401 should be retried if there is a chance it's real and a token has expired. In the @DPGrev case, it seems like the gateway error has triggered something, and it's possible the retries will all be 401s. @DPGrev, is it possible for you to check by using HEAD and adding 401 to the list in this function? Also, note that the back-off is exponential already.
For the KeyError case, it means there was no underlying error that was caught -- it seems tricky to remedy that if it is at the GCS layer.
I am getting the same error that was posted on Dec 31 in this thread (https://github.com/dask/gcsfs/issues/316#issuecomment-752994837), with a similar traceback and the 'Required: 400'. I understand this might have been solved by https://github.com/dask/gcsfs/pull/335, so I'll try using gcsfs HEAD now instead of 0.7.2. Though not getting the error is no proof that it is solved (it is quite intermittent and random), I'll post back here with my results.
Just noticed the fix in #380 so I tried out a test run with the latest changes. The good news is that did seem to fix the 400 errors. I made it through 1.7TB of a 2+TB dataset write (much further than before), but the bad news is I ran into another failure:
File "/root/miniconda3/lib/python3.8/site-packages/xarray/core/dataset.py", line 1790, in to_zarr
return to_zarr(
File "/root/miniconda3/lib/python3.8/site-packages/xarray/backends/api.py", line 1476, in to_zarr
writes = writer.sync(compute=compute)
File "/root/miniconda3/lib/python3.8/site-packages/xarray/backends/common.py", line 156, in sync
delayed_store = da.store(
File "/root/miniconda3/lib/python3.8/site-packages/dask/array/core.py", line 1041, in store
result.compute(**kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/dask/base.py", line 283, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/dask/base.py", line 565, in compute
results = schedule(dsk, keys, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/dask/threaded.py", line 76, in get
results = get_async(
File "/root/miniconda3/lib/python3.8/site-packages/dask/local.py", line 487, in get_async
raise_exception(exc, tb)
File "/root/miniconda3/lib/python3.8/site-packages/dask/local.py", line 317, in reraise
raise exc
File "/root/miniconda3/lib/python3.8/site-packages/dask/local.py", line 222, in execute_task
result = _execute_task(task, data)
File "/root/miniconda3/lib/python3.8/site-packages/dask/core.py", line 121, in _execute_task
return func(*(_execute_task(a, cache) for a in args))
File "/root/miniconda3/lib/python3.8/site-packages/dask/array/core.py", line 3975, in store_chunk
return load_store_chunk(x, out, index, lock, return_stored, False)
File "/root/miniconda3/lib/python3.8/site-packages/dask/array/core.py", line 3962, in load_store_chunk
out[index] = x
File "/root/miniconda3/lib/python3.8/site-packages/zarr/core.py", line 1211, in __setitem__
self.set_basic_selection(selection, value, fields=fields)
File "/root/miniconda3/lib/python3.8/site-packages/zarr/core.py", line 1306, in set_basic_selection
return self._set_basic_selection_nd(selection, value, fields=fields)
File "/root/miniconda3/lib/python3.8/site-packages/zarr/core.py", line 1597, in _set_basic_selection_nd
self._set_selection(indexer, value, fields=fields)
File "/root/miniconda3/lib/python3.8/site-packages/zarr/core.py", line 1669, in _set_selection
self._chunk_setitems(lchunk_coords, lchunk_selection, chunk_values,
File "/root/miniconda3/lib/python3.8/site-packages/zarr/core.py", line 1861, in _chunk_setitems
self.chunk_store.setitems(values)
File "/root/miniconda3/lib/python3.8/site-packages/zarr/storage.py", line 1068, in setitems
self.map.setitems(values)
File "/root/miniconda3/lib/python3.8/site-packages/fsspec/mapping.py", line 111, in setitems
self.fs.pipe(values)
File "/root/miniconda3/lib/python3.8/site-packages/fsspec/asyn.py", line 72, in wrapper
return sync(self.loop, func, *args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/fsspec/asyn.py", line 53, in sync
raise result[0]
File "/root/miniconda3/lib/python3.8/site-packages/fsspec/asyn.py", line 20, in _runner
result[0] = await coro
File "/root/miniconda3/lib/python3.8/site-packages/fsspec/asyn.py", line 178, in _pipe
await asyncio.gather(
File "/root/miniconda3/lib/python3.8/site-packages/gcsfs/core.py", line 1076, in _pipe_file
location = await initiate_upload(self, bucket, key, content_type, metadata)
File "/root/miniconda3/lib/python3.8/site-packages/gcsfs/core.py", line 1586, in initiate_upload
headers, _ = await fs._call(
File "/root/miniconda3/lib/python3.8/site-packages/gcsfs/core.py", line 531, in _call
status, headers, info, contents = await self._request(
File "/root/miniconda3/lib/python3.8/site-packages/gcsfs/core.py", line 505, in _request
async with self.session.request(
File "/root/miniconda3/lib/python3.8/site-packages/aiohttp/client.py", line 1117, in __aenter__
self._resp = await self._coro
File "/root/miniconda3/lib/python3.8/site-packages/aiohttp/client.py", line 544, in _request
await resp.start(conn)
File "/root/miniconda3/lib/python3.8/site-packages/aiohttp/client_reqrep.py", line 890, in start
message, payload = await self._protocol.read() # type: ignore
File "/root/miniconda3/lib/python3.8/site-packages/aiohttp/streams.py", line 604, in read
await self._waiter
aiohttp.client_exceptions.ServerDisconnectedError: Server disconnected
Would you say that this is another error that we should retry for? There are quite a few exceptions in aiohttp.client_exceptions:
aiohttp.client_exceptions.ClientConnectionError
aiohttp.client_exceptions.ClientConnectorCertificateError
aiohttp.client_exceptions.ClientConnectorError
aiohttp.client_exceptions.ClientConnectorSSLError
aiohttp.client_exceptions.ClientError
aiohttp.client_exceptions.ClientHttpProxyError
aiohttp.client_exceptions.ClientOSError
aiohttp.client_exceptions.ClientPayloadError
aiohttp.client_exceptions.ClientProxyConnectionError
aiohttp.client_exceptions.ClientResponse
aiohttp.client_exceptions.ClientResponseError
aiohttp.client_exceptions.ClientSSLError
aiohttp.client_exceptions.ContentTypeError
aiohttp.client_exceptions.ServerConnectionError
aiohttp.client_exceptions.ServerDisconnectedError
aiohttp.client_exceptions.ServerTimeoutError
aiohttp.client_exceptions.WSServerHandshakeError
This was definitely intermittent since the job had already been running a couple of hours, so I think retrying is appropriate. It seems hard to pick through these and figure out which actually should be retried, so another approach would be to just retry all with a reasonable limit on total retries?
Yes, I tend to agree with you, after having previously resisted the idea. There are just too many fail cases! The list is here, so this could be edited to add aiohttp.client_exceptions.ClientError, which seems to be the superclass of all of the others.
Please feel free to put that in a PR :)
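For anyone following along, the superclass point can be checked directly; most of the exceptions listed above derive from aiohttp.ClientError, so a single entry covers them:

```python
import aiohttp

# ClientError is the common base class of the client exceptions listed above,
# so adding it to the retriable list covers ServerDisconnectedError and friends.
assert issubclass(aiohttp.ServerDisconnectedError, aiohttp.ClientError)
assert issubclass(aiohttp.ClientOSError, aiohttp.ClientError)
assert issubclass(aiohttp.ClientPayloadError, aiohttp.ClientError)
```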
Can do. I'll test it out again with this workload and then make a PR.
Awesome, with #380 and #385 I'm now able to smoothly write multi-terabyte zarr arrays direct to GCS.
Similar to https://github.com/dask/gcsfs/issues/315, I saw this today after 10s of GB of data had already been written to a Zarr archive (i.e. this isn't a problem with initial writes, it appears to be something spurious in long-running jobs):
Any ideas what this could be or if it should be caught/retried somewhere?
gcsfs version: 0.7.1