cedadev / cmip6-object-store

CMIP6 Object Store Library
BSD 3-Clause "New" or "Revised" License

Wrap the entire "write-to-zarr" process in a try/except to overcome connection issues #29

Closed: agstephens closed this issue 4 years ago

agstephens commented 4 years ago

Connection issues I have seen so far:

INFO:/home/users/astephen/cmip6/cmip6-object-store/cmip6_object_store/cmip6_zarr/task.py:Chunks: Frozen(SortedKeysDict({'time': (120,), 'axis_nbounds': (2,), 'lat': (143,), 'lon': (144,)}))
Traceback (most recent call last):
  File "/apps/jasmin/jaspy/miniconda_envs/jaspy3.7/m3-4.6.14/envs/jaspy3.7-m3-4.6.14-r20200606/lib/python3.7/site-packages/fsspec/mapping.py", line 75, in __getitem__
    result = self.fs.cat(k)
  File "/apps/jasmin/jaspy/miniconda_envs/jaspy3.7/m3-4.6.14/envs/jaspy3.7-m3-4.6.14-r20200606/lib/python3.7/site-packages/fsspec/spec.py", line 587, in cat
    return self.open(path, "rb").read()
  File "/apps/jasmin/jaspy/miniconda_envs/jaspy3.7/m3-4.6.14/envs/jaspy3.7-m3-4.6.14-r20200606/lib/python3.7/site-packages/fsspec/spec.py", line 775, in open
    **kwargs
  File "/home/users/astephen/.local/lib/python3.7/site-packages/s3fs/core.py", line 378, in _open
    autocommit=autocommit, requester_pays=requester_pays)
  File "/home/users/astephen/.local/lib/python3.7/site-packages/s3fs/core.py", line 1097, in __init__
    cache_type=cache_type)
  File "/apps/jasmin/jaspy/miniconda_envs/jaspy3.7/m3-4.6.14/envs/jaspy3.7-m3-4.6.14-r20200606/lib/python3.7/site-packages/fsspec/spec.py", line 1065, in __init__
    self.details = fs.info(path)
  File "/home/users/astephen/.local/lib/python3.7/site-packages/s3fs/core.py", line 527, in info
    if self.version_aware or (key and self._ls_from_cache(path) is None) or refresh:
  File "/apps/jasmin/jaspy/miniconda_envs/jaspy3.7/m3-4.6.14/envs/jaspy3.7-m3-4.6.14-r20200606/lib/python3.7/site-packages/fsspec/spec.py", line 313, in _ls_from_cache
    raise FileNotFoundError(path)
FileNotFoundError: cmip6-test-8/371a44c8-cfab-4682-9b5f-ebbdd9266af1.zarr/.zgroup

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/apps/jasmin/jaspy/miniconda_envs/jaspy3.7/m3-4.6.14/envs/jaspy3.7-m3-4.6.14-r20200606/lib/python3.7/site-packages/zarr/hierarchy.py", line 113, in __init__
    meta_bytes = store[mkey]
  File "/apps/jasmin/jaspy/miniconda_envs/jaspy3.7/m3-4.6.14/envs/jaspy3.7-m3-4.6.14-r20200606/lib/python3.7/site-packages/fsspec/mapping.py", line 79, in __getitem__
    raise KeyError(key)
KeyError: '.zgroup'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "cmip6_object_store/cmip6_zarr/cli.py", line 160, in <module>
    sys.exit(main())  # pragma: no cover
  File "cmip6_object_store/cmip6_zarr/cli.py", line 155, in main
    args.func(args)
  File "cmip6_object_store/cmip6_zarr/cli.py", line 80, in run_main
    tm.run_tasks()
  File "/home/users/astephen/cmip6/cmip6-object-store/cmip6_object_store/cmip6_zarr/task.py", line 176, in run_tasks
    task.run()
  File "/home/users/astephen/cmip6/cmip6-object-store/cmip6_object_store/cmip6_zarr/task.py", line 102, in _run_local
    delayed_obj = chunked_ds.to_zarr(store=store_map, mode='w', consolidated=True, compute=False)
  File "/apps/jasmin/jaspy/miniconda_envs/jaspy3.7/m3-4.6.14/envs/jaspy3.7-m3-4.6.14-r20200606/lib/python3.7/site-packages/xarray/core/dataset.py", line 1634, in to_zarr
    append_dim=append_dim,
  File "/apps/jasmin/jaspy/miniconda_envs/jaspy3.7/m3-4.6.14/envs/jaspy3.7-m3-4.6.14-r20200606/lib/python3.7/site-packages/xarray/backends/api.py", line 1338, in to_zarr
    consolidate_on_close=consolidated,
  File "/apps/jasmin/jaspy/miniconda_envs/jaspy3.7/m3-4.6.14/envs/jaspy3.7-m3-4.6.14-r20200606/lib/python3.7/site-packages/xarray/backends/zarr.py", line 261, in open_group
    zarr_group = zarr.open_group(store, **open_kwargs)
  File "/apps/jasmin/jaspy/miniconda_envs/jaspy3.7/m3-4.6.14/envs/jaspy3.7-m3-4.6.14-r20200606/lib/python3.7/site-packages/zarr/hierarchy.py", line 1169, in open_group
    synchronizer=synchronizer, path=path, chunk_store=chunk_store)
  File "/apps/jasmin/jaspy/miniconda_envs/jaspy3.7/m3-4.6.14/envs/jaspy3.7-m3-4.6.14-r20200606/lib/python3.7/site-packages/zarr/hierarchy.py", line 115, in __init__
    err_group_not_found(path)
  File "/apps/jasmin/jaspy/miniconda_envs/jaspy3.7/m3-4.6.14/envs/jaspy3.7-m3-4.6.14-r20200606/lib/python3.7/site-packages/zarr/errors.py", line 25, in err_group_not_found
    raise ValueError('group not found at path %r' % path)
ValueError: group not found at path ''

And, from LOTUS nodes:

INFO:/home/users/astephen/cmip6/cmip6-object-store/cmip6_object_store/cmip6_zarr/task.py:Running conversion locally: 4
INFO:/home/users/astephen/cmip6/cmip6-object-store/cmip6_object_store/cmip6_zarr/task.py:Processing: CMIP6.CMIP.CNRM-CERFACS.CNRM-CM6-1.historical.r10i1p1f2.Amon.ta.gr.v20190125
Traceback (most recent call last):
  File "/home/users/astephen/.local/lib/python3.7/site-packages/s3fs/core.py", line 428, in mkdir
    self.s3.create_bucket(**params)
  File "/apps/jasmin/jaspy/miniconda_envs/jaspy3.7/m3-4.6.14/envs/jaspy3.7-m3-4.6.14-r20200606/lib/python3.7/site-packages/botocore/client.py", line 316, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/apps/jasmin/jaspy/miniconda_envs/jaspy3.7/m3-4.6.14/envs/jaspy3.7-m3-4.6.14-r20200606/lib/python3.7/site-packages/botocore/client.py", line 635, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (SwarmError) when calling the CreateBucket operation: <html><body><h2>CAStor Error</h2><br>Replication failed with response code 409 (expected 100 or 201)</body></html>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/apps/slurm/spool/slurmd/job15860516/slurm_script", line 160, in <module>
    sys.exit(main())  # pragma: no cover
  File "/apps/slurm/spool/slurmd/job15860516/slurm_script", line 155, in main
    args.func(args)
  File "/apps/slurm/spool/slurmd/job15860516/slurm_script", line 80, in run_main
    tm.run_tasks()
  File "/home/users/astephen/cmip6/cmip6-object-store/cmip6_object_store/cmip6_zarr/task.py", line 166, in run_tasks
    task.run()
  File "/home/users/astephen/cmip6/cmip6-object-store/cmip6_object_store/cmip6_zarr/task.py", line 58, in _run_local
    store.create_bucket(bucket)
  File "/home/users/astephen/cmip6/cmip6-object-store/cmip6_object_store/cmip6_zarr/caringo_store.py", line 18, in create_bucket
    self._fs.mkdir(bucket_id)
  File "/home/users/astephen/.local/lib/python3.7/site-packages/s3fs/core.py", line 432, in mkdir
    raise translate_boto_error(e)
OSError: [Errno 5] An error occurred (SwarmError) when calling the CreateBucket operation: <html><body><h2>CAStor Error</h2><br>Replication failed with response code 409 (expected 100 or 201)</body></html>

agstephens commented 4 years ago

Put "retries = 3" in the config and have the CaringoStore class use it to retry failed operations. See if that helps.
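
A minimal sketch of what such a retry wrapper could look like, assuming plain Python and no extra dependencies. This is not the code in the repository: the helper name `with_retries`, the `delay` parameter and the linear back-off are hypothetical. The default exception types are chosen only because the tracebacks above end in `ValueError` and `OSError`.

```python
import time


def with_retries(func, retries=3, delay=1.0, exceptions=(OSError, ValueError)):
    """Call func() and return its result, retrying on the listed exceptions.

    `retries` is the total number of attempts (hypothetical config value).
    The last failure is re-raised unchanged so the traceback is preserved.
    """
    if retries < 1:
        raise ValueError("retries must be >= 1")

    for attempt in range(1, retries + 1):
        try:
            return func()
        except exceptions:
            if attempt == retries:
                raise  # out of attempts: surface the original error
            time.sleep(delay * attempt)  # simple linear back-off (assumption)
```

It could then wrap the two calls that fail in the tracebacks above, for example `with_retries(lambda: store.create_bucket(bucket))` and `with_retries(lambda: chunked_ds.to_zarr(store=store_map, mode='w', consolidated=True, compute=False))`. Re-running `to_zarr` with `mode='w'` should overwrite a partially written store, but that is worth checking against the object store's behaviour.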

agstephens commented 4 years ago

Closing as the retry approach seems to work.