fsspec / kerchunk

Cloud-friendly access to archival data
https://fsspec.github.io/kerchunk/
MIT License

How to use subchunking #467

Closed · rsignell closed this issue 2 months ago

rsignell commented 2 months ago

Using the fixes in #466, I'm able to overcome the NetCDF 64-bit offset issue in #465.

With this data, I'd like to use kerchunk.utils.subchunk, since the chunk size on variables like 'salinity' is about 1GB: [screenshot of the dataset's chunk sizes]

The syntax for kerchunk.utils.subchunk is subchunk(store, variable, factor) where factor is:

factor: int
        the number of chunks each input chunk turns into. Must be an exact divisor
        of the original largest dimension length.

In this dataset the biggest dimension is lon=4500, which is also the fastest varying dimension.

But when I tried factor=10, it didn't work. What am I doing wrong? Notebook here. [screenshot of the failing subchunk call]
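
Roughly what I'm attempting (a simplified sketch of the notebook; variable names here are illustrative):

from kerchunk.netCDF3 import NetCDF3ToZarr
from kerchunk.utils import subchunk

# build references for one HYCOM file (anonymous S3 access)
file0 = 'hycom-gofs-3pt1-reanalysis/2014/hycom_GLBv0.08_538_2014010112_t000.nc'
refs = NetCDF3ToZarr("s3://" + file0, storage_options={"anon": True},
                     inline_threshold=400, version=2).translate()

# split each ~1GB 'salinity' chunk into 10 smaller chunks -- this is the call that fails
refs = subchunk(refs, "salinity", 10)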

martindurant commented 2 months ago

The problem, as I hinted in a previous comment, is the leading length-1 dimension for time, which clearly doesn't divide into anything. I'll need to tweak the code to delve into deeper dimension(s) than the first.

martindurant commented 2 months ago

I should have explained:

the original largest dimension length

This means largest in terms of memory layout (slowest varying), not longest in element count. So time is the biggest dimension here, with length 1, followed by depth.
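
To illustrate (the lat size below is made up; lon=4500 and time=1 come from this dataset):

import numpy as np

shape = (1, 40, 2000, 4500)            # (time, depth, lat, lon); lat is invented
itemsize = np.dtype("f4").itemsize
# For C (row-major) layout, strides grow from right to left:
strides = [itemsize]
for n in reversed(shape[1:]):
    strides.insert(0, strides[0] * n)
print(strides)  # [1440000000, 36000000, 18000, 4]
# The first (time) axis has the largest stride, i.e. varies slowest, so it is
# the "largest" dimension in layout terms even though its length is only 1.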

martindurant commented 2 months ago

I should also note that if only zarr allowed intra-chunk ranges for uncompressed data, none of this would be necessary.

rsignell commented 2 months ago

So does that mean we can't subchunk this data because the length of time is 1? (i.e. only factor=1 is possible, which of course wouldn't do anything)

martindurant commented 2 months ago

we can't subchunk this data because the length of time is 1

Correct, for the current implementation. Of course, (1, 40, ...) -> (1, 4, ...) should be possible, if I get around to making the required changes.
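
For uncompressed data that would just mean dividing each chunk's byte range into equal contiguous pieces along the depth axis, something like (made-up offset/length):

# Hypothetical illustration: one reference entry (url, offset, length) for a
# (1, 40, lat, lon) chunk becomes `factor` entries, each covering (1, 4, lat, lon).
offset, length, factor = 123456, 1_040_000_000, 10
piece = length // factor               # factor must divide the chunk evenly
subrefs = [(offset + i * piece, piece) for i in range(factor)]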

rsignell commented 2 months ago

That would be great, as I think this would then actually allow this dataset to be accessed in a reasonable way!

martindurant commented 2 months ago

I have put a potential fix into #466 (which still has an unrelated outstanding test failure)

rsignell commented 2 months ago

I gave this potential fix a try, but now NetCDF3ToZarr seems broken?

from kerchunk.netCDF3 import NetCDF3ToZarr

file0 = 'hycom-gofs-3pt1-reanalysis/2014/hycom_GLBv0.08_538_2014010112_t000.nc'
d = NetCDF3ToZarr("s3://" + file0, storage_options={"anon": True},
                  inline_threshold=400, version=2).translate()

produces the error

---------------------------------------------------------------------------
ClientError                               Traceback (most recent call last)
File ~/miniforge3/envs/pangeo/lib/python3.11/site-packages/s3fs/core.py:113, in _error_wrapper(func, args, kwargs, retries)
    112 try:
--> 113     return await func(*args, **kwargs)
    114 except S3_RETRYABLE_ERRORS as e:

File ~/miniforge3/envs/pangeo/lib/python3.11/site-packages/aiobotocore/client.py:411, in AioBaseClient._make_api_call(self, operation_name, api_params)
    410     error_class = self.exceptions.from_code(error_code)
--> 411     raise error_class(parsed_response, operation_name)
    412 else:

ClientError: An error occurred (InvalidAccessKeyId) when calling the GetObject operation: The AWS Access Key Id you provided does not exist in our records.

The above exception was the direct cause of the following exception:

PermissionError                           Traceback (most recent call last)
File ~/miniforge3/envs/pangeo/lib/python3.11/site-packages/fsspec/asyn.py:245, in _run_coros_in_chunks.<locals>._run_coro(coro, i)
    244 try:
--> 245     return await asyncio.wait_for(coro, timeout=timeout), i
    246 except Exception as e:

File ~/miniforge3/envs/pangeo/lib/python3.11/asyncio/tasks.py:442, in wait_for(fut, timeout)
    441 if timeout is None:
--> 442     return await fut
    444 if timeout <= 0:

File ~/miniforge3/envs/pangeo/lib/python3.11/site-packages/s3fs/core.py:1128, in S3FileSystem._cat_file(self, path, version_id, start, end)
   1126         resp["Body"].close()
-> 1128 return await _error_wrapper(_call_and_read, retries=self.retries)

File ~/miniforge3/envs/pangeo/lib/python3.11/site-packages/s3fs/core.py:145, in _error_wrapper(func, args, kwargs, retries)
    144 err = translate_boto_error(err)
--> 145 raise err

File ~/miniforge3/envs/pangeo/lib/python3.11/site-packages/s3fs/core.py:113, in _error_wrapper(func, args, kwargs, retries)
    112 try:
--> 113     return await func(*args, **kwargs)
    114 except S3_RETRYABLE_ERRORS as e:

File ~/miniforge3/envs/pangeo/lib/python3.11/site-packages/s3fs/core.py:1115, in S3FileSystem._cat_file.<locals>._call_and_read()
   1114 async def _call_and_read():
-> 1115     resp = await self._call_s3(
   1116         "get_object",
   1117         Bucket=bucket,
   1118         Key=key,
   1119         **version_id_kw(version_id or vers),
   1120         **head,
   1121         **self.req_kw,
   1122     )
   1123     try:

File ~/miniforge3/envs/pangeo/lib/python3.11/site-packages/s3fs/core.py:365, in S3FileSystem._call_s3(self, method, *akwarglist, **kwargs)
    364 additional_kwargs = self._get_s3_method_kwargs(method, *akwarglist, **kwargs)
--> 365 return await _error_wrapper(
    366     method, kwargs=additional_kwargs, retries=self.retries
    367 )

File ~/miniforge3/envs/pangeo/lib/python3.11/site-packages/s3fs/core.py:145, in _error_wrapper(func, args, kwargs, retries)
    144 err = translate_boto_error(err)
--> 145 raise err

PermissionError: The AWS Access Key Id you provided does not exist in our records.

The above exception was the direct cause of the following exception:

ReferenceNotReachable                     Traceback (most recent call last)
Cell In[7], line 2
      1 d = NetCDF3ToZarr("s3://" + flist[0], storage_options={"anon": True},
----> 2                   inline_threshold=400, version=2).translate()

File ~/miniforge3/envs/pangeo/lib/python3.11/site-packages/kerchunk/netCDF3.py:288, in NetCDF3ToZarr.translate(self)
    280 z.attrs.update(
    281     {
    282         k: v.decode() if isinstance(v, bytes) else str(v)
   (...)
    285     }
    286 )
    287 if self.threshold:
--> 288     out = inline_array(out, self.threshold, remote_options=self.storage_options)
    290 if isinstance(out, LazyReferenceMapper):
    291     out.flush()

File ~/miniforge3/envs/pangeo/lib/python3.11/site-packages/kerchunk/utils.py:230, in inline_array(store, threshold, names, remote_options)
    226 fs = fsspec.filesystem(
    227     "reference", fo=store, **(remote_options or {}), skip_instance_cache=True
    228 )
    229 g = zarr.open_group(fs.get_mapper(), mode="r+")
--> 230 _inline_array(g, threshold, names=names or [])
    231 return fs.references

File ~/miniforge3/envs/pangeo/lib/python3.11/site-packages/kerchunk/utils.py:193, in _inline_array(group, threshold, names, prefix)
    187 if cond1 or cond2:
    188     original_attrs = dict(thing.attrs)
    189     arr = group.create_dataset(
    190         name=name,
    191         dtype=thing.dtype,
    192         shape=thing.shape,
--> 193         data=thing[:],
    194         chunks=thing.shape,
    195         compression=None,
    196         overwrite=True,
    197         fill_value=thing.fill_value,
    198     )
    199     arr.attrs.update(original_attrs)

File ~/miniforge3/envs/pangeo/lib/python3.11/site-packages/zarr/core.py:800, in Array.__getitem__(self, selection)
    798     result = self.get_orthogonal_selection(pure_selection, fields=fields)
    799 else:
--> 800     result = self.get_basic_selection(pure_selection, fields=fields)
    801 return result

File ~/miniforge3/envs/pangeo/lib/python3.11/site-packages/zarr/core.py:926, in Array.get_basic_selection(self, selection, out, fields)
    924     return self._get_basic_selection_zd(selection=selection, out=out, fields=fields)
    925 else:
--> 926     return self._get_basic_selection_nd(selection=selection, out=out, fields=fields)

File ~/miniforge3/envs/pangeo/lib/python3.11/site-packages/zarr/core.py:968, in Array._get_basic_selection_nd(self, selection, out, fields)
    962 def _get_basic_selection_nd(self, selection, out=None, fields=None):
    963     # implementation of basic selection for array with at least one dimension
    964 
    965     # setup indexer
    966     indexer = BasicIndexer(selection, self)
--> 968     return self._get_selection(indexer=indexer, out=out, fields=fields)

File ~/miniforge3/envs/pangeo/lib/python3.11/site-packages/zarr/core.py:1343, in Array._get_selection(self, indexer, out, fields)
   1340 if math.prod(out_shape) > 0:
   1341     # allow storage to get multiple items at once
   1342     lchunk_coords, lchunk_selection, lout_selection = zip(*indexer)
-> 1343     self._chunk_getitems(
   1344         lchunk_coords,
   1345         lchunk_selection,
   1346         out,
   1347         lout_selection,
   1348         drop_axes=indexer.drop_axes,
   1349         fields=fields,
   1350     )
   1351 if out.shape:
   1352     return out

File ~/miniforge3/envs/pangeo/lib/python3.11/site-packages/zarr/core.py:2179, in Array._chunk_getitems(self, lchunk_coords, lchunk_selection, out, lout_selection, drop_axes, fields)
   2177     if not isinstance(self._meta_array, np.ndarray):
   2178         contexts = ConstantMap(ckeys, constant=Context(meta_array=self._meta_array))
-> 2179     cdatas = self.chunk_store.getitems(ckeys, contexts=contexts)
   2181 for ckey, chunk_select, out_select in zip(ckeys, lchunk_selection, lout_selection):
   2182     if ckey in cdatas:

File ~/miniforge3/envs/pangeo/lib/python3.11/site-packages/zarr/storage.py:1435, in FSStore.getitems(self, keys, contexts)
   1432     continue
   1433 elif isinstance(v, Exception):
   1434     # Raise any other exception
-> 1435     raise v
   1436 else:
   1437     # The function calling this method may not recognize the transformed
   1438     # keys, so we send the values returned by self.map.getitems back into
   1439     # the original key space.
   1440     results[keys_transformed[k]] = v

ReferenceNotReachable: Reference "depth/0" failed to fetch target ['s3://hycom-gofs-3pt1-reanalysis/2014/hycom_GLBv0.08_538_2014010112_t000.nc', 4640, 320]

martindurant commented 2 months ago

Sorry, messed up passing remote options for inlining.
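
If I read the traceback right, the options were being spread as top-level kwargs of the reference filesystem; roughly, they need to reach the remote filesystem instead (a simplified sketch, not the literal patch in #466):

import fsspec

# Sketch only: the S3 options (anon=True) must be forwarded to the *remote*
# filesystem that resolves the byte-range references, e.g. via remote_options,
# rather than being passed as top-level kwargs of the reference filesystem.
refs = {}  # stand-in for the reference set being inlined
fs = fsspec.filesystem(
    "reference",
    fo=refs,
    remote_protocol="s3",
    remote_options={"anon": True},   # otherwise signed requests are attempted and fail
    skip_instance_cache=True,
)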

rsignell commented 2 months ago

Fixed in #466