fsspec / kerchunk

Cloud-friendly access to archival data
https://fsspec.github.io/kerchunk/
MIT License
297 stars · 78 forks

Geoh5py reading in s3 #138

Open RichardScottOZ opened 2 years ago

RichardScottOZ commented 2 years ago

An example of an HDF5-based file format.

https://geoh5py.readthedocs.io/en/stable/

"The geoh5py library has been created for the manipulation and storage of a wide range of geoscientific data (points, curve, surface, 2D and 3D grids) in geoh5 file format. Users will be able to directly leverage the powerful visualization capabilities of Geoscience ANALYST along with open-source code from the Python ecosystem."

So I use it quite a bit.

In fact, I now have 200K similar files in a couple of company s3 buckets.

e.g. [image attachment]

So last night I started wondering if I could access them quickly from s3, as opposed to reading each one in full, extracting what I want, and having several terabytes lying around.

I remembered the pangeo netcdf/hdf5 discussion etc, so I started having a look.

This would be pretty useful.

e.g. something like this? @rsignell-usgs

import kerchunk.hdf
import s3fs

s3 = s3fs.S3FileSystem(profile='appropriateprofile')

urls = [
    's3://bananasplits/100075.geoh5',
]
so = dict(
    anon=False, default_fill_cache=False, default_cache_type='first'
)
singles = []
for u in urls:
    with s3.open(u, **so) as inf:
        h5chunks = kerchunk.hdf.SingleHdf5ToZarr(inf, u, inline_threshold=100)
        singles.append(h5chunks.translate())
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_24276/251853116.py in <module>
     13     with s3.open(u, **so) as inf:
     14         h5chunks = kerchunk.hdf.SingleHdf5ToZarr(inf, u, inline_threshold=100)
---> 15         singles.append(h5chunks.translate())

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\kerchunk\hdf.py in translate(self)
     71         lggr.debug('Translation begins')
     72         self._transfer_attrs(self._h5f, self._zroot)
---> 73         self._h5f.visititems(self._translator)
     74         if self.inline > 0:
     75             self._do_inline(self.inline)

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\h5py\_hl\group.py in visititems(self, func)
    611                 name = self._d(name)
    612                 return func(name, self[name])
--> 613             return h5o.visit(self.id, proxy)
    614 
    615     @with_phil

h5py\_objects.pyx in h5py._objects.with_phil.wrapper()

h5py\_objects.pyx in h5py._objects.with_phil.wrapper()

h5py\h5o.pyx in h5py.h5o.visit()

h5py\h5o.pyx in h5py.h5o.cb_obj_simple()

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\h5py\_hl\group.py in proxy(name)
    610                 """ Use the text name of the object, not bytes """
    611                 name = self._d(name)
--> 612                 return func(name, self[name])
    613             return h5o.visit(self.id, proxy)
    614 

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\kerchunk\hdf.py in _translator(self, name, h5obj)
    188 
    189             # Create a Zarr array equivalent to this HDF5 dataset...
--> 190             za = self._zroot.create_dataset(h5obj.name, shape=h5obj.shape,
    191                                             dtype=h5obj.dtype,
    192                                             chunks=h5obj.chunks or False,

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\zarr\hierarchy.py in create_dataset(self, name, **kwargs)
    806         """
    807 
--> 808         return self._write_op(self._create_dataset_nosync, name, **kwargs)
    809 
    810     def _create_dataset_nosync(self, name, data=None, **kwargs):

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\zarr\hierarchy.py in _write_op(self, f, *args, **kwargs)
    659 
    660         with lock:
--> 661             return f(*args, **kwargs)
    662 
    663     def create_group(self, name, overwrite=False):

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\zarr\hierarchy.py in _create_dataset_nosync(self, name, data, **kwargs)
    818         # create array
    819         if data is None:
--> 820             a = create(store=self._store, path=path, chunk_store=self._chunk_store,
    821                        **kwargs)
    822 

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\zarr\creation.py in create(shape, chunks, dtype, compressor, fill_value, order, store, synchronizer, overwrite, path, chunk_store, filters, cache_metadata, cache_attrs, read_only, object_codec, dimension_separator, **kwargs)
    134 
    135     # initialize array metadata
--> 136     init_array(store, shape=shape, chunks=chunks, dtype=dtype, compressor=compressor,
    137                fill_value=fill_value, order=order, overwrite=overwrite, path=path,
    138                chunk_store=chunk_store, filters=filters, object_codec=object_codec,

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\zarr\storage.py in init_array(store, shape, chunks, dtype, compressor, fill_value, order, overwrite, path, chunk_store, filters, object_codec, dimension_separator)
    350     _require_parent_group(path, store=store, chunk_store=chunk_store, overwrite=overwrite)
    351 
--> 352     _init_array_metadata(store, shape=shape, chunks=chunks, dtype=dtype,
    353                          compressor=compressor, fill_value=fill_value,
    354                          order=order, overwrite=overwrite, path=path,

~\AppData\Local\Continuum\anaconda3\envs\stuartshelf\lib\site-packages\zarr\storage.py in _init_array_metadata(store, shape, chunks, dtype, compressor, fill_value, order, overwrite, path, chunk_store, filters, object_codec, dimension_separator)
    427             if not filters:
    428                 # there are no filters so we can be sure there is no object codec
--> 429                 raise ValueError('missing object_codec for object array')
    430             else:
    431                 # one of the filters may be an object codec, issue a warning rather

ValueError: missing object_codec for object array
RichardScottOZ commented 2 years ago

E.g., tinkering

{'version': 1,
 'refs': {'.zgroup': '{"zarr_format":2}',
  'SAMPLING_GRID/.zarray': '{\n    "chunks": [\n        1\n    ],\n    "compressor": null,\n    "dtype": "<i4",\n    "fill_value": 0,\n    "filters": null,\n    "order": "C",\n    "shape": [\n        1\n    ],\n    "zarr_format": 2\n}',
  'SAMPLING_GRID/0': '\x00\x00\x00\x00',
  'SAMPLING_GRID/.zattrs': '{\n    "_ARRAY_DIMENSIONS": [\n        "SAMPLING_GRID"\n    ]\n}',
  'X/.zarray': '{\n    "chunks": [\n        1\n    ],\n    "compressor": null,\n    "dtype": "<f8",\n    "fill_value": 0.0,\n    "filters": null,\n    "order": "C",\n    "shape": [\n        0\n    ],\n    "zarr_format": 2\n}',
  'X/.zattrs': '{\n    "_ARRAY_DIMENSIONS": [\n        "X"\n    ]\n}',
  'Y/.zarray': '{\n    "chunks": [\n        1\n    ],\n    "compressor": null,\n    "dtype": "<f8",\n    "fill_value": 0.0,\n    "filters": null,\n    "order": "C",\n    "shape": [\n        0\n    ],\n    "zarr_format": 2\n}',
  'Y/.zattrs': '{\n    "_ARRAY_DIMENSIONS": [\n        "Y"\n    ]\n}',
  'Z/.zarray': '{\n    "chunks": [\n        1\n    ],\n    "compressor": null,\n    "dtype": "<f8",\n    "fill_value": 0.0,\n    "filters": null,\n    "order": "C",\n    "shape": [\n        0\n    ],\n    "zarr_format": 2\n}',
  'Z/.zattrs': '{\n    "_ARRAY_DIMENSIONS": [\n        "Z"\n    ]\n}',
  'GEOSCIENCE/Data/{2e932682-0772-4342-9e13-423af5e8bf80}/Data/.zarray': '{"chunks":[1,938],"compressor":{"id":"zlib","level":9},"dtype":"<f8","fill_value":0.0,"filters":null,"order":"C","shape":[1,15000],"zarr_format":2}',
  'GEOSCIENCE/Data/{2e932682-0772-4342-9e13-423af5e8bf80}/Data/.zattrs': '{"_ARRAY_DIMENSIONS":["SAMPLING_GRID","phony_dim_0"]}',
  'GEOSCIENCE/Data/{2e932682-0772-4342-9e13-423af5e8bf80}/Data/0.Data/{2e932682-0772-4342-9e13-423af5e8bf80}/Data/0': ['s3://easi-oz/BlockGolem/results/MiraDataPrep/100075.geoh5',
   56363221,
   3481],
  'GEOSCIENCE/Data/{2e932682-0772-4342-9e13-423af5e8bf80}/Data/0.Data/{2e932682-0772-4342-9e13-423af5e8bf80}/Data/1': ['s3://easi-oz/BlockGolem/results/MiraDataPrep/100075.geoh5',
   56366702,
   3289],
  'GEOSCIENCE/Data/{2e932682-0772-4342-9e13-423af5e8bf80}/Data/0.Data/{2e932682-0772-4342-9e13-423af5e8bf80}/Data/2': ['s3://easi-oz/BlockGolem/results/MiraDataPrep/100075.geoh5',
   56369991,
   3149],
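The reference set above can be interpreted with plain Python: `.zarray`/`.zattrs` keys hold JSON metadata inline, while each chunk key maps either to inline bytes or to a `[url, offset, length]` triple telling a reader which byte range of the original HDF5 file to fetch. A minimal sketch (the `refs` dict and `chunk_byte_range` helper here are hypothetical, just mirroring the structure shown above):

```python
import json

def chunk_byte_range(refs, key):
    """Return (url, offset, length) for a chunk key, or None if the
    chunk data is stored inline in the reference file itself."""
    entry = refs[key]
    if isinstance(entry, list) and len(entry) == 3:
        url, offset, length = entry
        return url, offset, length
    return None

# hypothetical refs, mirroring the structure dumped above
refs = {
    "X/.zarray": json.dumps({"chunks": [1], "dtype": "<f8", "shape": [0]}),
    "data/0": ["s3://example-bucket/100075.geoh5", 56363221, 3481],
    "SAMPLING_GRID/0": "\x00\x00\x00\x00",  # small chunk stored inline
}

# a reader would issue a ranged GET for just these bytes of the .geoh5
print(chunk_byte_range(refs, "data/0"))
```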
RichardScottOZ commented 2 years ago

Geoh5py has (N, 3) type things as opposed to (X, Y, Z) as well, which is not so xarray-friendly.

martindurant commented 2 years ago

So this is exactly the problem. In strict netCDF layout, it's clear which variable depends on which coordinate, and that is what the "_ARRAY_DIMENSIONS"s are about (this is an xarray-specific convention to carry the same information in a zarr dataset).

That the arrays are (N, 3) is not a problem in itself, and I would assume that for every such array, you would get a combined (T, N, 3) where T is the timestamp of each file; but where would these times be stored in the output tree, and is there any tool that can make use of such a zarr dataset? Can geoh5py access multiple HDF files as a single dataset at the moment?
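The (T, N, 3) combination described above can be sketched in plain Python (no xarray): each file contributes an (N, 3) array of rows, and stacking T files keeps the per-file key (timestamp or grid id) as the leading coordinate. The `stack_files` helper and inputs here are hypothetical:

```python
def stack_files(per_file):
    """per_file: dict mapping file_id -> list of (x, y, z) rows.
    Returns the sorted file ids (the T axis) and the stacked
    (T, N, 3) structure as nested lists."""
    keys = sorted(per_file)
    stacked = [per_file[k] for k in keys]
    return keys, stacked

# hypothetical inputs: two files, each with the same N=2 sample locations
files = {
    "100075": [(0.0, 0.0, 10.0), (1.0, 0.0, 12.0)],
    "100076": [(0.0, 0.0, 11.0), (1.0, 0.0, 13.0)],
}
keys, cube = stack_files(files)
print(keys)                                       # the leading "T" coordinate
print(len(cube), len(cube[0]), len(cube[0][0]))   # the (T, N, 3) shape
```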

RichardScottOZ commented 2 years ago

Timestamp or gridid, something like that. I suppose it would be 'top level' as such?

No, no multiple file access - it would be more an analysis tool I think. You can import lots of them into one 'project' as such, but I think this one would be a bit big for a workstation.

The interesting part: you wanted 100K histograms, for example, without downloading/moving many terabytes, that sort of idea. Or you suddenly decided you wanted to look at the inversion at iteration 5 for that area, which was 1876 grid locations.
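The access pattern behind this use case: instead of downloading a whole multi-gigabyte .geoh5 file, read only the byte range a kerchunk reference points at. Against s3 this would be a ranged GET via fsspec/s3fs; in this sketch a local temp file stands in, and the offsets are made up:

```python
import os
import tempfile

def read_chunk(path, offset, length):
    """Read only `length` bytes starting at `offset` -- the same
    pattern as a ranged GET against an object store."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)

# build a stand-in "archive": 100 bytes of padding, a 5-byte chunk, more padding
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x00" * 100 + b"chunk" + b"\x00" * 100)
    path = tmp.name

data = read_chunk(path, 100, 5)  # only 5 bytes read, not the whole file
print(data)
os.unlink(path)
```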

RichardScottOZ commented 2 years ago

I have achieved this, too, so thanks!

martindurant commented 2 years ago

Would love to have this linked in the kerchunk docs examples page.

RichardScottOZ commented 2 years ago

Hopefully I'll write something up for you shortly.

RichardScottOZ commented 2 years ago

@martindurant - here you go https://github.com/RichardScottOZ/Kerchunk-geoh5/blob/main/Kerchunk-Block-Model-Analysis.ipynb

martindurant commented 2 years ago

Thank you! That looks very nice and ...complicated :) We'll definitely include a link.

I wonder if you have a simple explanation of what the "z['GEOSCIENCE']['Data']['{00e6c2c7-eefe-4618-b9e6-3fc08177b96b}']['Data']" hierarchy means, and what the (two) arrays contained therein are.

Also, how do the different input files of the total dataset relate to one another?

Why bananas?!

RichardScottOZ commented 2 years ago

Yeah, 3D is a bit more complicated.

Yes, I should put a link in, at least to the geoh5py docs.

Banana Splits, first thing I thought of that is obviously not a real work bucket!