fsspec / kerchunk

Cloud-friendly access to archival data
https://fsspec.github.io/kerchunk/
MIT License

Kerchunk fails to produce `Data variables` from NASA ATL08 data. #264

Closed hyoklee closed 1 year ago

hyoklee commented 1 year ago

The ATL08 test data has 3 datasets at the root level. Here's the h5ls output:

METADATA                 Group
ancillary_data           Group
ds_geosegments           Dataset {5}
ds_metrics               Dataset {9}
ds_surf_type             Dataset {5}
gt1r                     Group
orbit_info               Group
quality_assessment       Group

However, none of the groups appears in the print(ds) output from my ATL08.py test code.

<xarray.Dataset>
Dimensions:         (ds_geosegments: 5, ds_metrics: 9)
Coordinates:
  * ds_geosegments  (ds_geosegments) float32 1.0 2.0 3.0 4.0 5.0
  * ds_metrics      (ds_metrics) float32 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0
Data variables:
    *empty*
Attributes: (12/47)
    Conventions:                        CF-1.6

See my test Action workflow result for easy verification.

martindurant commented 1 year ago

xarray assumes input datasets follow the netCDF4 layout. If you load the h5 file directly with xarray, you also get no variables:

In [28]: xr.open_dataset("ATL08_20181014084920_02400109_003_01.h5")
Out[28]:
<xarray.Dataset>
Dimensions:         (ds_geosegments: 5, ds_metrics: 9, ds_surf_type: 5)
Coordinates:
  * ds_geosegments  (ds_geosegments) int8 1 2 3 4 5
  * ds_metrics      (ds_metrics) int8 1 2 3 4 5 6 7 8 9
  * ds_surf_type    (ds_surf_type) int32 1 2 3 4 5
Data variables:
    *empty*

However, kerchunk does faithfully reproduce this structure:

>>> out = kerchunk.hdf.SingleHdf5ToZarr("ATL08_20181014084920_02400109_003_01.h5").translate()
>>> m = fsspec.get_mapper("reference://", fo=out)
>>> zarr.open(m).tree()
/
 ├── METADATA
 │   ├── AcquisitionInformation
 │   │   ├── lidar
 │   │   ├── lidarDocument
 │   │   ├── platform
 │   │   └── platformDocument
 │   ├── DataQuality
 │   │   ├── CompletenessOmission
 │   │   └── DomainConsistency
 │   ├── DatasetIdentification
 │   ├── Extent
 │   ├── Lineage
 │   │   ├── ANC06-01
 │   │   ├── ANC06-02
 │   │   ├── ANC06-03
...

And you can change the path given to get_mapper if you only want to access part of the data tree, e.g., fsspec.get_mapper("reference://gt1r", fo=out). So the question is: how do you think xarray ought to load this data, and what should it look like?
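The path-scoping idea can be shown with a minimal hand-built reference set (a stand-in for the much larger dict that SingleHdf5ToZarr(...).translate() returns; real output also carries chunk references):

```python
import json
import fsspec

# Two zarr group entries: one at the root, one under gt1r.
refs = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "gt1r/.zgroup": json.dumps({"zarr_format": 2}),
    },
}

# A path after reference:// roots the mapper at that subtree, so only
# gt1r's keys are visible (relative to gt1r).
m = fsspec.get_mapper("reference://gt1r", fo=refs)
print(list(m))
```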

hyoklee commented 1 year ago

It should look like this 1.4M JSON, which HPD generated.

The current Kerchunk output JSON is too small (180K) and is missing too many things.

ATL08_20181014084920_02400109_003_01
martindurant commented 1 year ago

What exactly is the kerchunk version missing, and why should it be the same size as this other tool's output?

Continuing:

>>> g = zarr.open(m)
>>> g.attrs["time_type"]
'CCSDS UTC-A'
>>> g.attrs.todict()
{'Conventions': 'CF-1.6',
 'citation': 'Cite these data in publications as follows: The data used in this study were produced by the ICESat-2 Science Project Office at NASA/GSFC. The data archive site is the NASA National Snow and Ice Data Center Distributed Active Archive Center.',
 'contributor_name': 'Thomas E Neumann (thomas.neumann@nasa.gov), Thorsten Markus (thorsten.markus@nasa.gov), Suneel Bhardwaj (suneel.bhardwaj@nasa.gov) David W Hancock III (david.w.hancock@nasa.gov)',
 'contributor_role': 'Instrument Engineer, Investigator, Principle Investigator, Data Producer, Data Producer',
 'creator_name': 'GSFC I-SIPS > ICESat-2 Science Investigator-led Processing System',
 'date_created': '2020-04-01T14:03:26.000000Z',
 'date_type': 'UTC',
...

All of those attributes are indeed there, if you are familiar with the zarr API.

hyoklee commented 1 year ago

You're missing all of the dataset attributes, since Kerchunk/xarray misses the datasets completely. If the data were read and processed correctly, the H5JSON, Kerchunk JSON, and DMR++ sizes should be close. Kerchunk is off by 10x.

martindurant commented 1 year ago

Sorry, but which dataset is missing? Can you be precise?

hyoklee commented 1 year ago

See the CDL and compare it against the CDL output from my workflow.

martindurant commented 1 year ago

I'm sorry, I have no idea what that is.

Please can you give a specific example of something in the HDF5 file that you cannot access via kerchunk/zarr?

martindurant commented 1 year ago

OK, so the data are stored in COMPACT form, which we don't support. Turning on logging for "h5-to-zarr" would have revealed this.
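For reference, that logging hint can be followed like this ("h5-to-zarr" is the logger name given in the comment above):

```python
import logging

# Route log records to stderr and turn on debug output for kerchunk's
# HDF5 scanner, which reports datasets it skips (e.g. unsupported storage).
logging.basicConfig()
logging.getLogger("h5-to-zarr").setLevel(logging.DEBUG)

# ...then run kerchunk.hdf.SingleHdf5ToZarr(...).translate() as above
# and watch for messages about skipped or unsupported datasets.
```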