Closed hyoklee closed 1 year ago
xarray assumes input datasets follow the netCDF4 layout. If you load the HDF5 file directly with xarray, you also get no variables at the root group:
In [28]: xr.open_dataset("ATL08_20181014084920_02400109_003_01.h5")
Out[28]:
<xarray.Dataset>
Dimensions: (ds_geosegments: 5, ds_metrics: 9, ds_surf_type: 5)
Coordinates:
* ds_geosegments (ds_geosegments) int8 1 2 3 4 5
* ds_metrics (ds_metrics) int8 1 2 3 4 5 6 7 8 9
* ds_surf_type (ds_surf_type) int32 1 2 3 4 5
Data variables:
*empty*
However, kerchunk does faithfully reproduce this structure:
>>> out = kerchunk.hdf.SingleHdf5ToZarr("ATL08_20181014084920_02400109_003_01.h5").translate()
>>> m = fsspec.get_mapper("reference://", fo=out)
>>> zarr.open(m).tree()
/
├── METADATA
│ ├── AcquisitionInformation
│ │ ├── lidar
│ │ ├── lidarDocument
│ │ ├── platform
│ │ └── platformDocument
│ ├── DataQuality
│ │ ├── CompletenessOmission
│ │ └── DomainConsistency
│ ├── DatasetIdentification
│ ├── Extent
│ ├── Lineage
│ │ ├── ANC06-01
│ │ ├── ANC06-02
│ │ ├── ANC06-03
...
And you can change the path passed to get_mapper if you only want to access part of the data tree, e.g. fsspec.get_mapper("reference://gt1r", fo=out).
So the question is: how do you think xarray ought to load this data, and what should it look like?
What exactly is the kerchunk version missing, and why should it be the same size as this other tool's output?
Continuing:
>>> g = zarr.open(m)
>>> g.attrs["time_type"]
'CCSDS UTC-A'
>>> g.attrs.todict()
{'Conventions': 'CF-1.6',
'citation': 'Cite these data in publications as follows: The data used in this study were produced by the ICESat-2 Science Project Office at NASA/GSFC. The data archive site is the NASA National Snow and Ice Data Center Distributed Active Archive Center.',
'contributor_name': 'Thomas E Neumann (thomas.neumann@nasa.gov), Thorsten Markus (thorsten.markus@nasa.gov), Suneel Bhardwaj (suneel.bhardwaj@nasa.gov) David W Hancock III (david.w.hancock@nasa.gov)',
'contributor_role': 'Instrument Engineer, Investigator, Principle Investigator, Data Producer, Data Producer',
'creator_name': 'GSFC I-SIPS > ICESat-2 Science Investigator-led Processing System',
'date_created': '2020-04-01T14:03:26.000000Z',
'date_type': 'UTC',
...
All of those attributes are indeed there, if you are familiar with the zarr API.
You're missing all dataset attributes since kerchunk/xarray misses the datasets completely. If you read and processed the data correctly, the H5JSON, kerchunk JSON, and DMR++ outputs should be close in size. Kerchunk is off by 10X.
Sorry, but which dataset is missing? Can you be precise?
I'm sorry, I have no idea what that is.
Please can you give a specific example of something in the HDF5 file that you cannot access via kerchunk/zarr?
OK, so the data are stored in COMPACT form, which we don't support. Turning on logging for "h5-to-zarr" would have revealed this.
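For anyone hitting the same wall: compact-layout datasets can be spotted with h5py's low-level API before running kerchunk. Compact data live inside the object header rather than at a chunk offset, so there is nothing for a reference file to point at. A sketch on a synthetic file (the file and dataset names are made up):

```python
import h5py

# Force the COMPACT layout for a tiny dataset; the high-level h5py API
# does not expose this, so use the dataset-creation property list.
dcpl = h5py.h5p.create(h5py.h5p.DATASET_CREATE)
dcpl.set_layout(h5py.h5d.COMPACT)

with h5py.File("compact.h5", "w") as f:
    space = h5py.h5s.create_simple((3,))
    h5py.h5d.create(f.id, b"small", h5py.h5t.NATIVE_INT32, space, dcpl)

# Detection: inspect the layout of each dataset before kerchunking.
with h5py.File("compact.h5", "r") as f:
    layout = f["small"].id.get_create_plist().get_layout()
    print(layout == h5py.h5d.COMPACT)  # True
```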
The test data ATL08 has 3 datasets. Here's the `h5ls` output:

However, none of them appears in the `print(ds)` output from my ATL08.py test code.
See my test Action workflow result for easy verification.