fsspec / kerchunk

Cloud-friendly access to archival data
https://fsspec.github.io/kerchunk/
MIT License

Issue reading GZIP compressed GRIB2 remote file #384

Open keltonhalbert opened 11 months ago

keltonhalbert commented 11 months ago

This feels like it's probably an edge case, but I wanted to bring it up in case there's an opportunity to help fix it.

I'm attempting to read a GZIP compressed GRIB2 file from an AWS store, which can be found here: s3://noaa-mrms-pds/CONUS/MergedReflectivityAtLowestAltitude_00.50/20230419/MRMS_MergedReflectivityAtLowestAltitude_00.50_20230419-235043.grib2.gz

I'm able to call scan_grib successfully:

from kerchunk.grib2 import scan_grib

mrms_file_name = "noaa-mrms-pds/CONUS/MergedReflectivityAtLowestAltitude_00.50/20230419/MRMS_MergedReflectivityAtLowestAltitude_00.50_20230419-235043.grib2.gz"
scan = scan_grib("s3://" + mrms_file_name, storage_options={"anon": True, "compression": "gzip"})

I won't dump the whole output, but here is an excerpt as proof:

[{'version': 1, 'refs': {'.zgroup': '{"zarr_format":2}', '.zattrs': '{"GRIB_centre":"161","GRIB_centreDescription":"161","GRIB_edition":2,"GRIB_subCentre":0,"coordinates":"heightAboveSea latitude longitude step time valid_time","institution":"161"}', 'unknown/.zarray': '{"chunks":[3500,7000],"compressor":null,"dtype":"<f8","fill_value":null,"filters":[{"dtype":"float64","id":"grib","var":"unknown"}],"order":"C","shape":[3500,7000],"zarr_format":2}', 'unknown/0.0': ['{{u}}', 0, 5670492], 'unknown/.zattrs': '{"GRIB_NV":0,"GRIB_Nx":7000,"GRIB_Ny":3500,"GRIB_cfName":"unknown","GRIB_cfVarName":"unknown","GRIB_dataType":"ra","GRIB_gridDefinitionDescription":"Latitude\\/longitude","GRIB_gridType":"regular_ll","GRIB_iDirectionIncrementInDegrees":0.01,"GRIB_iScansNegatively":0,"GRIB_jDirectionIncrementInDegrees":0.01,"GRIB_jPointsAreConsecutive":0,"GRIB_jScansPositively":0,"GRIB_latitudeOfFirstGridPointInDegrees":54.995,"GRIB_latitudeOfLastGridPointInDegrees":20.005001,"GRIB_longitudeOfFirstGridPointInDegrees":230.005,"GRIB_longitudeOfLastGridPointInDegrees":299.994998,"GRIB_missingValue":3.4028234663852886e+38,"GRIB_name":"unknown","GRIB_numberOfPoints":24500000,"GRIB_paramId":0,"GRIB_shortName":"unknown","GRIB_stepType":"instant","GRIB_stepUnits":1,"GRIB_typeOfLevel":"heightAboveSea","GRIB_units":"unknown","_ARRAY_DIMENSIONS":["latitude","longitude"],"long_name":"unknown","standard_name":"unknown","units":"unknown"}', 'heightAboveSea/.zarray': '{"chunks":[],"compressor":null,"dtype":"<f8","fill_value":null,"filters":[{"dtype":"float64","id":"grib","var":"heightAboveSea"}],"order":"C","shape":[],"zarr_format":2}', 'heightAboveSea/': ['{{u}}', 0, 5670492], 'heightAboveSea/.zattrs': '{"_ARRAY_DIMENSIONS":[]}', 'latitude/.zarray': '{"chunks":[3500],"compressor":null,"dtype":"<f8","fill_value":null,"filters":null,"order":"C","shape":[3500],"zarr_format":2}', 'latitude/0': 'base64:j8L1KFx/S0CuR+F6 ... ... ...

However, when I try to access the array values, they are all zeros...

import xarray as xr

# "filename" refers to the reference set generated by scan_grib above
ds = xr.open_dataset(filename, engine="kerchunk")
print(ds["unknown"])
print(ds["unknown"].values.min(), ds["unknown"].values.max())
...
Coordinates:
    heightAboveSea  float64 0.0
  * latitude        (latitude) float64 54.99 54.98 54.98 ... 20.03 20.02 20.01
  * longitude       (longitude) float64 230.0 230.0 230.0 ... 300.0 300.0 300.0
    step            timedelta64[ns] 00:00:00
    time            datetime64[ns] 1970-01-01
    valid_time      datetime64[ns] ...
Attributes: (12/29)
    GRIB_NV:                                  0
    GRIB_Nx:                                  7000
    GRIB_Ny:                                  3500
    GRIB_cfName:                              unknown
    GRIB_cfVarName:                           unknown
    GRIB_dataType:                            ra
    ...                                       ...
    GRIB_stepUnits:                           1
    GRIB_typeOfLevel:                         heightAboveSea
    GRIB_units:                               unknown
    long_name:                                unknown
    standard_name:                            unknown
    units:                                    unknown
0.0 0.0

If I read the uncompressed grib2 file natively with cfgrib/xarray, it works just fine.

<xarray.DataArray 'unknown' (latitude: 3500, longitude: 7000)>
[24500000 values with dtype=float32]
Coordinates:
    time            datetime64[ns] 2023-04-19T23:50:00
    step            timedelta64[ns] 00:00:00
    heightAboveSea  float64 500.0
  * latitude        (latitude) float64 54.99 54.98 54.98 ... 20.03 20.02 20.01
  * longitude       (longitude) float64 230.0 230.0 230.0 ... 300.0 300.0 300.0
    valid_time      datetime64[ns] ...
Attributes: (12/29)
    GRIB_paramId:                             0
    GRIB_dataType:                            ra
    GRIB_numberOfPoints:                      24500000
    GRIB_typeOfLevel:                         heightAboveSea
    GRIB_stepUnits:                           1
    GRIB_stepType:                            instant
    ...                                       ...
    GRIB_name:                                unknown
    GRIB_shortName:                           unknown
    GRIB_units:                               unknown
    long_name:                                unknown
    units:                                    unknown
    standard_name:                            unknown
-999.0 68.7

Presumably this has to do with the remote store being gzip compressed: when xarray/zarr goes to read the array's byte range, something goes wrong. However, no errors are returned, just an array full of zeros. Is it possible to propagate the storage options, or to let Zarr know about the gzip compression? FWIW, I tried setting inline_threshold in scan_grib to an absurdly large value to inline the array data, and I still get the same result: the min and max are zero.
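A quick way to see the suspected mismatch (a hedged check, assuming anonymous public access to the bucket):

import fsspec

url = ("s3://noaa-mrms-pds/CONUS/MergedReflectivityAtLowestAltitude_00.50/"
       "20230419/MRMS_MergedReflectivityAtLowestAltitude_00.50_"
       "20230419-235043.grib2.gz")

# A plain ranged read against the object returns the compressed bytes:
with fsspec.open(url, mode="rb", anon=True) as f:
    print(f.read(4))   # b'\x1f\x8b...' -- gzip magic, not "GRIB"

# Only a decompressing read lines up with the offsets scan_grib recorded:
with fsspec.open(url, mode="rb", anon=True, compression="gzip") as f:
    print(f.read(4))   # b'GRIB' -- start of the grib2 message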

Any ideas on where and how to start tracking this one down?

keltonhalbert commented 11 months ago

I see there was some discussion about a library that could facilitate this kind of data access in #281.

If I can get some insight into how and where this should be implemented, I'd be happy to take a crack at it.

keltonhalbert commented 11 months ago

Perhaps it would be possible to generate the gzip index sidecar during scan_grib and save it as a base64-encoded metadata field that can be decoded later? And then somehow incorporate that info into dereference_archives?
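Something along these lines, maybe (a rough sketch, assuming a local copy of the file and indexed_gzip's build_full_index/export_index API; the local filename is just a placeholder):

import base64
import io

import indexed_gzip as igzip

# Build a full seek-point index over a local copy of the .grib2.gz file,
# export it, and base64-encode it so it could travel with the references.
with igzip.IndexedGzipFile("local_copy.grib2.gz", spacing=1024 * 1024) as f:
    f.build_full_index()          # one checkpoint every `spacing` bytes
    buf = io.BytesIO()
    f.export_index(fileobj=buf)   # serialised zran index

gzidx_b64 = base64.b64encode(buf.getvalue()).decode("ascii")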

martindurant commented 11 months ago

Yes, you are completely on the right lines for the kind of work it would take to be able to reference byte ranges within a compressed file. Tying up the pieces would probably not be that simple... You should be aware that the gzip version of indexing (as opposed to bzip2 or zstd) requires storing a rather large amount of data, 32kB per checkpoint. The current state of indexed_gzip doesn't allow you to pick your checkpoints, but we could generate many and pick only the ones we need in a two-step process.

keltonhalbert commented 11 months ago

Yeah, I don't imagine this will be simple in the slightest, but it would certainly be cool if it worked!

Right now I'm just trying to wrap my head around the indexed_gzip library and how kerchunk actually needs to interface with it. While I have some experience working with things like grib2/hdf5/netcdf at a low-ish level, I've never really worked with archives or had to think about how they're stored.

This might be naive or dumb, but I did notice this in the indexed_gzip documentation for IndexedGzipFile:

:arg auto_build:       If ``True`` (the default), the index is
                               automatically built on calls to :meth:`seek`.

Does this imply that an index can be built based on the calls to seek? If so, maybe it would be possible to build the index for the seek points as scan_grib is decoding the grib2 messages with eccodes/cfgrib, which in turn could provide the bare minimum number of checkpoints needed to read arrays from their start bytes? My thought is that scan_grib is already reading the decompressed grib2 metadata, so the uncompressed byte ranges could be used to generate the index... but perhaps I'm misunderstanding some terminology here.

If I'm not totally out to lunch here, then the size of the side-car file would scale with the number of grib2 messages/arrays in the file. Probably not ideal, but neither is storing gzip compressed grib2 data in a cloud storage bucket. I'd prefer it if people would just make their data usable to begin with, but that's a dream that'll never come true :).

martindurant commented 11 months ago

Does this imply that an index can be built based off of the calls to seek?

No; when you seek forward in the file, indexed_gzip will write all of the checkpoints up to that point. This is because gzip must be streamed through in order to know where you are up to at the bit level. It should be possible to only save checkpoints of interest, but that would require editing the code in indexed_gzip. From the outside, probably the best we can do is to generate all the checkpoints to a local file, at a reasonably small spacing. Next, take a second pass through and keep only the ones immediately before grib message offsets - maybe that is small enough to inline into a references file, or maybe we store this as a separate sidecar (I am leaning to the latter).
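For the read side, a hedged sketch of what dereferencing could then look like (the "mrms.gzidx" sidecar name is hypothetical; the offset/length pair is the one recorded for unknown/0.0 in the scan above):

import fsspec
import indexed_gzip as igzip

url = ("s3://noaa-mrms-pds/CONUS/MergedReflectivityAtLowestAltitude_00.50/"
       "20230419/MRMS_MergedReflectivityAtLowestAltitude_00.50_"
       "20230419-235043.grib2.gz")
offset, length = 0, 5670492           # byte range of "unknown/0.0" in the references

with fsspec.open(url, mode="rb", anon=True) as raw:
    with igzip.IndexedGzipFile(fileobj=raw, auto_build=False,
                               drop_handles=False) as f:
        f.import_index("mrms.gzidx")  # checkpoints kept from the first pass
        f.seek(offset)                # offsets are in the *uncompressed* stream
        msg = f.read(length)          # raw grib2 message bytes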

martindurant commented 11 months ago

the size of the side-car file would scale with the number of grib2 messages/arrays in the file

Yes, 32kB per offset. That is probably a bit big to store in a JSON references file (where the blocks would need to be base64 encoded) once there are several messages per file, let alone many files combined. Naturally, these 32kB blocks will not compress well.
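As a rough back-of-envelope: base64 inflates data by about a third, so each 32kB checkpoint becomes roughly 43kB of JSON text; ten messages in a file would already mean ~430kB of index, before multiplying by the number of files you want to combine.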