keltonhalbert opened 11 months ago
I see there was some discussion about a library that could facilitate this kind of data access in #281.
If I can get some insight into how and where this should be implemented, I'd be happy to take a crack at it.
Perhaps it would be possible to generate the gzip index sidecar during scan_grib and save it as a metadata field, as a base64 string that can be decoded? And then somehow incorporate that info into dereference_archives?
Yes, you are completely on the right lines for the kind of work it would take to reference byte ranges within a compressed file. Tying the pieces together would probably not be that simple... You should be aware that the gzip version of indexing (as opposed to bzip2 or zstd) requires storing a rather large amount of data: 32kB per checkpoint. The current state of indexed_gzip doesn't allow you to pick your checkpoints, but we could generate many and keep only the ones we need in a two-step process.
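To see why a checkpoint has to carry real decoder state (and, for gzip, a 32kB window of prior output), here is a stdlib-only sketch using `zlib.Decompress.copy()`. This is not indexed_gzip itself, just an illustration: the copy plays the role of a checkpoint, letting you resume decompression mid-stream without re-reading everything before it.

```python
import zlib

# Mildly compressible data standing in for a gzipped GRIB file.
data = b"".join(bytes([i % 251]) * 37 for i in range(5000))
comp = zlib.compress(data)

# Stream through the first half once, then snapshot the decoder state.
# This is the essence of a checkpoint: indexed_gzip serialises the
# equivalent state (including a 32kB window of output) to disk.
d = zlib.decompressobj()
first_half = d.decompress(comp[: len(comp) // 2])
checkpoint = d.copy()                        # snapshot mid-stream
rest_a = d.decompress(comp[len(comp) // 2:]) + d.flush()

# Resume from the checkpoint without re-reading the first half.
rest_b = checkpoint.decompress(comp[len(comp) // 2:]) + checkpoint.flush()

assert first_half + rest_a == data
assert rest_b == rest_a
```

The point is that the snapshot is only valid at the exact byte where it was taken, which is why checkpoints must be generated by streaming through the whole file first.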
Yeah, I don't imagine this will be simple in the slightest, but it would certainly be cool if it worked!
Right now I'm just trying to wrap my head around the indexed_gzip library and how kerchunk actually needs to interface with it. While I have some experience working with things like grib2/hdf5/netcdf at a low-ish level, I've never really worked with archives or had to think about how they're stored.
This might be naive or dumb, but I did notice this in the indexed_gzip documentation for IndexedGzipFile:
:arg auto_build: If ``True`` (the default), the index is
automatically built on calls to :meth:`seek`.
Does this imply that an index can be built based off of the calls to seek? If so, maybe it would be possible to build the index for the seek points as scan_grib is decoding the grib2 message with eccodes/cfgrib, which in turn could be used to provide the bare minimum number of checkpoints to read arrays from their start bytes? My thought is that scan_grib is already appropriately reading the decompressed grib2 metadata, in which the uncompressed byte ranges can be used to generate the index... but perhaps I'm misunderstanding some terminology here.
If I'm not totally out to lunch here, then the size of the side-car file would scale with the number of grib2 messages/arrays in the file. Probably not ideal, but neither is storing gzip compressed grib2 data in a cloud storage bucket. I'd prefer it if people would just make their data usable to begin with, but that's a dream that'll never come true :).
> Does this imply that an index can be built based off of the calls to seek?
No; when you seek forward in the file, indexed_gzip will write all of the checkpoints up to that point. This is because gzip must be streamed through in order to know where you are up to at the bit level. It should be possible to only save checkpoints of interest, but that would require editing the code in indexed_gzip. From the outside, probably the best we can do is to generate all the checkpoints to a local file, at a reasonably small spacing. Next, take a second pass through and keep only the ones immediately before grib message offsets - maybe that is small enough to inline into a references file, or maybe we store this as a separate sidecar (I am leaning to the latter).
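The second pass described above reduces to a simple selection: for each message offset, keep the checkpoint at or immediately before it. A sketch, assuming both inputs are sorted lists of uncompressed byte offsets (the function and argument names are illustrative, not an existing kerchunk API):

```python
import bisect

def select_checkpoints(checkpoint_offsets, message_offsets):
    """Keep, for each GRIB message, only the checkpoint at or
    immediately before its uncompressed offset. Both arguments are
    sorted lists of uncompressed byte offsets."""
    keep = set()
    for m in message_offsets:
        i = bisect.bisect_right(checkpoint_offsets, m) - 1
        if i >= 0:
            keep.add(checkpoint_offsets[i])
    return sorted(keep)

# e.g. checkpoints every 1 MiB, three messages:
select_checkpoints([0, 1 << 20, 2 << 20, 3 << 20],
                   [1_500_000, 2_200_000, 2_400_000])
# -> [1048576, 2097152]
```

The spacing of the first-pass checkpoints then bounds how much data must be streamed past before reaching any given message.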
> the size of the side-car file would scale with the number of grib2 messages/arrays in the file
Yes, 32kB per offset. Maybe a bit big to store in a JSON file (where they would need to be base64 encoded) for several messages, let alone for combinations of potentially many files. Naturally, these 32kB blocks will not compress well.
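As a quick check on the JSON overhead: base64 inflates each 32kB window to nearly 43kB of text, before any structural overhead in the references file.

```python
import base64

window = bytes(32 * 1024)            # one checkpoint's decompression window
encoded = base64.b64encode(window)
print(len(window), len(encoded))     # 32768 43692  (4/3 expansion, rounded up)
```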
This feels like it's probably an edge case, but I wanted to bring it up in case there's an opportunity to help fix it.
I'm attempting to read a GZIP compressed GRIB2 file from an AWS store, which can be found here: s3://noaa-mrms-pds/CONUS/MergedReflectivityAtLowestAltitude_00.50/20230419/MRMS_MergedReflectivityAtLowestAltitude_00.50_20230419-235043.grib2.gz
I'm able to call scan_grib successfully:
I won't dump the whole output, but example/proof:
However, when I try to access the array values, they are all zeros...
If I read the uncompressed grib2 file natively with cfgrib/xarray, it works just fine.
Presumably, this has to do with the remote store being gzip compressed, and when xarray/zarr goes to read the array's byte range, something goes wrong. However, there are no errors returned, just an array full of zeros. Is it possible to propagate the storage options, or let Zarr know about the GZIP compression? FWIW, I tried setting inline_threshold in scan_grib to an absurdly large value to inline the array data, and I still get the same result: min and max are zero.
Any ideas on where and how to start tracking this one down?
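Until byte-range reads through gzip are supported, one workaround (a sketch, not a kerchunk recommendation) is to decompress the object to a local file first and run scan_grib on the uncompressed copy, so the byte ranges in the references index real GRIB bytes rather than positions in the gzip stream:

```python
import gzip
import os
import shutil
import tempfile

# Stand-in payload; in practice this would be the downloaded .gz object.
payload = b"GRIB" + b"\x00" * 60

tmpdir = tempfile.mkdtemp()
gz_path = os.path.join(tmpdir, "msg.grib2.gz")
raw_path = os.path.join(tmpdir, "msg.grib2")

with gzip.open(gz_path, "wb") as f:
    f.write(payload)

# Decompress to disk before scanning.
with gzip.open(gz_path, "rb") as src, open(raw_path, "wb") as dst:
    shutil.copyfileobj(src, dst)

# refs = scan_grib(raw_path)   # kerchunk call; offsets now point into GRIB data
with open(raw_path, "rb") as f:
    roundtrip = f.read()
assert roundtrip == payload
```

This would also explain the zeros: the references currently point at offsets within the compressed stream, so the bytes handed to the GRIB decoder are gzip data, not the message sections it expects.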