JackKelly / hypergrib

Lazily open petabyte-scale GRIB datasets consisting of trillions of GRIB messages from xarray
MIT License

Get list of init datetimes by reading GEFS directory names #22

Open JackKelly opened 1 day ago

JackKelly commented 1 day ago

We can get all the init datetimes "just" by reading the directory names, perhaps using ObjectStore::list_with_delimiter, which is not recursive. Note, though, that the docs for object_store::ListResult say "Individual result sets may be limited to 1,000 objects based on the underlying object storage’s limitations."

If this is too slow or doesn't work, then we may have to use the native cloud storage APIs. For example, GCP includes a matchGlob query parameter that could be very useful. Or maybe we could extend object_store to implement match_glob. BUT! AWS ListObjectsV2 doesn't appear to support a similar function.

Parse the datetime string into a DateTime<Utc>.
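
A rough sketch of the parsing step, using only the standard library. It assumes the directory names look like `gefs.YYYYMMDD/HH/`, which should be checked against the actual bucket. The names `InitTime` and `parse_init_time` are hypothetical, made up for this illustration. The fields it returns could then be passed to chrono (for example `Utc.with_ymd_and_hms(year, month, day, hour, 0, 0).single()`) to get a DateTime<Utc> and to reject impossible dates such as 30 February.

```rust
/// Calendar fields of one init time (UTC). Hypothetical type for this sketch.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub struct InitTime {
    pub year: i32,
    pub month: u32,
    pub day: u32,
    pub hour: u32,
}

/// Parse a directory prefix such as "gefs.20170101/06/" into its fields.
/// The "gefs.YYYYMMDD/HH/" layout is an assumption, not taken from the issue.
/// Returns None if the prefix does not match that layout.
pub fn parse_init_time(prefix: &str) -> Option<InitTime> {
    let mut parts = prefix.trim_end_matches('/').split('/');
    let date = parts.next()?.strip_prefix("gefs.")?;
    let hour = parts.next()?;
    if parts.next().is_some() {
        return None; // deeper than "date/hour": not an init-time directory
    }
    let all_digits = |s: &str| s.bytes().all(|b| b.is_ascii_digit());
    if date.len() != 8 || hour.len() != 2 || !all_digits(date) || !all_digits(hour) {
        return None;
    }
    let year: i32 = date[0..4].parse().ok()?;
    let month: u32 = date[4..6].parse().ok()?;
    let day: u32 = date[6..8].parse().ok()?;
    let hour: u32 = hour.parse().ok()?;
    // Range checks only; full calendar validation is left to chrono.
    if !(1..=12).contains(&month) || !(1..=31).contains(&day) || hour > 23 {
        return None;
    }
    Some(InitTime { year, month, day, hour })
}
```

Returning an Option keeps unrelated entries in the bucket (anything that is not an init-time directory) from aborting the whole listing; they can simply be filtered out.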

JackKelly commented 1 day ago

Hmm, I think that aws::list_with_delimiter just uses the default implementation of list_with_delimiter, which doesn't push anything down; it just uses list_paginated and finds the common prefixes on the client side.
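
To make "finds the common prefixes on the client side" concrete, here is a minimal standard-library sketch of the idea. It is not object_store's actual code; `common_prefixes` is a hypothetical helper and the keys in the usage note are invented. The point is that the client has to see every key under the prefix before it can work out the directory names.

```rust
use std::collections::BTreeSet;

/// Given a full (already downloaded) listing of keys, return the distinct
/// "directories" directly below `prefix`, each ending in `delimiter`.
/// Keys that sit directly under `prefix` (no further delimiter) are skipped.
pub fn common_prefixes(keys: &[&str], prefix: &str, delimiter: char) -> Vec<String> {
    let mut dirs = BTreeSet::new(); // de-duplicates and keeps results sorted
    for &key in keys {
        // Ignore keys outside the requested prefix.
        let Some(rest) = key.strip_prefix(prefix) else {
            continue;
        };
        // The first delimiter after the prefix marks the end of the directory name.
        if let Some(idx) = rest.find(delimiter) {
            let end = prefix.len() + idx + delimiter.len_utf8();
            dirs.insert(key[..end].to_string());
        }
    }
    dirs.into_iter().collect()
}
```

For example, with keys like `gefs.20170101/00/a.grib2` and `gefs.20170101/06/a.grib2`, calling `common_prefixes(&keys, "gefs.20170101/", '/')` gives the two hour directories. The work scales with the number of keys rather than the number of directories, which is why pushing the delimiter down to the object store matters for a bucket this large.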

JackKelly commented 1 day ago

Although, in the example "Sample Request for general purpose buckets: Listing keys by using the prefix and delimiter parameters", it looks like AWS ListObjectsV2 can just return a list of directory names!
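
For reference, a small sketch of the query string such a request would carry. `list-type`, `prefix` and `delimiter` are real ListObjectsV2 query parameters; the function names and the example prefix are hypothetical, and request signing, pagination and XML response parsing are all left out.

```rust
/// Percent-encode everything except RFC 3986 unreserved characters.
fn percent_encode(s: &str) -> String {
    let mut out = String::new();
    for b in s.bytes() {
        match b {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'_' | b'.' | b'~' => {
                out.push(b as char)
            }
            _ => out.push_str(&format!("%{:02X}", b)),
        }
    }
    out
}

/// Build the query string for a ListObjectsV2 call that asks the server to
/// group keys below `prefix` by `delimiter`.
pub fn list_objects_v2_query(prefix: &str, delimiter: &str) -> String {
    format!(
        "list-type=2&prefix={}&delimiter={}",
        percent_encode(prefix),
        percent_encode(delimiter)
    )
}
```

With a prefix of `gefs.20170101/` and a delimiter of `/`, the grouped "directory" names should come back in the response's CommonPrefixes elements instead of as individual keys. Each response is still capped in size, so a long list of directories would need the continuation token to page through.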