A cloud optimized Python package for reading HDF5 data stored in S3
h5coro is a pure Python implementation of a subset of the HDF5 specification that has been optimized for reading data out of S3. The project has its roots in the development of an on-demand science data processing system called SlideRule, where a new C++ implementation of the HDF5 specification was developed for performant read access to Earth science datasets stored in AWS S3. Over time, users of SlideRule began requesting the ability to efficiently read HDF5 and NetCDF files out of S3 from their own Python scripts. The result is h5coro: a re-implementation in Python of the core HDF5 reading logic that exists in SlideRule. Since then, h5coro has become its own project, which will continue to grow and diverge in functionality from its parent implementation. For more information on SlideRule and the organization behind h5coro, see https://slideruleearth.io.
h5coro is optimized for reading HDF5 data in high-latency high-throughput environments. It accomplishes this through a few key design decisions:
For a full list of which parts of the HDF5 specification h5coro implements, see the compatibility section at the end of this readme. The major limitations currently present in the package are:
The simplest way to install h5coro is by using the conda package manager.
conda install -c conda-forge h5coro
Alternatively, you can install h5coro using pip.
pip install h5coro
To use h5coro as a backend to xarray, simply install both xarray and h5coro in your current environment. h5coro will automatically be recognized by xarray, so you can use it like any other xarray engine:
import xarray as xr
h5ds = xr.open_dataset("file.h5", engine="h5coro")
You can see what backends are available in xarray using:
xr.backends.list_engines()
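Before opening a file, it can be useful to confirm that the engine was actually registered. The following is a minimal sketch that relies only on the `xr.backends.list_engines()` call shown above. The helper name `has_engine` is hypothetical and is not part of xarray or h5coro; it returns `False` when xarray itself is not installed.

```python
def has_engine(name):
    """Return True if xarray reports a backend engine called `name`.

    Hypothetical helper for illustration; not part of xarray or h5coro.
    """
    try:
        import xarray as xr
    except ImportError:
        # xarray is not installed, so no engine can be available
        return False
    # list_engines() returns a mapping keyed by engine name
    return name in xr.backends.list_engines()


if has_engine("h5coro"):
    print("h5coro can be used with xr.open_dataset(..., engine='h5coro')")
else:
    print("h5coro engine not found; check that xarray and h5coro are both installed")
```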
# (1) import
from h5coro import h5coro, s3driver
# (2) create
h5obj = h5coro.H5Coro(f'{my_bucket}/{path_to_hdf5_file}', s3driver.S3Driver)
# (3) read
datasets = [{'dataset': '/path/to/dataset1', 'hyperslice': []},
            {'dataset': '/path/to/dataset2', 'hyperslice': [324, 374]}]
promise = h5obj.readDatasets(datasets=datasets, block=True)
# (4) display
for variable in promise:
    print(f'{variable}: {promise[variable]}')
`h5coro`: the main module implementing the HDF5 reader object
`s3driver`: the driver used to read HDF5 data from S3
The call to `h5coro.H5Coro` creates a reader object that opens up the HDF5 file, reads the start of the file, and is then ready to accept read requests.
The calling application must have credentials to access the object in the specified S3 bucket. h5coro uses `boto3`, so any credentials supplied via the standard AWS methods will work. If credentials need to be supplied externally, pass a `credentials` argument in the call to `h5coro.H5Coro` as a dictionary with the following three fields: "aws_access_key_id", "aws_secret_access_key", and "aws_session_token".
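As an illustration of the credentials dictionary described above, here is a minimal sketch that builds it from the standard AWS environment variables. The helper `credentials_from_env` is hypothetical and not part of h5coro, and passing the result as a keyword argument named `credentials` is an assumption based on the description above.

```python
import os

# Field names taken from the description above.
CREDENTIAL_FIELDS = ("aws_access_key_id", "aws_secret_access_key", "aws_session_token")


def credentials_from_env(env=os.environ):
    """Build the three-field credentials dictionary from environment variables.

    Hypothetical helper: reads AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and
    AWS_SESSION_TOKEN, and raises KeyError if any of them is missing.
    """
    missing = [field.upper() for field in CREDENTIAL_FIELDS if field.upper() not in env]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {field: env[field.upper()] for field in CREDENTIAL_FIELDS}


# Assumed usage, following the description above:
# h5obj = h5coro.H5Coro(f'{my_bucket}/{path_to_hdf5_file}', s3driver.S3Driver,
#                       credentials=credentials_from_env())
```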
The `H5Coro.readDatasets` function takes a list of dictionary objects that describe the datasets that need to be read in parallel.
If the `block` parameter is set to True, the code waits for all of the datasets to be read before returning; otherwise, the call returns immediately and the code blocks only when one of the datasets is accessed.
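When many datasets are requested, the list of dictionaries can be generated rather than written by hand. The sketch below uses only the two keys shown in the example above, `'dataset'` and `'hyperslice'`. The helper `make_requests` is hypothetical, and it passes hyperslice values through unchanged without interpreting them.

```python
def make_requests(paths, hyperslices=None):
    """Return one {'dataset', 'hyperslice'} dictionary per dataset path.

    Hypothetical helper: `hyperslices` maps a dataset path to the value that
    should be passed through as its hyperslice. Paths without an entry get an
    empty list, as in the example above.
    """
    hyperslices = hyperslices or {}
    return [{'dataset': path, 'hyperslice': list(hyperslices.get(path, []))}
            for path in paths]


datasets = make_requests(
    ['/path/to/dataset1', '/path/to/dataset2'],
    {'/path/to/dataset2': [324, 374]},
)
# The result can then be passed on as in the example above:
# promise = h5obj.readDatasets(datasets=datasets, block=True)
```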
The h5coro promise is a dictionary of `numpy` arrays containing the values of the variables read, along with some additional logic that provides the ability to block while waiting for the data to be populated.
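To make the idea of a dictionary that blocks on access concrete, here is a small standard-library sketch. It is not h5coro's implementation: the class `BlockingResults` and the function `read_in_background` are hypothetical names, and plain Python values stand in for numpy arrays. Each lookup waits on a per-key event that is set once a background thread has stored the value. The sketch does not handle a reader that raises; in that case a lookup would wait indefinitely.

```python
import threading


class BlockingResults:
    """Hypothetical promise-like mapping: a lookup blocks until its value is stored."""

    def __init__(self, names):
        self._values = {}
        self._ready = {name: threading.Event() for name in names}

    def fulfill(self, name, value):
        # Store the value first, then wake up any thread waiting on this name
        self._values[name] = value
        self._ready[name].set()

    def __getitem__(self, name):
        self._ready[name].wait()  # blocks only if the value has not arrived yet
        return self._values[name]

    def __iter__(self):
        return iter(self._ready)


def read_in_background(names, reader):
    """Start one thread per name and return immediately with the results object."""
    results = BlockingResults(names)
    for name in names:
        worker = threading.Thread(
            target=lambda n=name: results.fulfill(n, reader(n)), daemon=True)
        worker.start()
    return results


# Stand-in reader; a real one would fetch and decode the dataset
results = read_in_background(['dataset1', 'dataset2'], lambda name: name.upper())
for variable in results:
    print(f'{variable}: {results[variable]}')
```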
h5coro is licensed under the 3-clause BSD license found in the LICENSE file at the root of this source tree.
We welcome contributions to h5coro from anyone, at any career stage and with any amount of coding experience. You will be recognized for your work by being listed as one of the project contributors.
To request a feature, check the project issues tab to see if it has already been suggested. If not, please submit a new issue describing your requested feature or enhancement, with a clear title and description. Let us know in your description if this is something you would like to contribute to the project.
To report a bug, check the project issues tab to see if the problem has already been reported. If not, please submit a new issue so that we are made aware of the problem. Please provide as much detail as possible in your bug report; detailed information and examples will help us resolve issues faster.
We follow a standard Forking Workflow for code changes and additions. Submitted code goes through a review and comment process by the project maintainers.
Format Element | Supported | Contains | Missing |
---|---|---|---|
Field Sizes | Yes | 1, 2, 4, 8 bytes | |
Superblock | Partial | Version 0, 2 | Version 1, 3 |
Base Address | Yes | ||
B-Tree | Partial | Version 1 | Version 2 |
Group Symbol Table | Yes | Version 1 | |
Local Heap | Yes | Version 0 | |
Global Heap | No | Version 1 | |
Fractal Heap | Yes | Version 0 | |
Shared Object Header Message Table | No | Version 0 | |
Data Object Headers | Yes | Version 1, 2 | |
Shared Message | No | Version 1 | |
NIL Message | Yes | Unversioned | |
Dataspace Message | Yes | Version 1 | |
Link Info Message | Yes | Version 0 | |
Datatype Message | Partial | Version 1 | Version 0, 2, 3 |
Fill Value (Old) Message | No | Unversioned | |
Fill Value Message | Partial | Version 2, 3 | Version 1 |
Link Message | Yes | Version 1 | |
External Data Files Message | No | Version 1 | |
Data Layout Message | Partial | Version 3 | Version 1, 2 |
Bogus Message | No | Unversioned | |
Group Info Message | No | Version 0 | |
Filter Pipeline Message | Yes | Version 1, 2 | |
Attribute Message | Partial | Version 1, 2, 3 | Shared message support for v3 |
Object Comment Message | No | Unversioned | |
Object Modification Time (Old) Message | No | Unversioned | |
Shared Message Table Message | No | Version 0 | |
Object Header Continuation Message | Yes | Version 1, 2 | |
Symbol Table Message | Yes | Unversioned | |
Object Modification Time Message | No | Version 1 | |
B-Tree ‘K’ Value Message | No | Version 0 | |
Driver Info Message | No | Version 0 | |
Attribute Info Message | No | Version 0 | |
Object Reference Count Message | No | Version 0 | |
Compact Storage | Yes | ||
Contiguous Storage | Yes | ||
Chunked Storage | Yes | ||
Fixed Point Type | Yes | ||
Floating Point Type | Yes | ||
Time Type | No | ||
String Type | Yes | ||
Bit Field Type | No | ||
Opaque Type | No | ||
Compound Type | No | ||
Reference Type | No | ||
Enumerated Type | No | ||
Variable Length Type | No | ||
Array Type | No | ||
Deflate Filter | Yes | ||
Shuffle Filter | Yes | ||
Fletcher32 Filter | No | ||
Szip Filter | No | ||
Nbit Filter | No | ||
Scale Offset Filter | No |
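
As a quick way to relate a file to the Superblock row of the table, the sketch below reads the superblock version from the first bytes of an HDF5 file. It relies on two details of the HDF5 file format: the superblock begins with an 8-byte signature that may sit at offset 0, 512, 1024, 2048, and so on, and the version number is the byte immediately after the signature. The function name and the `max_offset` limit are hypothetical; h5coro does not necessarily expose such a check.

```python
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"
SUPPORTED_SUPERBLOCK_VERSIONS = {0, 2}  # taken from the table above


def superblock_version(data, max_offset=1 << 20):
    """Return the superblock version found in `data`, or None if no signature is found.

    Hypothetical helper: `data` holds the leading bytes of a file, and candidate
    offsets 0, 512, 1024, ... are checked up to `max_offset`.
    """
    offset = 0
    while offset <= max_offset and offset + 9 <= len(data):
        if data[offset:offset + 8] == HDF5_SIGNATURE:
            return data[offset + 8]  # the version byte follows the signature
        offset = 512 if offset == 0 else offset * 2
    return None


# Usage with a local file (the path is a placeholder):
# with open("file.h5", "rb") as f:
#     version = superblock_version(f.read(4096))
# print(version in SUPPORTED_SUPERBLOCK_VERSIONS)
```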