SlideRuleEarth / h5coro

The HDF5 Cloud Optimized Read Only Python Package
BSD 3-Clause "New" or "Revised" License

h5coro

A cloud optimized Python package for reading HDF5 data stored in S3

Origin and Purpose

h5coro is a pure Python implementation of a subset of the HDF5 specification that has been optimized for reading data out of S3. The project has its roots in the development of an on-demand science data processing system called SlideRule, where a new C++ implementation of the HDF5 specification was developed for performant read access to Earth science datasets stored in AWS S3. Over time, users of SlideRule began requesting the ability to read HDF5 and NetCDF files out of S3 efficiently from their own Python scripts. The result is h5coro: a re-implementation in Python of the core HDF5 reading logic that exists in SlideRule. Since then, h5coro has become its own project, which will continue to grow and diverge in functionality from its parent implementation. For more information on SlideRule and the organization behind h5coro, see https://slideruleearth.io.

h5coro is optimized for reading HDF5 data in high-latency, high-throughput environments. It accomplishes this through a few key design decisions:

Limitations

For a full list of which parts of the HDF5 specification h5coro implements, see the compatibility section at the end of this readme. The major limitations currently present in the package are:

Installation

The simplest way to install h5coro is by using the conda package manager.

    conda install -c conda-forge h5coro

Alternatively, you can also install h5coro using pip.

    pip install h5coro

xarray backend

To use h5coro as a backend to xarray, simply install both xarray and h5coro in your current environment. h5coro will automatically be recognized by xarray, so you can use it like any other xarray engine:

    import xarray as xr
    h5ds = xr.open_dataset("file.h5", engine="h5coro")

You can see what backends are available in xarray using:

    xr.backends.list_engines()
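
As a quick sanity check before opening a file, the registry returned by `list_engines()` can be tested for the `h5coro` key. The helper below is a minimal sketch; the name `h5coro_engine_available` is hypothetical and is not part of xarray or h5coro.

```python
def h5coro_engine_available() -> bool:
    """Return True if xarray is installed and lists an engine named 'h5coro'."""
    try:
        import xarray as xr
    except ImportError:
        # xarray itself is missing, so no engine can be registered
        return False
    # list_engines() returns a dictionary keyed by engine name
    return "h5coro" in xr.backends.list_engines()


print("h5coro engine available:", h5coro_engine_available())
```

If this reports `False` while both packages are installed, they are most likely installed in different environments.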

Example Usage

    # (1) import
    from h5coro import h5coro, s3driver

    # (2) create
    # my_bucket and path_to_hdf5_file are placeholders for your own
    # bucket name and object key; define them before this call
    h5obj = h5coro.H5Coro(f'{my_bucket}/{path_to_hdf5_file}', s3driver.S3Driver)

    # (3) read
    datasets = [{'dataset': '/path/to/dataset1', 'hyperslice': []},
                {'dataset': '/path/to/dataset2', 'hyperslice': [324, 374]}]
    promise = h5obj.readDatasets(datasets=datasets, block=True)

    # (4) display
    for variable in promise:
        print(f'{variable}: {promise[variable]}')

(1) Importing h5coro

h5coro: the main module implementing the HDF5 reader object

s3driver: the driver used to read HDF5 data from S3

(2) Create h5coro Object

The call to h5coro.H5Coro creates a reader object that opens up the HDF5 file, reads the start of the file, and is then ready to accept read requests.

The calling application must have credentials to access the object in the specified S3 bucket. h5coro uses boto3, so any credentials supplied via the standard AWS methods will work. If credentials need to be supplied externally, then in the call to h5coro.H5Coro pass in an argument credentials as a dictionary with the following three fields: "aws_access_key_id", "aws_secret_access_key", "aws_session_token".
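
When credentials have to be supplied externally, one option is to build the three-field dictionary from the standard AWS environment variables. The following is a sketch under that assumption; `credentials_from_env` is a hypothetical helper, not part of h5coro, and the commented call at the end only mirrors the `credentials` argument described above.

```python
import os

# Field names taken from the paragraph above
CREDENTIAL_FIELDS = ("aws_access_key_id", "aws_secret_access_key", "aws_session_token")


def credentials_from_env(env=os.environ):
    """Build the credentials dictionary from AWS_ACCESS_KEY_ID,
    AWS_SECRET_ACCESS_KEY and AWS_SESSION_TOKEN.

    Raises KeyError if any of the three variables is missing.
    """
    # Each dictionary key maps to the upper-case environment variable name
    return {field: env[field.upper()] for field in CREDENTIAL_FIELDS}


# Hypothetical usage, following the description above:
# h5obj = h5coro.H5Coro(resource, s3driver.S3Driver,
#                       credentials=credentials_from_env())
```

Failing early with a `KeyError` keeps a missing session token from surfacing later as a less obvious access error.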

(3) Read with h5coro Object

The H5Coro.readDatasets function takes a list of dictionary objects that describe the datasets to be read in parallel.
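
For longer request lists it can be convenient to generate the dictionaries rather than type them out. The sketch below assumes only the two keys shown in the example, `dataset` and `hyperslice`; `make_requests` is a hypothetical helper and not part of h5coro.

```python
def make_requests(paths, hyperslices=None):
    """Build the list of dataset dictionaries in the shape used in step (3).

    paths:       iterable of dataset paths inside the HDF5 file
    hyperslices: optional dict mapping a path to its 'hyperslice' value;
                 paths not listed get an empty list, as in the example
    """
    hyperslices = hyperslices or {}
    return [{"dataset": path, "hyperslice": list(hyperslices.get(path, []))}
            for path in paths]


datasets = make_requests(
    ["/path/to/dataset1", "/path/to/dataset2"],
    hyperslices={"/path/to/dataset2": [324, 374]},
)
```

The resulting `datasets` list has the same contents as the hand-written one in the example and can be passed as the `datasets` argument in the same way.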

If the block parameter is set to True, then the code will wait for all of the datasets to be read before returning; otherwise, the code will return immediately and will not block until one of the datasets is accessed.

(4) Display the Datasets

The h5coro promise is a dictionary of numpy arrays containing the values of the variables read, along with some additional logic that provides the ability to block while waiting for the data to be populated.
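
To make the blocking behavior concrete, here is a standard-library sketch of a dictionary-like object whose lookups wait until the value has been populated. It only illustrates the idea and is not h5coro's implementation; `LazyResults`, `read_all` and `slow_fetch` are hypothetical names, and plain lists stand in for numpy arrays.

```python
import threading
import time


class LazyResults:
    """Dictionary-like container whose lookups wait until the value arrives."""

    def __init__(self, names):
        self._values = {}
        self._ready = {name: threading.Event() for name in names}

    def _store(self, name, value):
        # Called by a worker thread once a dataset has been read (or failed)
        self._values[name] = value
        self._ready[name].set()

    def __getitem__(self, name):
        # Block the caller until the requested dataset is populated
        self._ready[name].wait()
        value = self._values[name]
        if isinstance(value, Exception):
            raise value
        return value

    def __iter__(self):
        return iter(self._ready)


def read_all(names, fetch, block=True):
    """Start one thread per dataset; wait for all of them only if block is True."""
    names = list(names)
    results = LazyResults(names)

    def worker(name):
        try:
            results._store(name, fetch(name))
        except Exception as exc:
            # Store the error so a waiting reader is released and sees it
            results._store(name, exc)

    threads = [threading.Thread(target=worker, args=(n,)) for n in names]
    for thread in threads:
        thread.start()
    if block:
        for thread in threads:
            thread.join()
    return results


def slow_fetch(name):
    time.sleep(0.05)      # stand-in for a high-latency read
    return [len(name)]    # stand-in for the array of values


promise = read_all(["/a", "/bb"], slow_fetch, block=False)  # returns immediately
for variable in promise:
    print(f"{variable}: {promise[variable]}")  # each lookup waits for its data
```

With `block=False` the caller can do other work between issuing the request and touching the results, which is the pattern the block parameter enables.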

Licensing

h5coro is licensed under the 3-clause BSD license found in the LICENSE file at the root of this source tree.

Contribute

We welcome and invite contributions from anyone at any career stage and with any amount of coding experience towards the development of h5coro. We appreciate any and all contributions made towards the development of the project. You will be recognized for your work by being listed as one of the project contributors.

Ways to Contribute

Requesting a Feature

Check the project issues tab to see if the feature has already been suggested. If not, please submit a new issue describing your requested feature or enhancement. Please give your feature request both a clear title and description. Please let us know in your description if this is something you would like to contribute to the project.

Reporting a Bug

Check the project issues tab to see if the problem has already been reported. If not, please submit a new issue so that we are made aware of the problem. Please provide as much detail as possible when writing the description of your bug report. Providing detailed information and examples will help us resolve issues faster.

Contributing Code or Examples

We follow a standard Forking Workflow for code changes and additions. Submitted code goes through a review and comment process by the project maintainers.

General Guidelines

Steps to Contribute

Compatibility

| Format Element | Supported | Contains | Missing |
|---|---|---|---|
| Field Sizes | Yes | 1, 2, 4, 8 bytes | |
| Superblock | Partial | Version 0, 2 | Version 1, 3 |
| Base Address | Yes | | |
| B-Tree | Partial | Version 1 | Version 2 |
| Group Symbol Table | Yes | Version 1 | |
| Local Heap | Yes | Version 0 | |
| Global Heap | No | | Version 1 |
| Fractal Heap | Yes | Version 0 | |
| Shared Object Header Message Table | No | | Version 0 |
| Data Object Headers | Yes | Version 1, 2 | |
| Shared Message | No | | Version 1 |
| NIL Message | Yes | Unversioned | |
| Dataspace Message | Yes | Version 1 | |
| Link Info Message | Yes | Version 0 | |
| Datatype Message | Partial | Version 1 | Version 0, 2, 3 |
| Fill Value (Old) Message | No | | Unversioned |
| Fill Value Message | Partial | Version 2, 3 | Version 1 |
| Link Message | Yes | Version 1 | |
| External Data Files Message | No | | Version 1 |
| Data Layout Message | Partial | Version 3 | Version 1, 2 |
| Bogus Message | No | | Unversioned |
| Group Info Message | No | | Version 0 |
| Filter Pipeline Message | Yes | Version 1, 2 | |
| Attribute Message | Partial | Version 1, 2, 3 | Shared message support for v3 |
| Object Comment Message | No | | Unversioned |
| Object Modification Time (Old) Message | No | | Unversioned |
| Shared Message Table Message | No | | Version 0 |
| Object Header Continuation Message | Yes | Version 1, 2 | |
| Symbol Table Message | Yes | Unversioned | |
| Object Modification Time Message | No | | Version 1 |
| B-Tree ‘K’ Value Message | No | | Version 0 |
| Driver Info Message | No | | Version 0 |
| Attribute Info Message | No | | Version 0 |
| Object Reference Count Message | No | | Version 0 |
| Compact Storage | Yes | | |
| Contiguous Storage | Yes | | |
| Chunked Storage | Yes | | |
| Fixed Point Type | Yes | | |
| Floating Point Type | Yes | | |
| Time Type | No | | |
| String Type | Yes | | |
| Bit Field Type | No | | |
| Opaque Type | No | | |
| Compound Type | No | | |
| Reference Type | No | | |
| Enumerated Type | No | | |
| Variable Length Type | No | | |
| Array Type | No | | |
| Deflate Filter | Yes | | |
| Shuffle Filter | Yes | | |
| Fletcher32 Filter | No | | |
| Szip Filter | No | | |
| Nbit Filter | No | | |
| Scale Offset Filter | No | | |