NCAS-CMS / pyfive

A pure Python HDF5 file reader
BSD 3-Clause "New" or "Revised" License

Respect H5Py's interface for getting chunk addresses #4

Open bnlawrence opened 6 months ago

bnlawrence commented 6 months ago

H5Py has several very useful methods for working with chunks that it would be helpful to support in pyfive.

On a dataset:

- `iter_chunks`
- `id`

On an id:

- `get_num_chunks`
- `get_chunk_info` (returns a `StoreInfo`)
- `read_direct_chunk`

`id`

It is not properly documented; you need to look at its docstring:

    Represents an HDF5 dataset identifier.

    Objects of this class may be used in any HDF5 function which expects a dataset identifier.  
    Also, all H5D* functions which take a dataset  instance as their first argument are presented as methods of this class.

    Properties:
    - dtype:  Numpy dtype representing the dataset type
    - shape:  Numpy-style shape tuple representing the dataspace
    - rank:   Integer giving dataset rank
    - Hashable: Yes, unless anonymous
    - equality: True HDF5 identity if unless anonymous

StoreInfo

Looks like a CPython class, but the attributes are what is important, e.g. `StoreInfo(chunk_offset=(1, 0, 0), filter_mask=0, byte_offset=22805518, size=2593627)`; all attributes are obtained directly. Probably implement this as a namedtuple in PyFive?
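A minimal sketch of what that could look like in pyfive, assuming we simply mirror the four attribute names seen on h5py's `StoreInfo`:

```python
from collections import namedtuple

# Hypothetical pyfive stand-in for h5py's StoreInfo: a plain namedtuple
# carrying the same four attributes, accessed the same way.
StoreInfo = namedtuple(
    "StoreInfo", ["chunk_offset", "filter_mask", "byte_offset", "size"]
)

info = StoreInfo(
    chunk_offset=(1, 0, 0), filter_mask=0, byte_offset=22805518, size=2593627
)
assert info.byte_offset == 22805518
assert info.size == 2593627
```

This keeps attribute access identical to h5py's while staying pure Python.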

read_direct_chunk(offsets, PropID dxpl=None, out=None)

Reads data to a bytes array directly from a chunk at position specified by the offsets argument and bypasses any filters HDF5 would normally apply to the written data. However, the written data may be compressed or not.

Returns a tuple containing the filter_mask and the raw data storing this chunk as bytes if out is None, else as a memoryview.

filter_mask is a bit field of up to 32 values. It records which filters have been applied to this chunk, of the filter pipeline defined for that dataset. Each bit set to 1 means that the filter in the corresponding position in the pipeline was not applied to compute the raw data. So the default value of 0 means that all defined filters have been applied to the raw data.
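The bit-field semantics above can be sketched in a few lines; the pipeline names here are purely illustrative:

```python
def skipped_filters(filter_mask, pipeline):
    # Bit i set to 1 means pipeline[i] was NOT applied to this chunk,
    # so mask 0 means every defined filter was applied.
    return [f for i, f in enumerate(pipeline) if filter_mask & (1 << i)]

pipeline = ["shuffle", "gzip"]  # hypothetical dataset filter pipeline
assert skipped_filters(0, pipeline) == []           # default: all filters applied
assert skipped_filters(0b10, pipeline) == ["gzip"]  # second filter skipped
```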

If the out argument is not None, it must be a writeable contiguous 1D array-like of bytes (e.g., bytearray or numpy.ndarray) and large enough to contain the whole chunk.
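If pyfive mimics this, it will want the same validation. A sketch of those constraints (the helper name `check_out_buffer` is made up for illustration):

```python
def check_out_buffer(out, chunk_size):
    # Mirror the constraints read_direct_chunk places on `out`:
    # a writeable, contiguous, 1D bytes-like large enough for the chunk.
    view = memoryview(out)
    if view.readonly:
        raise TypeError("out must be writeable")
    if view.ndim != 1 or not view.contiguous:
        raise TypeError("out must be a contiguous 1D buffer")
    if view.nbytes < chunk_size:
        raise ValueError("out is too small for the chunk")
    return view

buf = bytearray(1024)
assert check_out_buffer(buf, 100).nbytes == 1024
```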

Some usage (from a writeup of a talk by Graeme Winter):

```python
import zlib

import h5py
import numpy

off_size = []

with h5py.File("data.h5", "r") as f:
    p = f["data"]
    i = p.id
    n = i.get_num_chunks()

    # Collect the (byte_offset, size) pair for every chunk.
    for j in range(n):
        c = i.get_chunk_info(j)
        off_size.append((c.byte_offset, c.size))

    # Read each chunk's raw bytes, bypassing the filter pipeline,
    # then decompress by hand (the dataset is gzip-compressed uint16).
    for j in range(n):
        _, c = i.read_direct_chunk((j, 0, 0))
        b = zlib.decompress(c)
        a = numpy.frombuffer(b, dtype=numpy.uint16)
        assert (a == j).all()
```
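The point of collecting those `(byte_offset, size)` pairs is that the chunks can then be read without HDF5 at all. A stdlib-only sketch (the function name is hypothetical, and a scratch file stands in for a real HDF5 file):

```python
import os
import tempfile
import zlib

def read_chunk_raw(path, byte_offset, size):
    # Read one chunk's raw (still-compressed) bytes straight from the
    # file, bypassing the HDF5 library entirely: exactly what the
    # (byte_offset, size) pairs from get_chunk_info make possible.
    with open(path, "rb") as f:
        f.seek(byte_offset)
        return f.read(size)

# Stand-in for an HDF5 file: a scratch file with one gzip'd chunk at a
# known byte offset, mimicking one StoreInfo (byte_offset, size) pair.
payload = zlib.compress(bytes(range(64)))
with tempfile.NamedTemporaryFile(delete=False) as tf:
    tf.write(b"\x00" * 512)  # pretend HDF5 metadata precedes the chunk
    tf.write(payload)
    path = tf.name

raw = read_chunk_raw(path, 512, len(payload))
assert zlib.decompress(raw) == bytes(range(64))
os.unlink(path)
```

This is the access pattern a pure Python reader like pyfive could offer once it exposes chunk addresses.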