H5Py has several very useful methods for working with chunks that it would be helpful to support in pyfive:
On a dataset:
iter_chunks(sel=None)
id
On an id:
get_num_chunks() (returns an integer)
get_chunk_info(index) (index in chunk order, returns a StoreInfo instance)
get_chunk_info_by_coord(tuple) (tuple is the index along each coordinate of the array, returns a StoreInfo instance)
read_direct_chunk(offsets, PropID dxpl=None, out=None) (returns the filter_mask and the raw chunk bytes; see below)
Documentation
iter_chunks
Iterate over chunks in a chunked dataset. The optional sel argument is a slice or tuple of slices that defines the region to be used. If not set, the entire dataspace will be used for the iterator.
For each chunk within the given region, the iterator yields a tuple of slices that gives the intersection of the given chunk with the selection area. This can be used to read or write data in that chunk.
A TypeError will be raised if the dataset is not chunked.
A ValueError will be raised if the selection region is invalid.
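For instance, a minimal sketch of how this might be used (assuming a chunked dataset named "data" in a file data.h5; both names are made up for illustration):

```python
import h5py

with h5py.File("data.h5", "r") as f:
    dset = f["data"]
    # Each s is a tuple of slices: the intersection of one chunk
    # with the selection (here, the whole dataspace)
    for s in dset.iter_chunks():
        block = dset[s]  # reads exactly one chunk's worth of data
        print(s, block.shape)
```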
id
It is not properly documented; you need to look at its docstring:
Represents an HDF5 dataset identifier.
Objects of this class may be used in any HDF5 function which expects a dataset identifier.
Also, all H5D* functions which take a dataset instance as their first argument are presented as methods of this class.
Properties:
- dtype: Numpy dtype representing the dataset type
- shape: Numpy-style shape tuple representing the dataspace
- rank: Integer giving dataset rank
- Hashable: Yes, unless anonymous
- Equality: True HDF5 identity unless anonymous
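By way of illustration, here is how the chunk-related id methods fit together (again assuming a hypothetical 3-D chunked dataset "data" in data.h5):

```python
import h5py

with h5py.File("data.h5", "r") as f:
    dsid = f["data"].id
    print(dsid.get_num_chunks())
    print(dsid.get_chunk_info(0))  # first chunk in chunk order
    # The chunk containing the array element at coordinate (0, 0, 0)
    print(dsid.get_chunk_info_by_coord((0, 0, 0)))
```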
StoreInfo
Looks like a Cython class, but the attributes are what is important, e.g. StoreInfo(chunk_offset=(1, 0, 0), filter_mask=0, byte_offset=22805518, size=2593627); all attributes are obtained directly. Probably implement this as a namedtuple in pyfive? A sketch follows below.
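A minimal sketch of what that could look like; the field names simply mirror the h5py attributes above, and none of this is existing pyfive code:

```python
from collections import namedtuple

# Hypothetical pyfive equivalent of h5py's StoreInfo
StoreInfo = namedtuple(
    "StoreInfo", ["chunk_offset", "filter_mask", "byte_offset", "size"]
)

info = StoreInfo(chunk_offset=(1, 0, 0), filter_mask=0,
                 byte_offset=22805518, size=2593627)
print(info.byte_offset, info.size)  # attributes obtained directly
```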
read_direct_chunk(offsets, PropID dxpl=None, out=None)
Reads data to a bytes array directly from a chunk at the position specified by the offsets argument, bypassing any filters HDF5 would normally apply to the written data. However, the written data may be compressed or not.
Returns a tuple containing the filter_mask and the raw data storing this chunk as bytes if out is None, else as a memoryview.
filter_mask is a bit field of up to 32 values. It records which filters have been applied to this chunk, of the filter pipeline defined for that dataset. Each bit set to 1 means that the filter in the corresponding position in the pipeline was not applied to compute the raw data. So the default value of 0 means that all defined filters have been applied to the raw data.
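In code terms, the test for a single pipeline position is just a bit check (a hypothetical helper for illustration, not part of h5py or pyfive):

```python
def filter_skipped(filter_mask: int, position: int) -> bool:
    # Bit `position` set to 1 means the filter at that position in the
    # dataset's pipeline was NOT applied to this chunk's raw data
    return bool(filter_mask & (1 << position))

assert filter_skipped(0b10, 1)   # second filter was skipped
assert not filter_skipped(0, 0)  # default 0: all filters applied
```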
If the out argument is not None, it must be a writeable contiguous 1D array-like of bytes (e.g., bytearray or numpy.ndarray) and large enough to contain the whole chunk.
Some usage (from a writeup of a talk by Graeme Winter):

```python
import zlib

import h5py
import numpy

with h5py.File("data.h5", "r") as f:
    p = f["data"]
    i = p.id
    n = i.get_num_chunks()
    # Record the byte offset and size of every chunk in the file
    off_size = []
    for j in range(n):
        c = i.get_chunk_info(j)
        off_size.append((c.byte_offset, c.size))
    # Read each chunk raw (still compressed) and decompress it by hand;
    # here chunk j starts at element coordinate (j, 0, 0)
    for j in range(n):
        _, c = i.read_direct_chunk((j, 0, 0))
        b = zlib.decompress(c)
        a = numpy.frombuffer(b, dtype=numpy.uint16)
        assert (a == j).all()
```
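And a small sketch of the out variant described above (assuming the same hypothetical data.h5 file; an untested illustration):

```python
import h5py

with h5py.File("data.h5", "r") as f:
    dsid = f["data"].id
    info = dsid.get_chunk_info(0)
    # Preallocate a writeable, contiguous buffer big enough for the chunk
    buf = bytearray(info.size)
    filter_mask, view = dsid.read_direct_chunk(info.chunk_offset, out=buf)
    assert isinstance(view, memoryview)  # the raw bytes now live in buf
```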