HDFGroup / h5pyd

h5py distributed - Python client library for HDF Rest API

Should there be a chunk iterator for writing datasets using 'create_dataset'? #88

Open jbhatch opened 4 years ago

jbhatch commented 4 years ago

When writing an HDF5 file to the HSDS with H5PYD, chunks are created in the final output file, but the initial write of the data appears to happen contiguously. This sometimes produced interruptions (HTTP request errors) when writing large, ~GB-size HDF5 files with H5PYD to the HSDS, despite each HSDS data node having more than enough memory. Writing smaller, ~MB-size files was hit and miss, and ~KB-size files had no issues. The 3D datasets in the HDF5 files used in these tests (~GB, ~MB, and ~KB sizes) were filled with random 3D numpy arrays.
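For reference, a minimal sketch of the kind of write that triggers the problem (the domain path and dataset shape here are illustrative, not from the original tests):

    import numpy as np
    import h5pyd

    # Illustrative ~1 GB float64 3D array (128 * 1024 * 1024 * 8 bytes)
    data = np.random.rand(128, 1024, 1024)

    # Domain path is a placeholder for an HSDS deployment
    with h5pyd.File("/home/test_user/chunk_test.h5", "w") as f:
        # Passing data= initializes the dataset in one call, which (before the
        # suggested fix) issues a single large write request to the server
        dset = f.create_dataset("data", data=data)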

To use the H5PYD ChunkIterator in create_dataset, the following fix is suggested:

Add the line below to the import statements in the group.py file in h5pyd/_hl:

from h5pyd._apps.chunkiter import ChunkIterator

In the group.py file under h5pyd/_hl, change lines 334-336 from this:

    if data is not None:
        self.log.info("initialize data")
        dset[...] = data

to this:

    if data is not None:
        self.log.info("initialize data")
        # dset[...] = data
        it = ChunkIterator(dset)
        for chunk in it:
            dset[chunk] = data[chunk]
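
For anyone who would rather not patch group.py, roughly the same effect can be had from user code by creating the dataset empty and filling it chunk by chunk. A sketch based on the snippet above (file and dataset names are illustrative):

    import numpy as np
    import h5pyd
    from h5pyd._apps.chunkiter import ChunkIterator

    data = np.random.rand(64, 512, 512)

    with h5pyd.File("/home/test_user/chunk_test.h5", "w") as f:
        # create the dataset without data= so no bulk write happens here
        dset = f.create_dataset("data", shape=data.shape, dtype=data.dtype)
        # write one chunk-aligned slice per request
        for chunk in ChunkIterator(dset):
            dset[chunk] = data[chunk]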
jreadey commented 4 years ago

In the h5pyd dataset.py, that's a good solution for initializing the dataset.

There's a max request size limit (it defaults to 100 MB), so the server will respond with a 413 error if you try to write more than that much data in one request. I don't know whether that explains the problems you had writing the larger datasets.

I'd been planning to make changes that would paginate large writes - basically, have the code for dset[...] = data send multiple requests to the server if the data is too large. Read operations are already handled this way. Your approach would be easier to implement since it only needs to deal with dataset initialization. Have you tried making this change yourself?
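
For concreteness, a rough sketch of what such pagination might look like on the client side, splitting a large write into slabs along the first axis so each request stays under the limit. The 100 MB figure comes from the comment above; the function and variable names are illustrative, not the eventual h5pyd implementation:

    import numpy as np

    MAX_REQUEST_SIZE = 100 * 1024 * 1024  # default limit mentioned above, in bytes

    def paginated_write(dset, data):
        """Write `data` into `dset` in slabs along axis 0 so that each
        request payload stays under MAX_REQUEST_SIZE."""
        data = np.asarray(data)
        row_bytes = data[0].nbytes if data.shape[0] else 0
        if data.nbytes <= MAX_REQUEST_SIZE or row_bytes == 0:
            dset[...] = data  # small enough for a single request
            return
        # largest number of rows whose payload fits within the limit
        # (if a single row exceeds the limit, this still sends one row per request)
        rows_per_request = max(1, MAX_REQUEST_SIZE // row_bytes)
        for start in range(0, data.shape[0], rows_per_request):
            stop = min(start + rows_per_request, data.shape[0])
            dset[start:stop, ...] = data[start:stop, ...]

Chunk-aligned iteration, as in the ChunkIterator approach above, would likely be friendlier to the server, since each request then maps onto whole chunks rather than arbitrary slabs.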