ELEKTRONN / elektronn3

A PyTorch-based library for working with 3D and 2D convolutional neural networks, with focus on semantic segmentation of volumetric biomedical image data
MIT License

Random HDF5 read errors #12

Closed: mdraw closed this issue 5 years ago

mdraw commented 6 years ago

Once in a while, the data loaders (especially the validation data loader) hit a random read error when slicing from HDF5 files at https://github.com/ELEKTRONN/elektronn3/blob/e4dff1b9b9c44794a6ecc2c3fcf440f047451367/elektronn3/data/utils.py#L44. The end of the traceback looks like this:

    [...]
        self.id.read(mspace, fspace, arr, mtype, dxpl=self._dxpl)
      File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
      File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
      File "h5py/h5d.pyx", line 181, in h5py.h5d.DatasetID.read
      File "h5py/_proxy.pyx", line 130, in h5py._proxy.dset_rw
      File "h5py/_proxy.pyx", line 84, in h5py._proxy.H5PY_H5Dread
    OSError: Can't read data (wrong B-tree signature)

Attempting to read from the same source coordinates again usually works, which is why the read is wrapped in a retry block and the error doesn't affect training. It's still very annoying to have this issue.
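For readers unfamiliar with the workaround, here is a minimal sketch of what such a retry block could look like. It is not the actual elektronn3 code: the helper name `read_with_retry` and its `max_retries` and `delay` parameters are hypothetical, and the sketch only assumes that the failing read raises `OSError`, as in the traceback above.

```python
import logging
import time
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


def read_with_retry(read_fn: Callable[[], T], max_retries: int = 5, delay: float = 0.0) -> T:
    """Call read_fn and retry on OSError (hypothetical helper, not the elektronn3 API).

    read_fn is any zero-argument callable that performs the read,
    e.g. a lambda that slices an HDF5 dataset.
    """
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    for attempt in range(1, max_retries + 1):
        try:
            return read_fn()
        except OSError as exc:
            if attempt == max_retries:
                # Out of attempts: re-raise the last error unchanged.
                raise
            logger.warning("Read failed (attempt %d/%d): %s", attempt, max_retries, exc)
            time.sleep(delay)
    raise AssertionError("unreachable")  # the loop always returns or raises
```

With an open h5py dataset `dset`, a call might look like `read_with_retry(lambda: dset[z0:z1, y0:y1, x0:x1])`, where the slice bounds are placeholders. Logging each failed attempt keeps the intermittent errors visible instead of silently hiding them.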

Quoting from https://github.com/ELEKTRONN/elektronn3/commit/0ed440886e426774def66f0eacf6a7f6225ca883:

Since the errors are not deterministic, I guess they are either caused by a concurrency issue in PyTorch's DataLoader, in HDF5/h5py or maybe it's even a filesystem issue. (One of the error messages can be found in the commit message of e1a55ed.)

mdraw commented 5 years ago

Resolved with 4039fcc.