ryan-williams opened this issue 6 years ago
I haven't tried specifically with h5py, but we have seen success using xarray to open netCDF/HDF files via gcsfuse. You could turn on logging to see what's going on.
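For reference, a minimal sketch of one way to turn that logging on (the logger name is taken from the output below; adjust the level as needed):

import logging

# "gcsfs.gcsfuse" is the logger that produces the read/cache messages shown below
logging.basicConfig()
logging.getLogger("gcsfs.gcsfuse").setLevel(logging.DEBUG)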
Ah, thanks for the info!
Here's the output I see with logging on; relevant last bit:
DEBUG:gcsfs.gcsfuse:read(args=('/hca/immune-cell-census/ica_cord_blood_h5.h5', 4096, 67108864, 0), kwargs={})
INFO:gcsfs.gcsfuse:read #0 (ll-sc-data/hca/immune-cell-census/ica_cord_blood_h5.h5) offset: 67108864, size: 4096
INFO:gcsfs.gcsfuse:cache miss
DEBUG:gcsfs.gcsfuse:read(args=('/hca/immune-cell-census/ica_cord_blood_h5.h5', 4035, 67108925, 0), kwargs={})
INFO:gcsfs.gcsfuse:read #0 (ll-sc-data/hca/immune-cell-census/ica_cord_blood_h5.h5) offset: 67108925, size: 4035
INFO:gcsfs.gcsfuse:cache hit
DEBUG:gcsfs.gcsfuse:read(args=('/hca/immune-cell-census/ica_cord_blood_h5.h5', 4096, 134217728, 0), kwargs={})
INFO:gcsfs.gcsfuse:read #0 (ll-sc-data/hca/immune-cell-census/ica_cord_blood_h5.h5) offset: 134217728, size: 4096
INFO:gcsfs.gcsfuse:cache miss
DEBUG:gcsfs.gcsfuse:read(args=('/hca/immune-cell-census/ica_cord_blood_h5.h5', 4035, 134217789, 0), kwargs={})
INFO:gcsfs.gcsfuse:read #0 (ll-sc-data/hca/immune-cell-census/ica_cord_blood_h5.h5) offset: 134217789, size: 4035
INFO:gcsfs.gcsfuse:cache hit
DEBUG:gcsfs.gcsfuse:read(args=('/hca/immune-cell-census/ica_cord_blood_h5.h5', 4096, 268435456, 0), kwargs={})
INFO:gcsfs.gcsfuse:read #0 (ll-sc-data/hca/immune-cell-census/ica_cord_blood_h5.h5) offset: 268435456, size: 4096
INFO:gcsfs.gcsfuse:cache miss
DEBUG:gcsfs.gcsfuse:read(args=('/hca/immune-cell-census/ica_cord_blood_h5.h5', 4035, 268435517, 0), kwargs={})
INFO:gcsfs.gcsfuse:read #0 (ll-sc-data/hca/immune-cell-census/ica_cord_blood_h5.h5) offset: 268435517, size: 4035
INFO:gcsfs.gcsfuse:cache hit
DEBUG:gcsfs.gcsfuse:release(args=('/hca/immune-cell-census/ica_cord_blood_h5.h5', 0), kwargs={})
INFO:gcsfs.gcsfuse:close #0 (ll-sc-data/hca/immune-cell-census/ica_cord_blood_h5.h5)
It seems to be in a loop: a 4096-byte read at a power-of-two offset (cache miss), then a ~4KB read just past it (cache hit), with the offset doubling each time (64MB, 128MB, 256MB).
The file in question is 437788306 bytes (418MB), so this process is ending at the 256MB mark; still not obvious why I end up with OSError: Unable to open file (file signature not found), but will report back if I figure it out.
Another confusing bit is that after an initial cache miss at offset 0, I see cache hits all the way up until 64MB, though I thought it should only be caching 2MB at a time. Probably unrelated, but lmk if it's clear what I'm missing there.
Thanks again.
The caching is a subtle thing, which I hope is clear in the gcsfuse code - there is a special case for the start of the file, since that metadata tends to get accessed a lot for all operations. So we keep the start of the file in memory regardless, but use read-ahead elsewhere. This is far from the biggest file that gcsfuse has been used for - have you tried opening with xarray? Those were generally netCDF files, but the way the driver gets used may be different enough that things work for you.
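For what it's worth, a minimal sketch of the xarray route over a gcsfuse mount; the mount point and engine here are assumptions, not something tested on this file:

import xarray as xr

# /mnt/gcs is a placeholder for wherever the bucket is mounted via gcsfuse
ds = xr.open_dataset(
    "/mnt/gcs/hca/immune-cell-census/ica_cord_blood_h5.h5",
    engine="h5netcdf",  # assumes the h5netcdf backend is installed
)
print(ds)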
Having the same issue here. Is there a solution for this? I'm using some existing pipelines with h5py.
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/h5d.pyx in h5py.h5d.DatasetID.read()
h5py/_proxy.pyx in h5py._proxy.dset_rw()
h5py/_proxy.pyx in h5py._proxy.H5PY_H5Dread()
OSError: Can't read data (file read failed: time = Thu Feb 20 18:25:04 2020
, filename = '/Users/shengh4/data/.../annotations.h5', file descriptor = 58, errno = 57, error message = 'Socket is not connected', buf = 0x7fd84eff0e00, total read size = 1411, bytes this sub-read = 1411, bytes actually read = 18446744073709551615, offset = 623200228)
With a newer version of h5py, you shouldn't need to use FUSE at all, just open the file directly
of = gcs.open('filename', 'rb')
h = h5py.File(of)
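A more complete sketch of the same idea, assuming gcsfs is installed and h5py is new enough (>= 2.9) to accept file-like objects; the bucket/path is a placeholder:

import gcsfs
import h5py

gcs = gcsfs.GCSFileSystem()                # default credentials
with gcs.open("my-bucket/path/to/file.h5", "rb") as of:
    with h5py.File(of, "r") as h:          # h5py >= 2.9 accepts file-like objects
        print(list(h.keys()))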
Thanks. I was using an existing conda environment. The problem has been resolved after updating.
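If it helps anyone else hitting this, a quick check that the installed h5py is new enough for the file-like-object route above (support landed in h5py 2.9):

import h5py

print(h5py.version.version)  # should be >= 2.9 to pass a file-like object to h5py.File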
First I ran:
Then, attempting to open an HDF5 file with h5py:
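(For illustration only, a rough sketch of the failing call; the mount point below is a placeholder for wherever the bucket was mounted:)

import h5py

# /mnt/gcs is a placeholder mount point for the ll-sc-data bucket
f = h5py.File("/mnt/gcs/hca/immune-cell-census/ica_cord_blood_h5.h5", "r")
# raises: OSError: Unable to open file (file signature not found)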
I'm not sure whether this is likely a bug in h5py or gcsfuse; I can see the file with e.g. ls or os.path.getsize, so gcsfuse seems to be doing its job, but my guess is that there is some syscall or access pattern in h5py that is not playing nicely with gcsfuse. (Mostly-irrelevant discussion of this error message in h5py.)