fsspec / gcsfs

Pythonic file-system interface for Google Cloud Storage
http://gcsfs.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Can't open gcsfuse-mounted HDF5 file with h5py #107

ryan-williams opened this issue 6 years ago

ryan-williams commented 6 years ago

First I ran:

gcsfuse <bucket> /tmp/<bucket>

Then, attempting to open an HDF5 file with h5py:

>>> from h5py import *
>>> input = '/tmp/<bucket>/file'
>>> f = File(input, 'r')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "…/venv/lib/python3.6/site-packages/h5py/_hl/files.py", line 312, in __init__
    fid = make_fid(name, mode, userblock_size, fapl, swmr=swmr)
  File "…/venv/lib/python3.6/site-packages/h5py/_hl/files.py", line 142, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 78, in h5py.h5f.open
OSError: Unable to open file (file signature not found)

I'm not sure whether this is likely a bug in h5py or gcsfuse; I can see the file with e.g. ls or os.path.getsize, so gcsfuse seems to be doing its job, but my guess is that some syscall or access pattern in h5py is not playing nicely with gcsfuse.

(mostly-irrelevant discussion of this error message in h5py)

martindurant commented 6 years ago

I haven't tried specifically with h5py , but we have seen success using xarray to open netcdf/hdf files via gcsfuse. You could turn on logging to see what's going on.
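
Turning on the logging might look something like this (a minimal sketch; the `gcsfs.gcsfuse` logger name is the one that appears in the debug output below):

```python
import logging

# Route all log records to stderr at DEBUG level, then make sure the
# gcsfs FUSE layer's logger is verbose enough to show each read() call.
logging.basicConfig(level=logging.DEBUG)
logging.getLogger("gcsfs.gcsfuse").setLevel(logging.DEBUG)
```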

ryan-williams commented 6 years ago

Ah, thanks for the info!

Here's the output I see with logging on; relevant last bit:

DEBUG:gcsfs.gcsfuse:read(args=('/hca/immune-cell-census/ica_cord_blood_h5.h5', 4096, 67108864, 0), kwargs={})
INFO:gcsfs.gcsfuse:read #0 (ll-sc-data/hca/immune-cell-census/ica_cord_blood_h5.h5) offset: 67108864, size: 4096
INFO:gcsfs.gcsfuse:cache miss
DEBUG:gcsfs.gcsfuse:read(args=('/hca/immune-cell-census/ica_cord_blood_h5.h5', 4035, 67108925, 0), kwargs={})
INFO:gcsfs.gcsfuse:read #0 (ll-sc-data/hca/immune-cell-census/ica_cord_blood_h5.h5) offset: 67108925, size: 4035
INFO:gcsfs.gcsfuse:cache hit
DEBUG:gcsfs.gcsfuse:read(args=('/hca/immune-cell-census/ica_cord_blood_h5.h5', 4096, 134217728, 0), kwargs={})
INFO:gcsfs.gcsfuse:read #0 (ll-sc-data/hca/immune-cell-census/ica_cord_blood_h5.h5) offset: 134217728, size: 4096
INFO:gcsfs.gcsfuse:cache miss
DEBUG:gcsfs.gcsfuse:read(args=('/hca/immune-cell-census/ica_cord_blood_h5.h5', 4035, 134217789, 0), kwargs={})
INFO:gcsfs.gcsfuse:read #0 (ll-sc-data/hca/immune-cell-census/ica_cord_blood_h5.h5) offset: 134217789, size: 4035
INFO:gcsfs.gcsfuse:cache hit
DEBUG:gcsfs.gcsfuse:read(args=('/hca/immune-cell-census/ica_cord_blood_h5.h5', 4096, 268435456, 0), kwargs={})
INFO:gcsfs.gcsfuse:read #0 (ll-sc-data/hca/immune-cell-census/ica_cord_blood_h5.h5) offset: 268435456, size: 4096
INFO:gcsfs.gcsfuse:cache miss
DEBUG:gcsfs.gcsfuse:read(args=('/hca/immune-cell-census/ica_cord_blood_h5.h5', 4035, 268435517, 0), kwargs={})
INFO:gcsfs.gcsfuse:read #0 (ll-sc-data/hca/immune-cell-census/ica_cord_blood_h5.h5) offset: 268435517, size: 4035
INFO:gcsfs.gcsfuse:cache hit
DEBUG:gcsfs.gcsfuse:release(args=('/hca/immune-cell-census/ica_cord_blood_h5.h5', 0), kwargs={})
INFO:gcsfs.gcsfuse:close #0 (ll-sc-data/hca/immune-cell-census/ica_cord_blood_h5.h5)

It seems to be in a loop like the above: at each offset, a 4096-byte read misses the cache, a follow-up read just past it hits, and the offset doubles on the next pass (64MB, 128MB, 256MB).

The file in question is 437788306 bytes (418MB), so this process is ending at the 256MB mark; still not obvious why I end up with OSError: Unable to open file (file signature not found), but will report back if I figure it out.
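
For what it's worth, a quick arithmetic check confirms the observed offsets are in range and that the next doubling would land past end-of-file, which is presumably where the loop stops:

```python
FILE_SIZE = 437_788_306  # bytes, from the comment above

# Offsets seen in the log: each pass doubles the previous one.
offsets = [64 * 2**20, 128 * 2**20, 256 * 2**20]
assert all(off < FILE_SIZE for off in offsets)

# The next doubled offset (512 MiB) is beyond end-of-file.
assert 512 * 2**20 > FILE_SIZE

print(round(FILE_SIZE / 2**20, 1))  # file size in MiB → 417.5
```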

Another confusing bit is that after an initial cache miss at offset 0, I see cache hits all the way up until 64MB, though I thought it should only be caching 2MB at a time. Probably unrelated, but lmk if it's clear what I'm missing there.

Thanks again.

martindurant commented 6 years ago

The caching is a subtle thing (I hope it's clear in the code of gcsfuse): there is a special case for the start of the file, since that metadata tends to get accessed a lot by all operations. So we keep the start of the file in memory regardless, but use read-ahead elsewhere. This is far from the biggest file gcsfuse has been used for. Have you tried opening with xarray? Those were generally netCDF files, but the usage of the driver may be different enough for things to work for you.

shenghuanjie commented 4 years ago

Having the same issue here. Is there a solution for this? I'm using some existing pipelines with h5py.

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/h5d.pyx in h5py.h5d.DatasetID.read()
h5py/_proxy.pyx in h5py._proxy.dset_rw()
h5py/_proxy.pyx in h5py._proxy.H5PY_H5Dread()

OSError: Can't read data (file read failed: time = Thu Feb 20 18:25:04 2020
, filename = '/Users/shengh4/data/.../annotations.h5', file descriptor = 58, errno = 57, error message = 'Socket is not connected', buf = 0x7fd84eff0e00, total read size = 1411, bytes this sub-read = 1411, bytes actually read = 18446744073709551615, offset = 623200228)

martindurant commented 4 years ago

With a newer version of h5py, you shouldn't need to use FUSE at all; just open the file directly:

of = gcs.open('filename', 'rb')
h = h5py.File(of)

shenghuanjie commented 4 years ago

With a newer version of h5py, you shouldn't need to use FUSE at all, just open the file directly

of = gcs.open('filename', 'rb')
h = h5py.File(of)

Thanks. I was using an existing conda environment; the problem was resolved after updating.