fsspec / s3fs

S3 Filesystem
http://s3fs.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

trouble loading netcdf4 files with xarray on s3 #168

Closed scottyhq closed 5 years ago

scottyhq commented 5 years ago

I'm working on allowing direct access to netcdf4/hdf5 file-like objects (https://github.com/pydata/xarray/pull/2782). This seems to be working fine with gcsfs, but not s3fs (version 0.2 from conda-forge). Here is a gist with the relevant code and error traceback:

https://gist.github.com/scottyhq/304a3c4b4e198776b8d82fb3a9f300e3

and an abbreviated traceback here:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/Documents/GitHub/xarray/xarray/backends/file_manager.py in acquire(self, needs_lock)
    166             try:
--> 167                 file = self._cache[self._key]
    168             except KeyError:

~/Documents/GitHub/xarray/xarray/backends/lru_cache.py in __getitem__(self, key)
     40         with self._lock:
---> 41             value = self._cache[key]
     42             self._cache.move_to_end(key)

KeyError: [<function _open_h5netcdf_group at 0x11d8b0ae8>, (<S3File grfn-content-prod/S1-GUNW-A-R-137-tops-20181129_20181123-020010-43220N_41518N-PP-e2c7-v2_0_0.nc>,), 'r', (('group', '/science/grids/data'),)]

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
h5py/h5fd.pyx in h5py.h5fd.H5FD_fileobj_read()

~/miniconda3/envs/test_env/lib/python3.6/site-packages/s3fs/core.py in readinto(self, b)
   1498         data = self.read()
-> 1499         b[:len(data)] = data
   1500         return len(data)

~/miniconda3/envs/test_env/lib/python3.6/site-packages/h5py/h5fd.cpython-36m-darwin.so in View.MemoryView.memoryview.__setitem__()

~/miniconda3/envs/test_env/lib/python3.6/site-packages/h5py/h5fd.cpython-36m-darwin.so in View.MemoryView.memoryview.setitem_slice_assignment()

~/miniconda3/envs/test_env/lib/python3.6/site-packages/h5py/h5fd.cpython-36m-darwin.so in View.MemoryView.memoryview_copy_contents()

~/miniconda3/envs/test_env/lib/python3.6/site-packages/h5py/h5fd.cpython-36m-darwin.so in View.MemoryView._err_extents()

ValueError: got differing extents in dimension 0 (got 8 and 59941567)

The above exception was the direct cause of the following exception:

SystemError                               Traceback (most recent call last)
h5py/h5fd.pyx in h5py.h5fd.H5FD_fileobj_read()

~/miniconda3/envs/test_env/lib/python3.6/site-packages/s3fs/core.py in seek(self, loc, whence)
   1235         """
-> 1236         if not self.readable():
   1237             raise ValueError('Seek only available in read mode')

SystemError: PyEval_EvalFrameEx returned a result with an error set

Any guidance as to what might be going on here would be appreciated!

martindurant commented 5 years ago

That is quite a traceback! I am surprised that gcsfs worked, rather than that s3fs did not - hdf5 is a C-level reader that likes to have a local real file to read from. I have heard that they have tried to allow for python file-like objects, but I don't know how that's implemented - apparently something is subtly different between the two file implementation classes.

leroygr commented 5 years ago

Same problem for me, I can't read a netCDF on S3 using s3fs with h5netcdf:

>>> s3 = s3fs.S3FileSystem(key=os.environ['AWS_DS_AGENT_KEY_ID'],
                    secret=os.environ['AWS_DS_AGENT_ACCESS_KEY'])
>>> fileobj = s3.open(s3_fp)
>>> nc = h5netcdf.File(fileobj,'r', invalid_netcdf=True)
Traceback

```python
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
h5py/h5fd.pyx in h5py.h5fd.H5FD_fileobj_read()

~/miniconda3/envs/s3test/lib/python3.7/site-packages/s3fs/core.py in readinto(self, b)
   1498         data = self.read()
-> 1499         b[:len(data)] = data
   1500         return len(data)

~/miniconda3/envs/s3test/lib/python3.7/site-packages/h5py/h5fd.cpython-37m-x86_64-linux-gnu.so in View.MemoryView.memoryview.__setitem__()

~/miniconda3/envs/s3test/lib/python3.7/site-packages/h5py/h5fd.cpython-37m-x86_64-linux-gnu.so in View.MemoryView.memoryview.setitem_slice_assign_scalar()

~/miniconda3/envs/s3test/lib/python3.7/site-packages/h5py/h5fd.cpython-37m-x86_64-linux-gnu.so in View.MemoryView._memoryviewslice.assign_item_from_object()

TypeError: an integer is required

The above exception was the direct cause of the following exception:

SystemError                               Traceback (most recent call last)
h5py/h5fd.pyx in h5py.h5fd.H5FD_fileobj_read()

~/miniconda3/envs/s3test/lib/python3.7/site-packages/s3fs/core.py in seek(self, loc, whence)
   1235         """
-> 1236         if not self.readable():
   1237             raise ValueError('Seek only available in read mode')

SystemError: PyEval_EvalFrameEx returned a result with an error set

[the same SystemError is re-raised and chained many more times through
 h5py.h5fd.H5FD_fileobj_read() and s3fs/core.py seek(); repeats omitted]

The above exception was the direct cause of the following exception:

SystemError                               Traceback (most recent call last)
in
----> 1 nc = h5netcdf.File(fileobj,'r', invalid_netcdf=True)

~/miniconda3/envs/s3test/lib/python3.7/site-packages/h5netcdf/core.py in __init__(self, path, mode, invalid_netcdf, **kwargs)
    603         else:
    604             self._preexisting_file = mode in {'r', 'r+', 'a'}
--> 605             self._h5file = h5py.File(path, mode, **kwargs)
    606         except Exception:
    607             self._closed = True

~/miniconda3/envs/s3test/lib/python3.7/site-packages/h5py/_hl/files.py in __init__(self, name, mode, driver, libver, userblock_size, swmr, rdcc_nslots, rdcc_nbytes, rdcc_w0, track_order, **kwds)
    392             fid = make_fid(name, mode, userblock_size,
    393                            fapl, fcpl=make_fcpl(track_order=track_order),
--> 394                            swmr=swmr)
    395
    396         if swmr_support:

~/miniconda3/envs/s3test/lib/python3.7/site-packages/h5py/_hl/files.py in make_fid(name, mode, userblock_size, fapl, fcpl, swmr)
    168         if swmr and swmr_support:
    169             flags |= h5f.ACC_SWMR_READ
--> 170         fid = h5f.open(name, flags, fapl=fapl)
    171     elif mode == 'r+':
    172         fid = h5f.open(name, h5f.ACC_RDWR, fapl=fapl)

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/h5f.pyx in h5py.h5f.open()

h5py/defs.pyx in h5py.defs.H5Fopen()

h5py/h5fd.pyx in h5py.h5fd.H5FD_fileobj_read()

~/miniconda3/envs/s3test/lib/python3.7/site-packages/s3fs/core.py in seek(self, loc, whence)
   1234         from start of file, current location or end of file, resp.
   1235         """
-> 1236         if not self.readable():
   1237             raise ValueError('Seek only available in read mode')
   1238         if whence == 0:
```

martindurant commented 5 years ago

If you can post the file somewhere public, I can try to find out what's going on.

pbranson commented 5 years ago

I am keen to see a way to do this without a fuse mount - here is an open file:

import xarray as xr
import s3fs
fs = s3fs.S3FileSystem(anon=True)
s3path = 'imos-data/IMOS/SRS/OC/gridded/aqua/P1D/2010/05/A.P1D.20100507T000000Z.aust.ipar.nc'

fobj = fs.open(s3path)
ds = xr.open_dataset(fobj,engine='h5netcdf')

Produces roughly the same stack trace

martindurant commented 5 years ago

I am getting

      5
      6 fobj = fs.open(s3path)
----> 7 ds = xr.open_dataset(fobj,engine='h5netcdf')

~/anaconda/envs/py36/lib/python3.6/site-packages/xarray/backends/api.py in open_dataset(filename_or_obj, group, decode_cf, mask_and_scale, decode_times, autoclose, concat_characters, decode_coords, engine, chunks, lock, cache, drop_variables, backend_kwargs)
    345     else:
    346         if engine is not None and engine != 'scipy':
--> 347             raise ValueError('can only read file-like objects with '
    348                              "default engine or engine='scipy'")
    349         # assume filename_or_obj is a file-like object

ValueError: can only read file-like objects with default engine or engine='scipy'

However, the problem for you seems to be here:

-> 1499         b[:len(data)] = data

The message implies that the data being inserted is the wrong size; it would be good to debug at that point to see what the buffer b and data contain.
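
A minimal (hypothetical) way to do that inspection is to monkeypatch the readinto shown in the traceback so it reports both sizes just before the failing assignment; a debugging sketch only, not a fix:

```python
import s3fs.core

_orig_readinto = s3fs.core.S3File.readinto  # keep a handle so it can be restored

def readinto_debug(self, b):
    # Report what h5py hands us versus what read() returns, to see why the
    # extents differ (e.g. an 8-byte buffer vs. the whole remaining object).
    data = self.read()
    print("buffer len:", len(b), "data len:", len(data))
    b[:len(data)] = data
    return len(data)

s3fs.core.S3File.readinto = readinto_debug
```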

pbranson commented 5 years ago

Yes - I think there were some recent updates that changed this behaviour to allow h5netcdf with file-like objects

I was using these versions from pip to get the previously mentioned error:

h5netcdf-0.7.1 h5py-2.9.0 pytz-2018.9 xarray-0.12.0


pbranson commented 5 years ago

You are correct - if I check back through the stack trace I am getting the error here:

/opt/conda/lib/python3.6/site-packages/s3fs/core.py in readinto(self, b)
   1498         data = self.read()
-> 1499         b[:len(data)] = data
   1500         return len(data)

/opt/conda/lib/python3.6/site-packages/h5py/h5fd.cpython-36m-x86_64-linux-gnu.so in View.MemoryView.memoryview.__setitem__()

/opt/conda/lib/python3.6/site-packages/h5py/h5fd.cpython-36m-x86_64-linux-gnu.so in View.MemoryView.memoryview.setitem_slice_assignment()

/opt/conda/lib/python3.6/site-packages/h5py/h5fd.cpython-36m-x86_64-linux-gnu.so in View.MemoryView.memoryview_copy_contents()

/opt/conda/lib/python3.6/site-packages/h5py/h5fd.cpython-36m-x86_64-linux-gnu.so in View.MemoryView._err_extents()

ValueError: got differing extents in dimension 0 (got 8 and 24247921)

I will have a go at setting a breakpoint there and taking a look

pbranson commented 5 years ago

Not entirely clear to me how all these libraries tie together but it seems that

h5py:files.py 170 fid = h5f.open(name, flags, fapl=fapl)

is calling out to the HDF5 C-library function "H5Fopen", which is expecting a string filepath, whereas name at this point is an s3fs.S3File object. Somehow passing the "name" parameter is invoking

s3fs:core.py

1497    def readinto(self, b):
1498        data = self.read()
1499        b[:len(data)] = data
1500        return len(data)

where b is a <MemoryView of 'array' at 0x1b458a73be0> but I can't work out where b is instantiated - it only has a length of 8, whereas the binary data read from S3 is much larger - hence the exception.

However, even if that succeeded I don't know how this would work anyway, given the C library is expecting a string file path rather than a binary memory view?

I am surprised that gcsfs worked, rather than that s3fs did not - hdf5 is a C-level reader that likes to have a local real file to read from.

If I get time I will take a look at why gcsfs is working.

martindurant commented 5 years ago

OK, so something "new" :) I would suspect that the memoryview has a complex type other than bytes, and s3fs is trying to fill the buffer with bytes (although it doesn't appear to be an exact multiple). readinto is very rarely used anywhere, surprised to see it, but I suppose the memory must have been allocated in C-land.

btw: the difficulties with hdf are the main reason for interest in libraries like zarr (or zarr as a backend for netcdf), which is known to work well with s3fs/gcsfs/etc. It may or may not be a viable alternative for you.

pbranson commented 5 years ago

Thanks Martin - testing out a 'simple' (albeit slow) way of converting netCDF that is already present in cloud storage into Zarr format, without formally 'mirroring' the files locally somewhere.

Noting that if performance is crucial and you have the AWS budget, then setting up AWS FSx is the obvious way to go (i.e. https://jiaweizhuang.github.io/blog/fsx-experiments/)
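
For anyone following along, the general shape of that conversion (bucket and key names here are placeholders, and chunk sizes would need tuning per dataset) looks roughly like:

```python
import s3fs
import xarray as xr

# Hypothetical paths: stream a netCDF already in S3 through xarray and write
# it back out as Zarr via an S3 key-value mapper, with no local mirror.
fs = s3fs.S3FileSystem()
with fs.open('my-bucket/input/file.nc') as f:
    ds = xr.open_dataset(f, engine='h5netcdf', chunks={})  # dask-backed, lazy read
    store = s3fs.S3Map('my-bucket/output/file.zarr', s3=fs)
    ds.to_zarr(store)
```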


rsignell-usgs commented 5 years ago

@martindurant, I'm hitting this as well. Was hoping to follow up on @rabernat's suggestion to include this in our testing of different options for accessing NetCDF4/HDF5 on S3 (in addition to Zarr and HSDS).

I got the same error you did:

ValueError: can only read file-like objects with default engine or engine='scipy'

when I forgot to install h5netcdf and h5py into my environment.

martindurant commented 5 years ago

@pbranson , did you manage to learn anything about this issue?

martindurant commented 5 years ago

Hm, with everything updated (h5netcdf , h5py, s3fs), the first invocation just worked:

In [5]: fs = s3fs.S3FileSystem(anon=True)
   ...: s3path = 'imos-data/IMOS/SRS/OC/gridded/aqua/P1D/2010/05/A.P1D.20100507T000000Z.aust.ipar.nc'
   ...:
   ...: fobj = fs.open(s3path)
In [6]: nc = h5netcdf.File(fobj, 'r', invalid_netcdf=True)
In [7]: nc
Out[7]:
<h5netcdf.File 'A.P1D.20100507T000000Z.aust.ipar.nc>' (mode r)>
Dimensions:
    latitude: 7001
    longitude: 10001
    time: 1
Groups:
Variables:
    time: ('time',) float64
    latitude: ('latitude',) float64
    longitude: ('longitude',) float64
    ipar: ('time', 'latitude', 'longitude') float32
Attributes:
    history: b'File initialised at 2015-12-17T19:03:50.793738\nInitialised var ipar at 2015-12-17T19:04:36.563452\nAdd Granule A20100507_0230.20150923161152.L2OC_BASE.ipar.nc at 2015-12-17T19:04:38.498914\nAdd Granule A20100507_0235.20150923151200.L2OC_BASE.ipar.nc at 2015-12-17T19:04:38.975299\nAdd Granule A20100507_0240.20150923134822.L2OC_BASE.ipar.nc at 2015-12-17T19:04:39.483551\nAdd Granule A20100507_0245.20150923143121.L2OC_BASE.ipar.nc at 2015-12-17T19:04:39.793043\nAdd Granule A20100507_0405.20150923141146.L2OC_BASE.ipar.nc at 2015-12-17T19:04:40.401902\nAdd Granule A20100507_0410.20150923162326.L2OC_BASE.ipar.nc at 2015-12-17T19:04:40.977119\nAdd Granule A20100507_0415.20150923133857.L2OC_BASE.ipar.nc at 2015-12-17T19:04:41.430398\nAdd Granule A20100507_0420.20150923150036.L2OC_BASE.ipar.nc at 2015-12-17T19:04:41.923474\nAdd Granule A20100507_0540.20150923152402.L2OC_BASE.ipar.nc at 2015-12-17T19:04:42.336277\nAdd Granule A20100507_0545.20150923154421.L2OC_BASE.ipar.nc at 2015-12-17T19:04:43.116328\nAdd Granule A20100507_0550.20150923140042.L2OC_BASE.ipar.nc at 2015-12-17T19:04:43.709527\nAdd Granule A20100507_0555.20150923155628.L2OC_BASE.ipar.nc at 2015-12-17T19:04:44.321537\nAdd Granule A20100507_0600.20150923165701.L2OC_BASE.ipar.nc at 2015-12-17T19:04:44.871419\nAdd Granule A20100507_0720.20150923142308.L2OC_BASE.ipar.nc at 2015-12-17T19:04:45.394833\nAdd Granule A20100507_0725.20150923132636.L2OC_BASE.ipar.nc at 2015-12-17T19:04:46.131246\nAdd Granule A20100507_0730.20150923163350.L2OC_BASE.ipar.nc at 2015-12-17T19:04:46.614609\nAdd Granule A20100507_0735.20150923153102.L2OC_BASE.ipar.nc at 2015-12-17T19:04:47.083167\nAdd Granule A20100507_0740.20150923144622.L2OC_BASE.ipar.nc at 2015-12-17T19:04:47.608014'
    Conventions: b'CF-1.6'
    source_path: b'imos-srs/archive/oc/aqua/1d/v201508/2010/05/A20100507.L2OC_BASE.aust.ipar.ncimos-srs/archive/oc/aqua/1d/v201508/2010/05/A20100507.L2OC_BASE.aust.ipar.ncimos-srs/archive/oc/aqua/1d/v201508/2010/05/A20100507.L2OC_BASE.aust.ipar.nc'

but via xarray it does not.

rabernat commented 5 years ago

@martindurant - how are you invoking xarray?

martindurant commented 5 years ago

fobj = fs.open(s3path)
ds = xr.open_dataset(fobj, engine='h5netcdf')

rabernat commented 5 years ago

That's weird, because it does work with gcsfs.

What xarray error are you getting?

martindurant commented 5 years ago

ValueError: can only read file-like objects with default engine or engine='scipy' (also a few comments higher up the thread)

rabernat commented 5 years ago

Could we point this discussion to a public file instead? That would make debugging easier for me. I don't have any credentials to try the file in question.

When I try with the ERA5 public data, I can't even open it with h5py

fs = s3fs.S3FileSystem(anon=True)
s3path = 'era5-pds/2008/01/data/air_temperature_at_2_metres.nc'
file_obj = fs.open(s3path)
h5 = h5py.File(file_obj, 'r')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
h5py/h5fd.pyx in h5py.h5fd.H5FD_fileobj_read()

/srv/conda/lib/python3.6/site-packages/s3fs/core.py in readinto(self, b)
   1498         data = self.read()
-> 1499         b[:len(data)] = data
   1500         return len(data)

/srv/conda/lib/python3.6/site-packages/h5py/h5fd.cpython-36m-x86_64-linux-gnu.so in View.MemoryView.memoryview.__setitem__()

/srv/conda/lib/python3.6/site-packages/h5py/h5fd.cpython-36m-x86_64-linux-gnu.so in View.MemoryView.memoryview.setitem_slice_assignment()

/srv/conda/lib/python3.6/site-packages/h5py/h5fd.cpython-36m-x86_64-linux-gnu.so in View.MemoryView.memoryview_copy_contents()

/srv/conda/lib/python3.6/site-packages/h5py/h5fd.cpython-36m-x86_64-linux-gnu.so in View.MemoryView._err_extents()

ValueError: got differing extents in dimension 0 (got 8 and 1157316538)

The above exception was the direct cause of the following exception:

SystemError                               Traceback (most recent call last)
h5py/h5fd.pyx in h5py.h5fd.H5FD_fileobj_read()

/srv/conda/lib/python3.6/site-packages/s3fs/core.py in seek(self, loc, whence)
   1235         """
-> 1236         if not self.readable():
   1237             raise ValueError('Seek only available in read mode')

SystemError: PyEval_EvalFrameEx returned a result with an error set

martindurant commented 5 years ago

OK, solved it - and it seems this only happens for some specific files! The reason it works with gcsfs is that it simply doesn't have a readinto method (but it should!), so it seems h5py falls back to read.
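
For what it's worth, the size mismatch goes away if readinto only asks read() for as many bytes as the caller's buffer can hold; a hedged sketch of that idea (not necessarily the exact change made in s3fs):

```python
def readinto(self, b):
    # Fill only as much of the caller-supplied buffer as it can hold,
    # rather than read()-ing the entire remainder of the object.
    data = self.read(len(b))
    b[:len(data)] = data
    return len(data)
```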

martindurant commented 5 years ago

Sorry that it took so long for me to dig this out!

rsignell-usgs commented 5 years ago

I was excited to try this out, but my simple test below is not working for some reason:

import xarray as xr
import s3fs
import h5netcdf

print(xr.__version__)
print(s3fs.__version__)
print(h5netcdf.__version__)

fs = s3fs.S3FileSystem(anon=True)
fileObj = fs.open('esip-pangeo/pangeo/adcirc/adcirc_01.nc')
print(fileObj.info())

produces:

0.12.1
0.2.1
0.7.1
{'ETag': '"79ca97f44f5fed750f6dea35a16f6ac9-4986"', 'Key': 'esip-pangeo/pangeo/adcirc/adcirc_01.nc', 'LastModified': datetime.datetime(2019, 4, 12, 17, 46, 44, tzinfo=tzutc()), 'Size': 26140007264, 'StorageClass': 'STANDARD', 'VersionId': None}

but then this causes the kernel to die:

ds = xr.open_dataset(fileObj, engine='h5netcdf', chunks={'time':10, 'node':141973})

@martindurant , any ideas?

martindurant commented 5 years ago

@rabernat , you verified this working for some other .nc files, correct? A dead kernel suggests an exception in the C library, which would be very hard to diagnose. Running

h5py.File(fileObj, 'r')

has not caused an error for me yet, but it seems to be downloading everything and filling up memory (I know the file is extremely big), so it's possible that the metadata is laid out in a particularly unfriendly way. That still doesn't explain your crash. Perhaps it would be better with default_fill_cache=False for the fs.

I am looking into implementing https://github.com/dask/s3fs/pull/177/ across all filesystems in fsspec, which would be just the thing for a case like this.

rsignell-usgs commented 5 years ago

@martindurant , yes!

fs = s3fs.S3FileSystem(anon=True, default_fill_cache=False)
fileObj = fs.open('esip-pangeo/pangeo/adcirc/adcirc_01.nc')
ds = xr.open_dataset(fileObj, engine='h5netcdf', chunks={'time':10, 'node':141973})

works within a few seconds!

martindurant commented 5 years ago

Good, but also annoying! Making options that tend to work for most people most of the time is hard...

martindurant commented 5 years ago

(I suppose this is why you want to encode all the options required for smooth working of a particular dataset into a catalog...)

rsignell-usgs commented 5 years ago

@martindurant do you think this is an s3fs, xarray, h5netcdf, or h5py issue? 😕

import xarray as xr
import s3fs

fs = s3fs.S3FileSystem(anon=True, default_fill_cache=False)
fileObj = fs.open('nwm-archive/2010/201001110000.CHRTOUT_DOMAIN1.comp')
print(fileObj.size)
ds = xr.open_dataset(fileObj, engine='h5netcdf')

which fails with:

18815129
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-2-99def1d8f6d1> in <module>()
      5 fileObj = fs.open('nwm-archive/2010/201001110000.CHRTOUT_DOMAIN1.comp')
      6 print(fileObj.size)
----> 7 ds = xr.open_dataset(fileObj, engine='h5netcdf')

/opt/conda/lib/python3.6/site-packages/xarray/backends/api.py in open_dataset(filename_or_obj, group, decode_cf, mask_and_scale, decode_times, autoclose, concat_characters, decode_coords, engine, chunks, lock, cache, drop_variables, backend_kwargs, use_cftime)
    392 
    393     with close_on_error(store):
--> 394         ds = maybe_decode_store(store)
    395 
    396     # Ensure source filename always stored in dataset object (GH issue #2550)

/opt/conda/lib/python3.6/site-packages/xarray/backends/api.py in maybe_decode_store(store, lock)
    322             store, mask_and_scale=mask_and_scale, decode_times=decode_times,
    323             concat_characters=concat_characters, decode_coords=decode_coords,
--> 324             drop_variables=drop_variables, use_cftime=use_cftime)
    325 
    326         _protect_dataset_variables_inplace(ds, cache)

/opt/conda/lib/python3.6/site-packages/xarray/conventions.py in decode_cf(obj, concat_characters, mask_and_scale, decode_times, decode_coords, drop_variables, use_cftime)
    468         encoding = obj.encoding
    469     elif isinstance(obj, AbstractDataStore):
--> 470         vars, attrs = obj.load()
    471         extra_coords = set()
    472         file_obj = obj

/opt/conda/lib/python3.6/site-packages/xarray/backends/common.py in load(self)
    118         """
    119         variables = FrozenOrderedDict((_decode_variable_name(k), v)
--> 120                                       for k, v in self.get_variables().items())
    121         attributes = FrozenOrderedDict(self.get_attrs())
    122         return variables, attributes

/opt/conda/lib/python3.6/site-packages/xarray/backends/h5netcdf_.py in get_variables(self)
    135     def get_variables(self):
    136         return FrozenOrderedDict((k, self.open_store_variable(k, v))
--> 137                                  for k, v in self.ds.variables.items())
    138 
    139     def get_attrs(self):

/opt/conda/lib/python3.6/site-packages/xarray/core/utils.py in FrozenOrderedDict(*args, **kwargs)
    330 
    331 def FrozenOrderedDict(*args, **kwargs):
--> 332     return Frozen(OrderedDict(*args, **kwargs))
    333 
    334 

/opt/conda/lib/python3.6/site-packages/xarray/backends/h5netcdf_.py in <genexpr>(.0)
    135     def get_variables(self):
    136         return FrozenOrderedDict((k, self.open_store_variable(k, v))
--> 137                                  for k, v in self.ds.variables.items())
    138 
    139     def get_attrs(self):

/opt/conda/lib/python3.6/site-packages/xarray/backends/h5netcdf_.py in open_store_variable(self, name, var)
    101         data = indexing.LazilyOuterIndexedArray(
    102             H5NetCDFArrayWrapper(name, self))
--> 103         attrs = _read_attributes(var)
    104 
    105         # netCDF4 specific encoding

/opt/conda/lib/python3.6/site-packages/xarray/backends/h5netcdf_.py in _read_attributes(h5netcdf_var)
     42     # bytes attributes to strings
     43     attrs = OrderedDict()
---> 44     for k, v in h5netcdf_var.attrs.items():
     45         if k not in ['_FillValue', 'missing_value']:
     46             v = maybe_decode_bytes(v)

/opt/conda/lib/python3.6/_collections_abc.py in __iter__(self)
    742     def __iter__(self):
    743         for key in self._mapping:
--> 744             yield (key, self._mapping[key])
    745 
    746 ItemsView.register(dict_items)

/opt/conda/lib/python3.6/site-packages/h5netcdf/attrs.py in __getitem__(self, key)
     17         if key in _HIDDEN_ATTRS:
     18             raise KeyError(key)
---> 19         return self._h5attrs[key]
     20 
     21     def __setitem__(self, key, value):

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

/opt/conda/lib/python3.6/site-packages/h5py/_hl/attrs.py in __getitem__(self, name)
     79 
     80         arr = numpy.ndarray(shape, dtype=dtype, order='C')
---> 81         attr.read(arr, mtype=htype)
     82 
     83         if len(arr.shape) == 0:

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/h5a.pyx in h5py.h5a.AttrID.read()

h5py/_proxy.pyx in h5py._proxy.attr_rw()

OSError: Unable to read attribute (no appropriate function for conversion path)

martindurant commented 5 years ago

None of that traceback appears to be in s3fs - are you sure it loads OK from local? If yes, then finding the problem will be tricky, as apparently any exception is being hidden.

rabernat commented 5 years ago

I'm curious how these caches behave with dask / distributed. Are the cache contents serialized, or is the cache cleared before pickling the file object?

martindurant commented 5 years ago

Are the cache contents serialized, or is the cache cleared before pickling the file object?

The files are not sent around at all. What you actually send is an OpenFile object ( https://github.com/dask/dask/blob/master/dask/bytes/core.py#L143 ), which only creates the S3FileSystem object in a with block - so caches do not survive tasks.
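
For illustration, the usual pattern looks roughly like this (the bucket/key are placeholders); the OpenFile itself is what gets pickled, and the real S3 objects are only created inside the with block on each worker:

```python
import fsspec

# A lightweight, picklable description of "how to open this file".
of = fsspec.open('s3://my-bucket/path/file.nc', mode='rb', anon=True)

with of as f:          # S3FileSystem/S3File are created here, per task
    header = f.read(8)
```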

martindurant commented 5 years ago

@rsignell-usgs , how did you comment from the future? :)

martindurant commented 5 years ago

do you think this is an s3fs, xarray, h5netcdf, or h5py issue

I would exclude xarray here. What happens within h5py when it calls S3 is a bit of a mystery - perhaps more logging in s3fs would help; set the logger "s3fs.core" to DEBUG and you'll get some.
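
For example (assuming nothing else has configured logging handlers yet):

```python
import logging

# Send s3fs's internal read/seek logging to the console to see exactly what
# byte ranges h5py is asking for.
logging.basicConfig()
logging.getLogger('s3fs.core').setLevel(logging.DEBUG)
```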

rsignell-usgs commented 5 years ago

@martindurant , I can read the file locally with xarray using the netcdf4 engine, but not with the h5netcdf engine. I can also read the file locally using h5py, so I guess that makes it an h5netcdf issue.

Thanks!

rsignell-usgs commented 5 years ago

I'm also really confused about how I managed to post a comment 5 hours from now. 🙄

martindurant commented 5 years ago

Are the cache contents serialized, or is the cache cleared before pickling the file object?

PS: the file-system is serialised in this process, including directory listings. This is good or bad - you avoid potentially slow lookups when opening the file, but the instance is bigger. I notice that gcsfs does not preserve the listings cache. gcsfs came later and is, in some ways, better designed (hence my attempt to consolidate such things into fsspec).

pbranson commented 5 years ago

If I take a slice from a netcdf opened with s3fs+h5netcdf is it doing some form of byte range request or essentially downloading the entire file into a memory cache and then slicing?

In which case we should always chunk on a file basis when using this method?


martindurant commented 5 years ago

I don't know the internals of h5netcdf, but I would hope it's a range request. You could time reading a whole array versus reading a single value; but it will not be linear, due to fixed costs of each connection and metadata lookups. For a slice, it would depend on the exact layout and chunking. You may want to turn on s3fs debug logging.
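
A rough way to check, assuming ds is the IMOS dataset opened earlier in this thread (it has the ipar variable):

```python
import time

# If only byte ranges are fetched, reading one element should be much faster
# (and transfer far fewer bytes) than pulling the whole variable.
t0 = time.time()
single = ds['ipar'][0, 0, 0].values
t1 = time.time()
whole = ds['ipar'].values
t2 = time.time()
print('single value: %.2fs, full variable: %.2fs' % (t1 - t0, t2 - t1))
```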
