fsspec / filesystem_spec

A specification that python filesystems should adhere to.
BSD 3-Clause "New" or "Revised" License

issues with default block_size #302

Open suvarchal opened 4 years ago

suvarchal commented 4 years ago

Thanks for the wonderful library!

I have an issue using data from a remote HTTP URL; here is an example snippet (originally, the issue arose while using the dataset in an intake catalog):

fs_of = fs.open_files('https://zenodo.org/record/3819896/files/Av.fesom.1948.nc?download=1')
of = fs_of[0].open()
xr.open_dataset(of)
......
~/miniconda3/envs/pyfesom2/lib/python3.7/site-packages/fsspec/caching.py in _fetch(self, start, end)
    337         ):
    338             # First read, or extending both before and after
--> 339             self.cache = self.fetcher(start, bend)
    340             self.start = start
    341         elif start < self.start:

~/miniconda3/envs/pyfesom2/lib/python3.7/site-packages/fsspec/implementations/http.py in _fetch_range(self, start, end)
    317             else:
    318                 raise ValueError(
--> 319                     "Got more bytes (%i) than requested (%i)" % (cl, end - start)
    320                 )
    321         else:
ValueError: Got more bytes (13470624) than requested (5242888)

With a little digging I figured out that fsspec made a range request in http.py while the server sent the entire file, so the response was larger than block_size.

Here is response header from the server:

!curl -I "https://zenodo.org/record/3819896/files/Av.fesom.1948.nc?download=1"

HTTP/1.1 200 OK
Server: nginx/1.16.1
Content-Type: application/octet-stream
Content-Length: 13470624
Content-MD5: 008f40a425030d78cc7de2b2ecec12be
Content-Disposition: attachment; filename=Av.fesom.1948.nc
.....
ETag: "md5:008f40a425030d78cc7de2b2ecec12be"
Last-Modified: Mon, 11 May 2020 09:23:33 GMT
Date: Thu, 21 May 2020 21:40:03 GMT
Accept-Ranges: none
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 59
X-RateLimit-Reset: 1590097264
Retry-After: 60
....

When the server says Accept-Ranges: none (or in some other cases when the header doesn't exist), shouldn't we do a raw download or fetch-all instead of attempting range requests? I think this would add default support for a lot of datasets on (outdated?) servers like Zenodo.
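The check being proposed could look something like the sketch below. This is only an illustration, not fsspec's actual code, and the helper name is hypothetical; the idea is simply that an explicit `Accept-Ranges: none` (or a missing header) should trigger a whole-file download rather than a range request.

```python
# Hypothetical helper (not part of fsspec's API): decide from the
# server's response headers whether byte-range requests are safe.

def supports_range_requests(headers):
    """Return True only if the server advertises byte-range support.

    `headers` is a dict of response headers; keys are normalised to
    lowercase here since HTTP header names are case-insensitive.
    """
    lowered = {k.lower(): v for k, v in headers.items()}
    accept = lowered.get("accept-ranges", "").strip().lower()
    # "none" explicitly disables ranges; a missing header leaves
    # support unknown. Both cases would fall back to a full download.
    return accept == "bytes"

# The Zenodo response above would take the full-download path:
zenodo_headers = {"Accept-Ranges": "none", "Content-Length": "13470624"}
print(supports_range_requests(zenodo_headers))  # False
```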

It does work when using a block_size equal to the file size or higher, like:

fs_of=fs.open_files('https://zenodo.org/record/3819896/files/Av.fesom.1948.nc?download=1', block_size=13470624) # or some high value
of = fs_of[0].open()
xr.open_dataset(of)

but that is a little burden on the user. For some reason block_size=0 did not work; I am not sure if that is from xarray or fsspec, so I am leaving that out for now.
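Until a fallback exists, the hardcoded size in the workaround above can at least be derived from the server's own Content-Length, e.g. via a HEAD request. The helpers below are illustrative, not fsspec API:

```python
# Derive a block_size that covers the whole file from a HEAD request,
# instead of hardcoding 13470624. Helper names are ours, not fsspec's.
import urllib.request

def block_size_from_headers(headers, fallback=5 * 2**20):
    """Use Content-Length as the block size when the server reports it."""
    size = headers.get("Content-Length")
    return int(size) if size else fallback

def fetch_block_size(url):
    """HEAD the URL and derive a block_size from its response headers."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        return block_size_from_headers(dict(resp.headers))

# With the Zenodo headers shown earlier:
print(block_size_from_headers({"Content-Length": "13470624"}))  # 13470624
```

The result could then be passed straight to `fs.open_files(url, block_size=...)` as in the snippet above.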

martindurant commented 4 years ago

Sorry for the slow reply.

When the server says Accept-Ranges: none (or in some other cases when the header doesn't exist), shouldn't we do a raw download or fetch-all instead of attempting range requests?

Yes, I think that would be reasonable. To be sure, the HTTP backend has many special cases for various server behaviour. We are trying to provide a random-access file interface, but in many cases this is hard or not possible. In those cases, you need to either a) provide a streaming interface (i.e., you can read but not seek), or download the whole thing into memory. In this case you do know the size of the target, and you can make a reasonable guess of what to do, but that isn't always the case either.

Perhaps it would be good to refactor the code in HTTPFileSystem (and HTTPFile) to clearly enumerate the set of conditions for server responses/capabilities, and user options on what to do about them. I'm still not sure we can cover all possible cases.

martindurant commented 4 years ago

You might want to try with master now, since the HTTP interface has been substantially rewritten. The File logic is mostly the same though, so it's possible that this issue is still with us.

suvarchal commented 4 years ago

Unfortunately the issue remains:

>>> import fsspec as fs
>>> import xarray as xr
>>> fs_of = fs.open_files('https://zenodo.org/record/3819896/files/Av.fesom.1948.nc?download=1')
>>> of = fs_of[0].open()
>>> xr.open_dataset(of)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/suvarchal/miniconda3/lib/python3.7/site-packages/xarray/backends/api.py", line 529, in open_dataset
    engine = _get_engine_from_magic_number(filename_or_obj)
  File "/home/suvarchal/miniconda3/lib/python3.7/site-packages/xarray/backends/api.py", line 123, in _get_engine_from_magic_number
    magic_number = filename_or_obj.read(8)
  File "/home/suvarchal/miniconda3/lib/python3.7/site-packages/fsspec/implementations/http.py", line 331, in read
    return super().read(length)
  File "/home/suvarchal/miniconda3/lib/python3.7/site-packages/fsspec/spec.py", line 1308, in read
    out = self.cache._fetch(self.loc, self.loc + length)
  File "/home/suvarchal/miniconda3/lib/python3.7/site-packages/fsspec/caching.py", line 333, in _fetch
    self.cache = self.fetcher(start, bend)
  File "/home/suvarchal/miniconda3/lib/python3.7/site-packages/fsspec/asyn.py", line 100, in wrapper
    return maybe_sync(func, self, *args, **kwargs)
  File "/home/suvarchal/miniconda3/lib/python3.7/site-packages/fsspec/asyn.py", line 80, in maybe_sync
    return sync(loop, func, *args, **kwargs)
  File "/home/suvarchal/miniconda3/lib/python3.7/site-packages/fsspec/asyn.py", line 51, in sync
    raise exc.with_traceback(tb)
  File "/home/suvarchal/miniconda3/lib/python3.7/site-packages/fsspec/asyn.py", line 35, in f
    result[0] = await future
  File "/home/suvarchal/miniconda3/lib/python3.7/site-packages/fsspec/implementations/http.py", line 376, in async_fetch_range
    "Got more bytes (%i) than requested (%i)" % (cl, end - start)
ValueError: Got more bytes (13470624) than requested (5242888)

martindurant commented 4 years ago

I should have replied to:

For some reason block_size=0 did not work, i am not sure if that is from xarray or fsspec, so leaving that out for now.

This doesn't work because the library needs random access, which you can't do on a stream. The only random-access solution, as you found, is to download the whole thing into a memory buffer and random-access that - but we can't support this in general, of course, because we'll fill users' memory.

The real solution for you is probably one of the local caching implementations. These should be able to download arbitrary sized files (perhaps needing blocksize=0) regardless of server capabilities, and indeed should now be changed to call fs.get instead of open when there is no compression, since get is often optimised.
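In practice the caching route means chaining "simplecache::" in front of the URL, which makes fsspec download the whole file to local disk first and hand back an ordinary seekable file, sidestepping range requests entirely. A minimal sketch (the helper name is ours; `fsspec.open_local` is real fsspec API):

```python
# Sketch of the simplecache workaround for servers without range support.

def cached_url(url):
    """Prefix a URL with simplecache:: for whole-file local caching."""
    return url if url.startswith("simplecache::") else "simplecache::" + url

url = cached_url("https://zenodo.org/record/3819896/files/Av.fesom.1948.nc?download=1")
print(url)

# Opening it would then look like (network access required):
#   import fsspec, xarray as xr
#   local_path = fsspec.open_local(url)  # downloads once to a temp file
#   ds = xr.open_dataset(local_path)
```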

martindurant commented 4 years ago

The other "real solution" is to use zarr format instead of nc :)

aaronspring commented 2 years ago

Another option for fetching data from Zenodo is taking the link from the JSON export and using simplecache::, which downloads the data first.

Example, with the URL taken from https://zenodo.org/record/4558732/export/json for record https://zenodo.org/record/4558732:

intake_xarray.NetCDFSource('simplecache::https://zenodo.org/api/files/5463d51c-e9e1-4766-8bf5-543d25a62450/SBIO10_Mean_Temperature_of_Warmest_Quarter_5_15cm.tif', xarray_kwargs=dict(engine='rasterio'), chunks='auto').to_dask()

EDIT: the JSON URL is not needed. This also works:

intake_xarray.NetCDFSource('simplecache::https://zenodo.org/record/4558732/files/soilT_2_5_15cm.tif', xarray_kwargs=dict(engine='rasterio'), chunks='auto').to_dask()