suvarchal opened this issue 4 years ago
Sorry for the slow reply.
> when the server says `Accept-Ranges: none`, or in some other cases when the header doesn't exist, shouldn't we by default do a raw download (fetchall) instead of attempting range requests?
Yes, I think that would be reasonable. To be sure, the HTTP backend has many special cases for various server behaviours. We are trying to provide a random-access file interface, but in many cases this is hard or not possible. In those cases, you need to either a) provide a streaming interface (i.e., you can `read` but not `seek`), or b) download the whole thing into memory. In this case you do know the size of the target, and you can make a reasonable guess about what to do, but that isn't always the case either.
Perhaps it would be good to refactor the code in HTTPFileSystem (and HTTPFile) to clearly enumerate the set of conditions for server responses/capabilities, and user options on what to do about them. I'm still not sure we can cover all possible cases.
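As a minimal sketch of what that enumeration could look like: a function that maps server capabilities to a read strategy. The function name and return values here are hypothetical, not fsspec's actual API, and the real logic in HTTPFileSystem/HTTPFile has more cases.

```python
def choose_fetch_strategy(headers, size_known):
    """Pick a read strategy from response headers.

    Hypothetical sketch of the enumeration suggested above; fsspec's
    real logic differs in detail and covers more server quirks.
    """
    accept = headers.get("Accept-Ranges", "").lower()
    if accept == "bytes":
        return "range"      # server honours byte-range requests
    if size_known:
        # No advertised range support, but the size is known:
        # downloading everything into memory is a reasonable guess.
        return "fetchall"
    # Unknown size and no range support: streaming (read, no seek)
    # is all we can offer.
    return "stream"
```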
You might want to try with master now, since the HTTP interface has been substantially rewritten. The File logic is mostly the same though, so it's possible that this issue is still with us.
Unfortunately, the issue remains:
```python
>>> import fsspec as fs
>>> import xarray as xr
>>> fs_of = fs.open_files('https://zenodo.org/record/3819896/files/Av.fesom.1948.nc?download=1')
>>> of = fs_of[0].open()
>>> xr.open_dataset(of)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/suvarchal/miniconda3/lib/python3.7/site-packages/xarray/backends/api.py", line 529, in open_dataset
    engine = _get_engine_from_magic_number(filename_or_obj)
  File "/home/suvarchal/miniconda3/lib/python3.7/site-packages/xarray/backends/api.py", line 123, in _get_engine_from_magic_number
    magic_number = filename_or_obj.read(8)
  File "/home/suvarchal/miniconda3/lib/python3.7/site-packages/fsspec/implementations/http.py", line 331, in read
    return super().read(length)
  File "/home/suvarchal/miniconda3/lib/python3.7/site-packages/fsspec/spec.py", line 1308, in read
    out = self.cache._fetch(self.loc, self.loc + length)
  File "/home/suvarchal/miniconda3/lib/python3.7/site-packages/fsspec/caching.py", line 333, in _fetch
    self.cache = self.fetcher(start, bend)
  File "/home/suvarchal/miniconda3/lib/python3.7/site-packages/fsspec/asyn.py", line 100, in wrapper
    return maybe_sync(func, self, *args, **kwargs)
  File "/home/suvarchal/miniconda3/lib/python3.7/site-packages/fsspec/asyn.py", line 80, in maybe_sync
    return sync(loop, func, *args, **kwargs)
  File "/home/suvarchal/miniconda3/lib/python3.7/site-packages/fsspec/asyn.py", line 51, in sync
    raise exc.with_traceback(tb)
  File "/home/suvarchal/miniconda3/lib/python3.7/site-packages/fsspec/asyn.py", line 35, in f
    result[0] = await future
  File "/home/suvarchal/miniconda3/lib/python3.7/site-packages/fsspec/implementations/http.py", line 376, in async_fetch_range
    "Got more bytes (%i) than requested (%i)" % (cl, end - start)
ValueError: Got more bytes (13470624) than requested (5242888)
```
I should have replied to:
> For some reason `block_size=0` did not work, I am not sure if that is from xarray or fsspec, so leaving that out for now.
This doesn't work because the library needs random access, which you can't do on a stream. The only random-access solution, as you found, is to download the whole thing into a memory buffer and random-access that - but we can't support this in general, of course, because we'll fill users' memory.
The real solution for you is probably one of the local caching implementations. These should be able to download arbitrarily sized files (perhaps needing `block_size=0`) regardless of server capabilities, and indeed should now be changed to call `fs.get` instead of `open` when there is no compression, since `get` is often optimised.
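A sketch of what that caching approach looks like from the user side, using fsspec's `filecache::` URL chaining with the Zenodo URL from this issue (the cache directory is an arbitrary choice; the download itself is shown commented out since it hits the network):

```python
# "filecache::" (or "simplecache::") makes fsspec download the whole
# file to local disk on first open, so the server never needs to
# support Accept-Ranges; subsequent reads are plain local file I/O.
url = "filecache::https://zenodo.org/record/3819896/files/Av.fesom.1948.nc?download=1"

# Shown but not executed here (performs a real download on first use):
# import fsspec
# with fsspec.open(url, mode="rb",
#                  filecache={"cache_storage": "/tmp/fsspec_cache"}) as f:
#     header = f.read(8)  # random access now works against the local copy
```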
The other "real solution" is to use zarr format instead of nc :)
Another option for fetching data from Zenodo is taking the link from the JSON export and using `simplecache::`, which downloads the data first.
Example:
```python
intake_xarray.NetCDFSource('simplecache::https://zenodo.org/api/files/5463d51c-e9e1-4766-8bf5-543d25a62450/SBIO10_Mean_Temperature_of_Warmest_Quarter_5_15cm.tif', xarray_kwargs=dict(engine='rasterio'), chunks='auto').to_dask()
```
with the URL from https://zenodo.org/record/4558732/export/json for record https://zenodo.org/record/4558732
EDIT: the JSON URL is not needed; this also works:
```python
intake_xarray.NetCDFSource('simplecache::https://zenodo.org/record/4558732/files/soilT_2_5_15cm.tif', xarray_kwargs=dict(engine='rasterio'), chunks='auto').to_dask()
```
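For reference, the same `simplecache::` chaining works with fsspec directly, without intake; a sketch, with the actual download commented out since it hits the network (passing the file object to xarray with `engine='rasterio'` mirrors what intake_xarray does above, and is an assumption here):

```python
# simplecache:: downloads the file to a local temporary copy first,
# so no range requests are ever sent to Zenodo.
url = "simplecache::https://zenodo.org/record/4558732/files/soilT_2_5_15cm.tif"

# Shown but not executed here (performs a real download):
# import fsspec
# import xarray as xr
# with fsspec.open(url, mode="rb") as f:
#     ds = xr.open_dataset(f, engine="rasterio")
```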
Thanks for the wonderful library!
I have an issue using data from a remote HTTP URL; here is an example snippet (originally, the issue arose while using the dataset in an intake catalog):
With a little digging I figured out that fsspec made a range request in http.py while the server sent the entire file, and the response was larger than `block_size`.
Here is the response header from the server:
When the server says `Accept-Ranges: none`, or in some other cases when the header doesn't exist, shouldn't we by default do a raw download (fetchall) instead of attempting range requests? I think this would add default support for a lot of datasets on (outdated?) servers like Zenodo.
It does work when using `block_size` equal to the file size or higher, but that is a little burden on the user. For some reason `block_size=0` did not work; I am not sure if that comes from xarray or fsspec, so leaving that out for now.
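The workaround described above can be sketched as follows; the file size comes from the traceback's "Got more bytes (13470624)" message, and knowing it in advance is exactly the burden mentioned (the download itself is commented out since it hits the network):

```python
# Pass a block_size at least as large as the file, so the single fetch
# issued by the read cache covers the entire response the server sends.
url = "https://zenodo.org/record/3819896/files/Av.fesom.1948.nc?download=1"
file_size = 13_470_624   # from the ValueError in the traceback

# Shown but not executed here (real network download):
# import fsspec
# import xarray as xr
# of = fsspec.open_files(url, block_size=file_size)[0].open()
# ds = xr.open_dataset(of)
```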