fsspec / ipfsspec

readonly python fsspec implementation for IPFS
MIT License
21 stars 10 forks

Loading of zarr dataset fails due to missing "ETag" in server response. #17

Closed observingClouds closed 1 week ago

observingClouds commented 2 years ago

What happened: While trying to open the zarr dataset bafybeidqwf7lcs4mo343ntgxiid7n6psvryicuqkppm3wmzad2wdamnpsu with

import xarray as xr
xr.open_dataset("ipfs://bafybeidqwf7lcs4mo343ntgxiid7n6psvryicuqkppm3wmzad2wdamnpsu", engine="zarr")

a KeyError is sometimes raised:

~/mambaforge/envs/how_to_eurec4a/lib/python3.8/site-packages/xarray/backends/api.py in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, backend_kwargs, *args, **kwargs)
    493 
    494     overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
--> 495     backend_ds = backend.open_dataset(
    496         filename_or_obj,
    497         drop_variables=drop_variables,

~/mambaforge/envs/how_to_eurec4a/lib/python3.8/site-packages/xarray/backends/zarr.py in open_dataset(self, filename_or_obj, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta, group, mode, synchronizer, consolidated, chunk_store, storage_options, stacklevel)
    798 
    799         filename_or_obj = _normalize_path(filename_or_obj)
--> 800         store = ZarrStore.open_group(
    801             filename_or_obj,
    802             group=group,

~/mambaforge/envs/how_to_eurec4a/lib/python3.8/site-packages/xarray/backends/zarr.py in open_group(cls, store, mode, synchronizer, group, consolidated, consolidate_on_close, chunk_store, storage_options, append_dim, write_region, safe_chunks, stacklevel)
    363                     stacklevel=stacklevel,
    364                 )
--> 365                 zarr_group = zarr.open_group(store, **open_kwargs)
    366         elif consolidated:
    367             # TODO: an option to pass the metadata_key keyword

~/mambaforge/envs/how_to_eurec4a/lib/python3.8/site-packages/zarr/hierarchy.py in open_group(store, mode, cache_attrs, synchronizer, path, chunk_store, storage_options)
   1165 
   1166     # handle polymorphic store arg
-> 1167     store = _normalize_store_arg(
   1168         store, storage_options=storage_options, mode=mode
   1169     )

~/mambaforge/envs/how_to_eurec4a/lib/python3.8/site-packages/zarr/hierarchy.py in _normalize_store_arg(store, storage_options, mode)
   1055     if store is None:
   1056         return MemoryStore()
-> 1057     return normalize_store_arg(store,
   1058                                storage_options=storage_options, mode=mode)
   1059 

~/mambaforge/envs/how_to_eurec4a/lib/python3.8/site-packages/zarr/storage.py in normalize_store_arg(store, storage_options, mode)
    112     if isinstance(store, str):
    113         if "://" in store or "::" in store:
--> 114             return FSStore(store, mode=mode, **(storage_options or {}))
    115         elif storage_options:
    116             raise ValueError("storage_options passed with non-fsspec path")

~/mambaforge/envs/how_to_eurec4a/lib/python3.8/site-packages/zarr/storage.py in __init__(self, url, normalize_keys, key_separator, mode, exceptions, dimension_separator, **storage_options)
   1138         # Pass attributes to array creation
   1139         self._dimension_separator = dimension_separator
-> 1140         if self.fs.exists(self.path) and not self.fs.isdir(self.path):
   1141             raise FSPathExistNotDir(url)
   1142 

~/mambaforge/envs/how_to_eurec4a/lib/python3.8/site-packages/fsspec/asyn.py in wrapper(*args, **kwargs)
     84     def wrapper(*args, **kwargs):
     85         self = obj or args[0]
---> 86         return sync(self.loop, func, *args, **kwargs)
     87 
     88     return wrapper

~/mambaforge/envs/how_to_eurec4a/lib/python3.8/site-packages/fsspec/asyn.py in sync(loop, func, timeout, *args, **kwargs)
     64         raise FSTimeoutError from return_result
     65     elif isinstance(return_result, BaseException):
---> 66         raise return_result
     67     else:
     68         return return_result

~/mambaforge/envs/how_to_eurec4a/lib/python3.8/site-packages/fsspec/asyn.py in _runner(event, coro, result, timeout)
     24         coro = asyncio.wait_for(coro, timeout=timeout)
     25     try:
---> 26         result[0] = await coro
     27     except Exception as ex:
     28         result[0] = ex

~/mambaforge/envs/how_to_eurec4a/lib/python3.8/site-packages/fsspec/asyn.py in _isdir(self, path)
    531     async def _isdir(self, path):
    532         try:
--> 533             return (await self._info(path))["type"] == "directory"
    534         except IOError:
    535             return False

KeyError: 'type'

Expected behaviour: The dataset is returned without any error.

Potential causes: Debugging the above call by inserting a few print statements into async_ipfs.py
    async def file_info(self, path, session):
        info = {"name": path}

        headers = {"Accept-Encoding": "identity"}  # this ensures correct file size
        res = await self.cid_head(path, session, headers=headers)

        async with res:
            self._raise_not_found_for_status(res, path)
            if res.status != 200:
                # TODO: maybe handle 301 here
                raise FileNotFoundError(path)
            if "Content-Length" in res.headers:
                info["size"] = int(res.headers["Content-Length"])
            elif "Content-Range" in res.headers:
                info["size"] = int(res.headers["Content-Range"].split("/")[1])

            if "ETag" in res.headers:
                etag = res.headers["ETag"].strip("\"")
                info["ETag"] = etag
                if etag.startswith("DirIndex"):
                    info["type"] = "directory"
                    info["CID"] = etag.split("-")[-1]
                else:
                    info["type"] = "file"
                    info["CID"] = etag

        print(f"Info: {info}", flush=True)  # debug print
        print(res.status)  # debug print
        print(res.headers)  # debug print
        return info

reveals that the "ETag" is not always returned by the server. While the header looks like

Info: {'name': 'bafybeidqwf7lcs4mo343ntgxiid7n6psvryicuqkppm3wmzad2wdamnpsu', 'ETag': 'DirIndex-2b567f6r5vvdg_CID-bafybeidqwf7lcs4mo343ntgxiid7n6psvryicuqkppm3wmzad2wdamnpsu', 'type': 'directory', 'CID': 'bafybeidqwf7lcs4mo343ntgxiid7n6psvryicuqkppm3wmzad2wdamnpsu'}
200
<CIMultiDictProxy('Server': 'openresty', 'Date': 'Sat, 18 Jun 2022 23:03:06 GMT', 'Content-Type': 'text/html', 
'Connection': 'keep-alive', 'Vary': 'Accept-Encoding', 'Access-Control-Allow-Methods': 'GET', 
'Etag': '"DirIndex-2b567f6r5vvdg_CID-bafybeidqwf7lcs4mo343ntgxiid7n6psvryicuqkppm3wmzad2wdamnpsu"', 
'X-Ipfs-Gateway-Host': 'ipfs-bank6-fr2', 
'X-Ipfs-Path': '/ipfs/bafybeidqwf7lcs4mo343ntgxiid7n6psvryicuqkppm3wmzad2wdamnpsu', 
'X-Ipfs-Roots': 'bafybeidqwf7lcs4mo343ntgxiid7n6psvryicuqkppm3wmzad2wdamnpsu', 
'X-IPFS-POP': 'ipfs-bank6-fr2', 'Access-Control-Allow-Origin': '*', 
'Access-Control-Allow-Methods': 'GET, POST, OPTIONS',
 'Access-Control-Allow-Headers': 'X-Requested-With, Range, Content-Range, X-Chunked-Output, X-Stream-Output', 
'Access-Control-Expose-Headers': 'Content-Range, X-Chunked-Output, X-Stream-Output', 
'X-IPFS-LB-POP': 'gateway-bank2-fr2',
 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains; preload', 'X-Proxy-Cache': 'MISS')>
Info: {'name': 'bafybeidqwf7lcs4mo343ntgxiid7n6psvryicuqkppm3wmzad2wdamnpsu', 
'ETag': 'DirIndex-2b567f6r5vvdg_CID-bafybeidqwf7lcs4mo343ntgxiid7n6psvryicuqkppm3wmzad2wdamnpsu', 
'type': 'directory',
'CID': 'bafybeidqwf7lcs4mo343ntgxiid7n6psvryicuqkppm3wmzad2wdamnpsu'}

for a successful request, the "ETag" is missing when the request fails:

Info: {'name': 'bafybeidqwf7lcs4mo343ntgxiid7n6psvryicuqkppm3wmzad2wdamnpsu'}
200
<CIMultiDictProxy('Server': 'nginx/1.14.0 (Ubuntu)', 'Date': 'Sun, 19 Jun 2022 10:10:27 GMT', 'Content-Type': 'text/html', 
'Connection': 'keep-alive', 'Access-Control-Allow-Headers': 'Content-Type', 
'Access-Control-Allow-Headers': 'Range', 'Access-Control-Allow-Headers': 'User-Agent', 
'Access-Control-Allow-Headers': 'X-Requested-With', 
'Access-Control-Allow-Methods': 'GET', 'Access-Control-Allow-Methods': 'HEAD',
 'Access-Control-Allow-Origin': '*', 'Access-Control-Expose-Headers': 'Content-Range',
 'Access-Control-Expose-Headers': 'X-Chunked-Output', 
'Access-Control-Expose-Headers': 'X-Stream-Output', 
'X-Ipfs-Path': '/ipfs/bafybeidqwf7lcs4mo343ntgxiid7n6psvryicuqkppm3wmzad2wdamnpsu')>

Without the "ETag", the "type" key is not set. https://github.com/fsspec/ipfsspec/blob/8eb96dfb0ffdebb099a47a77d2b4653988b4a0b8/ipfsspec/async_ipfs.py#L45-L53
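The failure can be reproduced without network access. The following is a minimal sketch, not ipfsspec's actual code: `info_from_headers` is a hypothetical helper that mirrors the header handling of `file_info` above, and `isdir` mirrors the `_isdir` logic shown in the traceback, which only catches `IOError`. Plain dicts stand in for the case-insensitive header object used by the real client.

```python
def info_from_headers(path, headers):
    """Hypothetical stand-in for the header handling in file_info()."""
    info = {"name": path}
    if "Content-Length" in headers:
        info["size"] = int(headers["Content-Length"])
    if "ETag" in headers:
        etag = headers["ETag"].strip('"')
        info["ETag"] = etag
        if etag.startswith("DirIndex"):
            info["type"] = "directory"
            info["CID"] = etag.split("-")[-1]
        else:
            info["type"] = "file"
            info["CID"] = etag
    # Without an ETag header, neither "type" nor "CID" is set.
    return info


def isdir(info):
    """Mirrors the _isdir logic: a KeyError is not an IOError, so it escapes."""
    try:
        return info["type"] == "directory"
    except IOError:
        return False


cid = "bafybeidqwf7lcs4mo343ntgxiid7n6psvryicuqkppm3wmzad2wdamnpsu"

# Response shaped like the successful request above.
with_etag = info_from_headers(cid, {"ETag": f'"DirIndex-2b567f6r5vvdg_CID-{cid}"'})
# Response shaped like the failing request above (no ETag header).
without_etag = info_from_headers(cid, {"X-Ipfs-Path": f"/ipfs/{cid}"})

print(isdir(with_etag))
try:
    isdir(without_etag)
except KeyError as err:
    print("KeyError:", err)
```

The second call raises the same `KeyError: 'type'` as in the traceback, because the missing key is only discovered by the caller, after `file_info` has already returned successfully.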

Does this mean that the success of the function call depends on which IPFS peer responds quickest?

d70-t commented 2 years ago

This is related to https://github.com/ipfs/go-ipfs/issues/8528: we need a way of telling whether a CID or IPFS path resolves to a directory or to a file (that's needed for fsspec's info() method as well as isdir(), isfile() etc.).

According to the issue mentioned above, checking the ETag is an awkward but recommended way of doing this. Apparently it does not work in all cases. We'll probably have to exclude some gateways from our default list if they dropped support for this, or otherwise find other ways of telling files and directories apart from what we get.

d70-t commented 2 years ago

So apparently https://gateway.pinata.cloud doesn't return etags, but is able to deliver the dataset. That's unfortunate, but I don't see a good way of getting what we need for info() from their response. Thus we might have to drop that gateway from the default list...

observingClouds commented 2 years ago

Thanks for looking into this! This is a pity. Maybe we should approach them and inform them about this issue with their service.

So, a quick solution would be to define the environment variable IPFSSPEC_GATEWAYS and exclude the Pinata gateway or any other gateway that does not provide ETags. I can work with that for now, but I agree that the gateway should be dropped from the default list so the UX is better.
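A minimal sketch of that workaround, assuming IPFSSPEC_GATEWAYS is read as a whitespace-separated list of gateway URLs (this format is an assumption; check the README of the installed ipfsspec version). The candidate list and the helper name `gateways_without` are illustrative and are not ipfsspec's defaults.

```python
import os

# Illustrative candidates only; not ipfsspec's actual default list.
CANDIDATE_GATEWAYS = [
    "https://ipfs.io",
    "https://dweb.link",
    "https://gateway.pinata.cloud",
]


def gateways_without(gateways, excluded_hosts):
    """Return the gateways whose URL contains none of the excluded hosts."""
    return [g for g in gateways if not any(h in g for h in excluded_hosts)]


selected = gateways_without(CANDIDATE_GATEWAYS, ["gateway.pinata.cloud"])

# Assumption: whitespace-separated URLs. Set this before ipfsspec is first used,
# since the variable is presumably read when the filesystem is created.
os.environ["IPFSSPEC_GATEWAYS"] = " ".join(selected)
print(os.environ["IPFSSPEC_GATEWAYS"])
```

Setting the variable in the shell before starting Python would have the same effect and avoids any ordering concerns within the script.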

d70-t commented 1 week ago

I'm currently no longer able to retrieve the referenced dataset. However, since version 0.5.0, ipfsspec shouldn't depend on ETags anymore, so I assume this error no longer occurs and I'll close the issue.