Closed: martinjanssens closed this issue 8 months ago
Thanks for raising the issue! The c3ontext dataset is hosted on the IPFS network, so your file request likely timed out because the providing peers could not be found in time. This is a known issue and can be circumvented by providing an explicit list of peers that are likely to provide the datasets in question. Instead of searching the whole IPFS network for the data, you supply a list of peers that should be contacted first. This will become especially handy when big data centres like DKRZ join the network and host e.g. complete collections of experiments.
So how can you minimise the failure rate? Note that the `use_ipfs` argument only refers to the intake catalog itself, not to its entries. So `use_ipfs` is great for citing a specific version of a catalog, but independently of this argument the entries can still be hosted on IPFS or not.
Please let me know if this helps!
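As an illustration of the peer-list idea (this is not the exact eurec4a mechanism; the multiaddress below is a placeholder, and real addresses would come from e.g. the eurec4a/ipfs_tools repository), one can ask a local IPFS daemon to connect to known providers before requesting data:

```python
import shutil
import subprocess

# Placeholder multiaddress; substitute the addresses of peers known to pin
# the EUREC4A datasets (e.g. from the eurec4a/ipfs_tools repository).
KNOWN_PEERS = [
    "/dns4/ipfs.example.org/tcp/4001/p2p/QmPeerIdPlaceholder",
]

def swarm_connect_command(multiaddr: str) -> list:
    """Build the `ipfs swarm connect` command for one peer address."""
    return ["ipfs", "swarm", "connect", multiaddr]

# Only attempt the connection if an IPFS binary is available locally;
# failures are non-fatal since this is just a hint to the daemon.
if shutil.which("ipfs"):
    for peer in KNOWN_PEERS:
        subprocess.run(swarm_connect_command(peer), check=False)
```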
Thanks for the elaborate answer, Hauke, and for patiently trying to educate me on how this all works. I've not been around IPFS before, but it looks really nice. Your suggestion makes sense. However, I wonder whether the root cause lies elsewhere, because I can actually get the data if I try loading it explicitly with `xr.open_zarr` or `xr.open_dataset`, i.e. something like

```python
ds = xr.open_zarr('https://ipfs.io/ipfs/QmRDFjQ7Gxu6cHWFKaQXrodaujCu1VmvNKM3dpZsJzYt88')
```

works, so presumably I must be able to find a peer.
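The same direct-gateway trick works for any CID. A minimal sketch of a helper that builds such gateway URLs (the gateway host is just an example; any public HTTP gateway would do):

```python
def gateway_url(cid: str, path: str = "", gateway: str = "https://ipfs.io") -> str:
    """Build an HTTP gateway URL for an IPFS CID, optionally with a subpath."""
    base = f"{gateway}/ipfs/{cid}"
    return f"{base}/{path}" if path else base

# For instance, the URL used in the snippet above:
# gateway_url("QmRDFjQ7Gxu6cHWFKaQXrodaujCu1VmvNKM3dpZsJzYt88")
```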
I've narrowed it down to the issue being distribution- or version-specific, because I can't reproduce the error on my laptop, where running the original example with exactly the same environment, i.e. the same versions of `dask`, `eurec4a`, `fsspec`, `intake`, `xarray` and `zarr`, just works. I'm not savvy enough to figure out exactly what is going wrong, but at least working around the intake catalog is an option.
This is not really a solution, but merely some thoughts I have on this.
The underlying `ipfsspec` Python library (the one which handles `ipfs://` URLs) tries to fetch the data via publicly available gateways if no local gateway (e.g. an IPFS daemon) is found. While this approach tries to be friendly to new users, there may still be some issues, e.g.:

- If an object can't be found (e.g. `QmWkQzqYMJMbFMGt6wy5jjYELmEx49QpwqXKa6aru2JMxj` in your case), it's fundamentally impossible to decide if it really does not exist or if the nodes hosting it are just very slow. Automatically switching between gateways could make things worse in this case, because a switch might happen just before one gateway would have found the data. Using only one gateway and sticking to it would help here, but it's hard to automatically determine which gateway we want to stick to, thus `ipfsspec` provides the environment variable `IPFSSPEC_GATEWAYS` for this purpose.
- There may be room for improvement in `ipfsspec` and how it communicates errors back to `fsspec` and `zarr`. In particular, it might be better to always raise something like a timeout in cases where it's not possible to distinguish between "not there" and "slow to access".

There are also cases in which it's possible to say with certainty that something isn't there, namely if the object `QmWkQzqYMJMbFMGt6wy5jjYELmEx49QpwqXKa6aru2JMxj` is found, one can inspect all of its contents:
```
$ ipfs ls QmWkQzqYMJMbFMGt6wy5jjYELmEx49QpwqXKa6aru2JMxj
bafkreib6b3kjo7twbxufhvsxonasipxbkhcqwcziz6d2wo7ez5trflcelm 573  .zattrs
bafkreibdqn2g4z5uxtbhmkz7cahqnq72fvprjgvvvds5uxjtkikgjiazle 24   .zgroup
bafkreihe5gl3mit4gcmojxycp3nedzbdyufmaddy26ycqxo24kjnjovu7a 4650 .zmetadata
QmSQPHLq8C6aeDEyT38WYpDmy2x2vcVnUYRswYLAFhpVGL              -    date/
QmecUxSdXoEW35Ejj5jRPCDExdBFtrE7rhTY5RKgsGYwBb              -    freq/
QmaBAo5WQwgRbczSNvRbyyP7Bps15DL4uvLBEstftA3PDH              -    latitude/
QmTSr8XAoQa7jUzantjRZmznALUx6BoUx2FM5jMH9RDsG8              -    longitude/
QmSmv5aD3t3N1vEsJ5aZP3e8rZFJ8qy3cJ6EhLwCxcj3kA              -    nb_users/
Qmd8BVQwEyJQgutu1zy9qqy4p8bhiC31FgpZntEPEd4LJS              -    pattern/
```
Here we can see that `.zmetadata` exists (and will always exist in `QmWkQzqYMJMbFMGt6wy5jjYELmEx49QpwqXKa6aru2JMxj`, and will always have the CID `bafkreihe5gl3mit4gcmojxycp3nedzbdyufmaddy26ycqxo24kjnjovu7a`), whereas e.g. `foo.txt` does not exist (and will never exist in `QmWkQzqYMJMbFMGt6wy5jjYELmEx49QpwqXKa6aru2JMxj`).
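To make the immutability point concrete, here is the `ipfs ls` listing expressed as a plain mapping: within an immutable IPFS directory object, each name resolves to exactly one child CID forever, and any other name provably resolves to nothing.

```python
# Entries copied from the `ipfs ls` output above (sizes omitted).
LISTING = {
    ".zattrs": "bafkreib6b3kjo7twbxufhvsxonasipxbkhcqwcziz6d2wo7ez5trflcelm",
    ".zgroup": "bafkreibdqn2g4z5uxtbhmkz7cahqnq72fvprjgvvvds5uxjtkikgjiazle",
    ".zmetadata": "bafkreihe5gl3mit4gcmojxycp3nedzbdyufmaddy26ycqxo24kjnjovu7a",
    "date/": "QmSQPHLq8C6aeDEyT38WYpDmy2x2vcVnUYRswYLAFhpVGL",
    "freq/": "QmecUxSdXoEW35Ejj5jRPCDExdBFtrE7rhTY5RKgsGYwBb",
    "latitude/": "QmaBAo5WQwgRbczSNvRbyyP7Bps15DL4uvLBEstftA3PDH",
    "longitude/": "QmTSr8XAoQa7jUzantjRZmznALUx6BoUx2FM5jMH9RDsG8",
    "nb_users/": "QmSmv5aD3t3N1vEsJ5aZP3e8rZFJ8qy3cJ6EhLwCxcj3kA",
    "pattern/": "Qmd8BVQwEyJQgutu1zy9qqy4p8bhiC31FgpZntEPEd4LJS",
}

def resolve(name: str):
    """Return the child CID for `name`, or None if it provably doesn't exist."""
    return LISTING.get(name)
```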
So likely the true error in your case is that an access to `QmWkQzqYMJMbFMGt6wy5jjYELmEx49QpwqXKa6aru2JMxj` led to a timeout, and it was then not possible to determine whether `.zmetadata` is there or not. Unfortunately it's likely hard to track this down, but the more information we can gather, the better we will become.
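A sketch combining the two mitigations mentioned above. The `IPFSSPEC_GATEWAYS` variable is real, but the exact format it expects should be checked against the ipfsspec documentation; `classify_failure` is a hypothetical helper illustrating the proposed error semantics, not ipfsspec's actual behaviour:

```python
import os

# Pin ipfsspec to a single public gateway so that repeated requests hit the
# same node (set this before the ipfs filesystem is first instantiated).
os.environ["IPFSSPEC_GATEWAYS"] = "https://ipfs.io"

def classify_failure(http_status: int, timed_out: bool) -> type:
    """Hypothetical: map a gateway response to the exception class that
    should propagate to fsspec/zarr. A timeout is NOT proof of absence."""
    if timed_out or http_status in (429, 504):
        # Slow or overloaded gateway: absence cannot be concluded,
        # so raise a timeout rather than a "not found" error.
        return TimeoutError
    if http_status == 404:
        # A definitive "not found" answer from the gateway.
        return FileNotFoundError
    return OSError
```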
Thanks a lot for the elaborate thoughts: they are helpful! Given what I think I understand from them, I don't have much to add, other than two notes:

The `Missing .zmetadata` error is a generic error message that doesn't really help track down what goes wrong. So for anyone else who might experience this: I didn't get far tracing the error stack; it eventually just led to an `xr.open_dataset` function call, which then broke somewhere beyond my understanding, and which does not break if I call that function directly. The reason I'm not sure it's a timeout error in the end is that I have experienced those too, but they typically take a while (order 10 s). The above fails almost immediately for me (<1 s). But again, I admit this is all circumstantial evidence.

Thanks again.
Hi @martinjanssens,
just a short comment on your second point. It would be nice if HPC centres would run a systemwide IPFS daemon and make sure that it can communicate well across firewalls etc. However, you should still be able to run your own local IPFS daemon even if you do not have root access. Please try the following:
```shell
IPFS_VERSION=0.12.0
wget https://dist.ipfs.io/go-ipfs/v${IPFS_VERSION}/go-ipfs_v${IPFS_VERSION}_linux-amd64.tar.gz
tar -xvzf go-ipfs_v${IPFS_VERSION}_linux-amd64.tar.gz
pushd go-ipfs
bash install.sh
popd
ipfs --version
ipfs init --profile server
curl https://raw.githubusercontent.com/eurec4a/ipfs_tools/main/add_peers.sh | bash
touch ipfs.log  # ensure the file exists so that `tail` doesn't fail
ipfs daemon 2>ipfs.log | grep -i -o -m1 'Daemon is ready' & tail -f --pid=$! ipfs.log
ipfs cat /ipfs/QmQPeNsJPyVWPFDVHb77w8G42Fvo15z4bG2X8D2GhfbSXc/readme
```
This is basically a copy of https://raw.githubusercontent.com/eurec4a/ipfs_tools/main/install_and_run_ipfs.sh mentioned earlier, but without using `sudo` commands.
I hope this helps.
I'm closing this for now as we have improved the availability of the IPFS pinned files. It remains a good idea though to run a local IPFS node before attempting to retrieve these files.
@observingClouds attempting to load C3ONTEXT following e.g. the example on howto.eurec4a returns me a `FileNotFoundError: QmWkQzqYMJMbFMGt6wy5jjYELmEx49QpwqXKa6aru2JMxj/.zmetadata`. I run into a similar problem when attempting to load other c3ontext sets, e.g. `level3_IR_instant`, and when using the `use_ipfs` argument to `get_intake_catalog()`. You can see the environment specs I'm running with here.