eurec4a / eurec4a-intake

Intake catalogue for EUREC4A field campaign datasets

Missing .zmetadata error when attempting to load C3ONTEXT data through the intake catalog #106

Closed martinjanssens closed 8 months ago

martinjanssens commented 2 years ago

@observingClouds attempting to load C3ONTEXT following e.g. the example on howto.eurec4a, i.e.

import dask
import eurec4a
cat = eurec4a.get_intake_catalog()
ds = cat.c3ontext.level3_IR_daily.to_dask()

returns a `FileNotFoundError: QmWkQzqYMJMbFMGt6wy5jjYELmEx49QpwqXKa6aru2JMxj/.zmetadata`. I run into a similar problem when attempting to load other c3ontext datasets, e.g. level3_IR_instant, and when using the use_ipfs argument to get_intake_catalog(). You can see the environment specs I'm running with here.

observingClouds commented 2 years ago

Thanks for raising the issue! The c3ontext dataset is hosted on the IPFS network, so your file request likely timed out because the providing peers could not be found in time. This is a known issue and can be circumvented by providing an explicit list of peers that are likely to provide the datasets in question: instead of searching the whole IPFS network for the data, you contact a given list of peers first. This will become especially handy when big data centres like DKRZ join the network and host e.g. complete collections of experiments.

So how can you minimise the failure rate?

One thing to keep in mind: the use_ipfs argument only refers to the intake catalog itself, not to its entries. use_ipfs is therefore great for citing a specific version of the catalog, but independent of this argument, the entries themselves may or may not be hosted on IPFS (see the sketch below).
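For illustration, a minimal sketch of that distinction (assuming use_ipfs can simply be set to True):

import eurec4a

# use_ipfs pins a specific, immutable version of the catalog itself;
# independent of this flag, each entry may be hosted on IPFS or elsewhere.
cat = eurec4a.get_intake_catalog(use_ipfs=True)
ds = cat.c3ontext.level3_IR_daily.to_dask()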

Please let me know if this helps!

martinjanssens commented 2 years ago

Thanks for the elaborate answer, Hauke, and for patiently trying to educate me on how this all works. I hadn't come across IPFS before, but it looks really nice, and your suggestion makes sense. However, I wonder whether the root cause lies elsewhere, because I can actually get the data if I load it explicitly with xr.open_zarr or xr.open_dataset, i.e. something like

import xarray as xr

ds = xr.open_zarr('https://ipfs.io/ipfs/QmRDFjQ7Gxu6cHWFKaQXrodaujCu1VmvNKM3dpZsJzYt88')

works, so presumably I must be able to find a peer.

I've whittled it down to this issue being distribution- or version-specific, because I can't reproduce the error on my laptop: running the original example there with exactly the same environment, i.e. the same versions of dask, eurec4a, fsspec, intake, xarray and zarr, just works. I'm not savvy enough to figure out exactly what is going wrong, but at least working around the intake catalog is an option.

d70-t commented 1 year ago

This is not really a solution, merely some thoughts I've gathered around this.

The underlying ipfsspec Python library (the one that handles ipfs:// URLs) tries to fetch the data via publicly available gateways if no local gateway (e.g. an IPFS daemon) is found. While this approach tries to be friendly to new users, there can still be issues, e.g. requests to busy public gateways may time out before the content is located.
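To make that concrete, here is a sketch of how such an ipfs:// URL might be opened directly (assuming ipfsspec and xarray are installed; the CID is the one from the error above):

import xarray as xr

# ipfsspec registers the ipfs:// protocol with fsspec; without a local
# IPFS daemon this request falls back to public gateways and may time out.
ds = xr.open_zarr("ipfs://QmWkQzqYMJMbFMGt6wy5jjYELmEx49QpwqXKa6aru2JMxj")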
There are also cases in which it's possible to say with certainty that something isn't there: if the object QmWkQzqYMJMbFMGt6wy5jjYELmEx49QpwqXKa6aru2JMxj is found, one can inspect all of its contents:

$ ipfs ls QmWkQzqYMJMbFMGt6wy5jjYELmEx49QpwqXKa6aru2JMxj                                                                                                                           
bafkreib6b3kjo7twbxufhvsxonasipxbkhcqwcziz6d2wo7ez5trflcelm 573  .zattrs
bafkreibdqn2g4z5uxtbhmkz7cahqnq72fvprjgvvvds5uxjtkikgjiazle 24   .zgroup
bafkreihe5gl3mit4gcmojxycp3nedzbdyufmaddy26ycqxo24kjnjovu7a 4650 .zmetadata
QmSQPHLq8C6aeDEyT38WYpDmy2x2vcVnUYRswYLAFhpVGL              -    date/
QmecUxSdXoEW35Ejj5jRPCDExdBFtrE7rhTY5RKgsGYwBb              -    freq/
QmaBAo5WQwgRbczSNvRbyyP7Bps15DL4uvLBEstftA3PDH              -    latitude/
QmTSr8XAoQa7jUzantjRZmznALUx6BoUx2FM5jMH9RDsG8              -    longitude/
QmSmv5aD3t3N1vEsJ5aZP3e8rZFJ8qy3cJ6EhLwCxcj3kA              -    nb_users/
Qmd8BVQwEyJQgutu1zy9qqy4p8bhiC31FgpZntEPEd4LJS              -    pattern/

We can see that .zmetadata exists (it will always exist in QmWkQzqYMJMbFMGt6wy5jjYELmEx49QpwqXKa6aru2JMxj and will always have the CID bafkreihe5gl3mit4gcmojxycp3nedzbdyufmaddy26ycqxo24kjnjovu7a), whereas e.g. foo.txt does not exist (and will never exist in QmWkQzqYMJMbFMGt6wy5jjYELmEx49QpwqXKa6aru2JMxj).
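One can exploit this immutability to probe reachability directly; a sketch, assuming the public ipfs.io gateway answers in time:

import json
import fsspec

# The path is content-addressed, so this exact .zmetadata can never change;
# a failure here indicates a gateway/peer timeout, not missing content.
url = "https://ipfs.io/ipfs/QmWkQzqYMJMbFMGt6wy5jjYELmEx49QpwqXKa6aru2JMxj/.zmetadata"
with fsspec.open(url) as f:
    meta = json.load(f)
print(meta["zarr_consolidated_format"])  # 1 for consolidated zarr metadata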

So the true error in your case is likely that an access to QmWkQzqYMJMbFMGt6wy5jjYELmEx49QpwqXKa6aru2JMxj timed out, making it impossible to determine whether .zmetadata is there or not. Unfortunately this is likely hard to track down, but the more information we can gather, the better.

martinjanssens commented 1 year ago

Thanks a lot for the elaborate thoughts, they are helpful! Given what I think I understand from them, I don't have much to add, other than two notes:

Thanks again.

observingClouds commented 1 year ago

Hi @martinjanssens,

just a short comment on your second point: it would be nice if HPC centres ran a system-wide IPFS daemon and made sure it can communicate well across firewalls etc. However, you should still be able to run your own local IPFS daemon even without root access. Please try the following:

IPFS_VERSION=0.12.0

# download and unpack the go-ipfs release
wget https://dist.ipfs.io/go-ipfs/v${IPFS_VERSION}/go-ipfs_v${IPFS_VERSION}_linux-amd64.tar.gz
tar -xvzf go-ipfs_v${IPFS_VERSION}_linux-amd64.tar.gz
# install the ipfs binary
pushd go-ipfs
bash install.sh
popd
ipfs --version
# initialise the local IPFS repository (the server profile avoids local-network discovery)
ipfs init --profile server
# peer with nodes that are known to host EUREC4A datasets
curl https://raw.githubusercontent.com/eurec4a/ipfs_tools/main/add_peers.sh | bash
touch ipfs.log  # ensure the file exists such that `tail` doesn't fail.
# start the daemon in the background and follow its log until it reports ready
ipfs daemon 2>ipfs.log | grep -i -o -m1 'Daemon is ready' & tail -f --pid=$! ipfs.log
# smoke test: read a file over IPFS
ipfs cat /ipfs/QmQPeNsJPyVWPFDVHb77w8G42Fvo15z4bG2X8D2GhfbSXc/readme

This is basically a copy of https://raw.githubusercontent.com/eurec4a/ipfs_tools/main/install_and_run_ipfs.sh mentioned earlier, but without the sudo commands.
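Once the daemon reports ready, the failing example from the top of this issue should resolve via the local gateway rather than a public one:

import eurec4a

# With a local IPFS daemon running, ipfsspec should prefer it over
# the public gateways, avoiding the timeout described above.
cat = eurec4a.get_intake_catalog()
ds = cat.c3ontext.level3_IR_daily.to_dask()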

I hope this helps.

observingClouds commented 8 months ago

I'm closing this for now, as we have improved the availability of the IPFS-pinned files. It remains a good idea, though, to run a local IPFS node before attempting to retrieve these files.