ecmwf / earthkit

Apache License 2.0
Download issue - Land Cover data from CDS #32

Open gritk opened 4 months ago

gritk commented 4 months ago

What happened?

The download of Land Cover data from CDS is probably not possible. This is needed for the development of tutorials and use cases foreseen in the C3S-LOT5 contract.

What are the steps to reproduce the bug?

If you are using the pre-downloaded data, set DOWNLOAD_FROM_CDS to False and set LOCAL_DATA_DIR to where you stored the data.

DOWNLOAD_FROM_CDS = True
LOCAL_DATA_DIR = "../data/"

if DOWNLOAD_FROM_CDS:
    lc_data = ek.data.from_source(
        "cds",
        'satellite-land-cover',
        {
            'year': '2022',
            'version': 'v2.1.1',
            'variable': 'all',
            'format': 'zip',
        }
    )
    lc_data.save(f"{LOCAL_DATA_DIR}/lc_2022.zip")
else:
    lc_data = ek.data.from_source("file", f"{LOCAL_DATA_DIR}/lc_2022.zip")

Version

Python 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:40:32) [GCC 12.3.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.15.0 -- An enhanced Interactive Python. Type '?' for help.

Platform (OS and architecture)

Windows 11 Pro

Relevant log output

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
Cell In[4], line 7
      4 LOCAL_DATA_DIR = "../data/"
      6 if DOWNLOAD_FROM_CDS:
----> 7     lc_data = ek.data.from_source(
      8         "cds",
      9     'satellite-land-cover',
     10     {
     11         'year': '2022',
     12         'version': 'v2.1.1',
     13         'variable': 'all',
     14         'format': 'zip',
     15     }
     16     )
     19     # # This command was used to save the data files in our managed storage,
     20     # #  they are not required for the notebook to run, and your computer will cache the 
     21     # #  results so you don't have to download again
     22     lc_data.save(f"{LOCAL_DATA_DIR}/lc_2022.zip")

File ~/.conda-libs/earthkit/lib/python3.10/site-packages/earthkit/data/sources/__init__.py:143, in from_source(name, lazily, *args, **kwargs)
    140     return from_source_lazily(name, *args, **kwargs)
    142 prev = None
--> 143 src = get_source(name, *args, **kwargs)
    144 while src is not prev:
    145     prev = src

File ~/.conda-libs/earthkit/lib/python3.10/site-packages/earthkit/data/sources/__init__.py:124, in SourceMaker.__call__(self, name, *args, **kwargs)
    117 klass = find_plugin(os.path.dirname(__file__), name, loader)
    119 # if os.environ.get("FIEDLIST_TESTING_ENABLE_MOCKUP_SOURCE", False):
    120 #     from earthkit.data.mockup import SourceMockup
    121 
    122 #     klass = SourceMockup
--> 124 source = klass(*args, **kwargs)
    126 if getattr(source, "name", None) is None:
    127     source.name = name

File ~/.conda-libs/earthkit/lib/python3.10/site-packages/earthkit/data/core/__init__.py:21, in MetaBase.__call__(cls, *args, **kwargs)
     19 obj = cls.__new__(cls, *args, **kwargs)
     20 args, kwargs = cls.patch(obj, *args, **kwargs)
---> 21 obj.__init__(*args, **kwargs)
     22 return obj

File ~/.conda-libs/earthkit/lib/python3.10/site-packages/earthkit/data/sources/cds.py:92, in CdsRetriever.__init__(self, dataset, *args, **kwargs)
     89 nthreads = min(self.settings("number-of-download-threads"), len(requests))
     91 if nthreads < 2:
---> 92     self.path = [self._retrieve(dataset, r) for r in requests]
     93 else:
     94     with SoftThreadPool(nthreads=nthreads) as pool:

File ~/.conda-libs/earthkit/lib/python3.10/site-packages/earthkit/data/sources/cds.py:92, in <listcomp>(.0)
     89 nthreads = min(self.settings("number-of-download-threads"), len(requests))
     91 if nthreads < 2:
---> 92     self.path = [self._retrieve(dataset, r) for r in requests]
     93 else:
     94     with SoftThreadPool(nthreads=nthreads) as pool:

File ~/.conda-libs/earthkit/lib/python3.10/site-packages/earthkit/data/sources/cds.py:104, in CdsRetriever._retrieve(self, dataset, request)
    101 def retrieve(target, args):
    102     self.client().retrieve(args[0], args[1], target)
--> 104 return self.cache_file(
    105     retrieve,
    106     (dataset, request),
    107     extension=EXTENSIONS.get(request.get("format"), ".cache"),
    108 )

File ~/.conda-libs/earthkit/lib/python3.10/site-packages/earthkit/data/sources/__init__.py:62, in Source.cache_file(self, create, args, **kwargs)
     59 if owner is None:
     60     owner = re.sub(r"(?!^)([A-Z]+)", r"-\1", self.__class__.__name__).lower()
---> 62 return cache_file(owner, create, args, **kwargs)

File ~/.conda-libs/earthkit/lib/python3.10/site-packages/earthkit/data/core/caching.py:916, in cache_file(owner, create, args, hash_extra, extension, force, replace)
    912 with FileLock(lock):
    913     if not os.path.exists(
    914         path
    915     ):  # Check again, another thread/process may have created the file
--> 916         owner_data = create(path + ".tmp", args)
    917         os.rename(path + ".tmp", path)
    918         CACHE.update_entry(path, owner_data)

File ~/.conda-libs/earthkit/lib/python3.10/site-packages/earthkit/data/sources/cds.py:102, in CdsRetriever._retrieve.<locals>.retrieve(target, args)
    101 def retrieve(target, args):
--> 102     self.client().retrieve(args[0], args[1], target)

File ~/.conda-libs/earthkit/lib/python3.10/site-packages/cdsapi/api.py:364, in Client.retrieve(self, name, request, target)
    363 def retrieve(self, name, request, target=None):
--> 364     result = self._api("%s/resources/%s" % (self.url, name), request, "POST")
    365     if target is not None:
    366         result.download(target)

File ~/.conda-libs/earthkit/lib/python3.10/site-packages/cdsapi/api.py:519, in Client._api(self, url, request, method)
    517             break
    518         self.error("  %s", n)
--> 519     raise Exception(
    520         "%s. %s."
    521         % (reply["error"].get("message"), reply["error"].get("reason"))
    522     )
    524 raise Exception("Unknown API state [%s]" % (reply["state"],))

Exception: the request you have submitted is not valid. Request too large. Requesting 372 items, limit is 10.

Accompanying data

No response

Organisation

No response

sandorkertesz commented 4 months ago

Thank you for reporting this issue. Please can you provide me with the earthkit-data and cdsapi versions you are using?

In my environment the actual CDS retrieval works; however, from_source crashes at a later stage, when it tries to parse the NetCDF file that the zip file contains (see issue https://github.com/ecmwf/earthkit-data/issues/337).

However, even once that is fixed, the call

lc_data.save(f"{LOCAL_DATA_DIR}/lc_2022.zip")

would not work properly because it would only create a NetCDF file called "lc_2022.zip". This is because lc_data represents a NetCDF file and is decoupled from the zip that originally contained it.
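One way to confirm this on a saved file is to look at its leading bytes instead of trusting the extension. The following is a minimal sketch using only the Python standard library; the helper name sniff_format is hypothetical and is not part of earthkit-data. It assumes the usual file signatures: NetCDF classic files start with "CDF", and NetCDF-4 files use the HDF5 signature.

```python
import zipfile

# File signatures: NetCDF classic starts with b"CDF",
# NetCDF-4 files are HDF5 containers and carry the HDF5 signature.
NETCDF3_MAGIC = b"CDF"
HDF5_MAGIC = b"\x89HDF\r\n\x1a\n"


def sniff_format(path):
    """Guess the container format of a file from its content.

    Hypothetical helper for illustration; returns one of
    "zip", "netcdf4/hdf5", "netcdf3" or "unknown".
    """
    if zipfile.is_zipfile(path):
        return "zip"
    with open(path, "rb") as f:
        head = f.read(8)
    if head.startswith(HDF5_MAGIC):
        return "netcdf4/hdf5"
    if head.startswith(NETCDF3_MAGIC):
        return "netcdf3"
    return "unknown"
```

Calling sniff_format on the file written by lc_data.save(...) shows whether it is a real zip archive or a NetCDF file that merely carries a ".zip" name.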

Unfortunately, there is currently no way in earthkit-data to retrieve data into a user-specified target file without parsing/interpreting the downloaded file(s). So it cannot be used as a simple file retriever! (See issue: https://github.com/ecmwf/earthkit-data/issues/338)

So there are a couple of issues here that we need to sort out before your use case can work. I will let you know when these features are available.

gritk commented 4 months ago

Thank you for the prompt reply!

earthkit-data version - '0.1.1.dev40+g1aaf922'
cdsapi version - 0.6.1

As you wrote, the retrieval works for you. What do I need to change to download at least one file in its original format?

lc_data = ek.data.from_source(
    "cds",
    'satellite-land-cover',
    {
        'year': '2022',
        'version': 'v2.1.1',
        'variable': 'all',
    }
)
lc_data.save(f"{LOCAL_DATA_DIR}/")

sandorkertesz commented 3 months ago

Thanks!

1. I noticed that your download error might be related to your permissions to access the CDS:

Exception: the request you have submitted is not valid. Request too large. Requesting 372 items, limit is 10.

You can check this easily by running the cdsapi code I posted below. If it produces the same error, the issue is definitely not on the earthkit side.

2. I noticed your earthkit-data version is very old. The latest one available is 0.5.6, and I suggest you upgrade to it. However, it is not yet able to download the zip file the way your code tries to. For that purpose I recommend using cdsapi like this:

import cdsapi

cds = cdsapi.Client()
cds.retrieve(
    'satellite-land-cover',
    {
        'year': '2022',
        'version': 'v2.1.1',
        'variable': 'all',
        'format': 'zip',
    },
    'download.zip',
)
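
Once the archive is on disk, it can be reused on later runs and unpacked with the standard library, which mirrors the DOWNLOAD_FROM_CDS / LOCAL_DATA_DIR pattern from the original notebook. This is only a sketch under assumptions: fetch_archive and extract_archive are hypothetical helper names, and the client argument is assumed to be any object with a retrieve(name, request, target) method, such as cdsapi.Client().

```python
import os
import zipfile


def fetch_archive(client, dataset, request, target):
    """Download `target` through `client.retrieve` unless it already exists.

    Hypothetical helper; `client` is assumed to behave like cdsapi.Client().
    """
    if not os.path.exists(target):
        client.retrieve(dataset, request, target)
    return target


def extract_archive(archive, out_dir):
    """Unpack a zip archive into `out_dir` and return the member paths."""
    os.makedirs(out_dir, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(out_dir)
        return [os.path.join(out_dir, name) for name in zf.namelist()]


# Example usage (needs cdsapi, CDS credentials and network access):
#
#   import cdsapi
#   request = {'year': '2022', 'version': 'v2.1.1',
#              'variable': 'all', 'format': 'zip'}
#   archive = fetch_archive(cdsapi.Client(), 'satellite-land-cover',
#                           request, '../data/lc_2022.zip')
#   files = extract_archive(archive, '../data/lc_2022')
```

Passing the client in as an argument keeps the download step separate from the unpacking step, so the extraction can be rerun on a pre-downloaded archive without contacting the CDS.
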
gritk commented 3 months ago

Thank you very much - now it works!