Closed johtoblan closed 1 year ago
This is for the wp3 on the VM
Here is the traceback for the same request with split_all = False
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[1], line 15
3 collection_id = "seasonal-monthly-single-levels"
4 request = {
5 "format": "grib",
6 "originating_centre":"cmcc",
(...)
12 "month": "01",
13 }
---> 15 dsh = download.download_and_transform(
16 collection_id,
17 request,
18 split_all=False,
19 )
File /data/common/mambaforge/envs/wp3/lib/python3.10/site-packages/c3s_eqc_automatic_quality_control/download.py:450, in download_and_transform(collection_id, requests, chunks, split_all, transform_func, transform_func_kwargs, transform_chunks, logger, **open_mfdataset_kwargs)
447 request_list.extend(split_request(request, chunks, split_all))
449 if not transform_chunks or transform_func is None:
--> 450 ds = _download_and_transform_requests(
451 collection_id,
452 tqdm.tqdm(request_list),
453 transform_func,
454 transform_func_kwargs,
455 **open_mfdataset_kwargs,
456 )
457 else:
458 # Cache each chunk separately
459 sources = []
File /data/common/mambaforge/envs/wp3/lib/python3.10/site-packages/c3s_eqc_automatic_quality_control/download.py:379, in _download_and_transform_requests(collection_id, request_list, transform_func, transform_func_kwargs, **open_mfdataset_kwargs)
377 raise TypeError(f"`emohawk` returned {type(ds)} instead of a xr.Dataset")
378 else:
--> 379 ds = xr.open_mfdataset(sources, **open_mfdataset_kwargs)
381 if transform_func is not None:
382 ds = transform_func(ds, **transform_func_kwargs)
File /data/common/mambaforge/envs/wp3/lib/python3.10/site-packages/xarray/backends/api.py:986, in open_mfdataset(paths, chunks, concat_dim, compat, preprocess, engine, data_vars, coords, combine, parallel, join, attrs_file, combine_attrs, **kwargs)
984 closers = [getattr_(ds, "_close") for ds in datasets]
985 if preprocess is not None:
--> 986 datasets = [preprocess(ds) for ds in datasets]
988 if parallel:
989 # calling compute here will return the datasets/file_objs lists,
990 # the underlying datasets will still be stored as dask arrays
991 datasets, closers = dask.compute(datasets, closers)
File /data/common/mambaforge/envs/wp3/lib/python3.10/site-packages/xarray/backends/api.py:986, in <listcomp>(.0)
984 closers = [getattr_(ds, "_close") for ds in datasets]
985 if preprocess is not None:
--> 986 datasets = [preprocess(ds) for ds in datasets]
988 if parallel:
989 # calling compute here will return the datasets/file_objs lists,
990 # the underlying datasets will still be stored as dask arrays
991 datasets, closers = dask.compute(datasets, closers)
File /data/common/mambaforge/envs/wp3/lib/python3.10/site-packages/c3s_eqc_automatic_quality_control/download.py:317, in _preprocess(ds, collection_id, preprocess)
314 ds = cgul.harmonise(ds)
316 # TODO: workaround: sometimes single timestamps are squeezed
--> 317 if "time" not in ds.cf.dims:
318 if "forecast_reference_time" in ds.cf:
319 ds = ds.cf.expand_dims("forecast_reference_time")
File /data/common/mambaforge/envs/wp3/lib/python3.10/site-packages/cf_xarray/accessor.py:1394, in CFAccessor.__getattr__(self, attr)
1393 def __getattr__(self, attr):
-> 1394 return _getattr(
1395 obj=self._obj,
1396 attr=attr,
1397 accessor=self,
1398 key_mappers=_DEFAULT_KEY_MAPPERS,
1399 wrap_classes=True,
1400 )
File /data/common/mambaforge/envs/wp3/lib/python3.10/site-packages/cf_xarray/accessor.py:656, in _getattr(obj, attr, accessor, key_mappers, wrap_classes, extra_decorator)
654 for name in inverted[key]:
655 if name in newmap:
--> 656 raise AttributeError(
657 f"cf_xarray can't wrap attribute {attr!r} because there are multiple values for {name!r}. "
658 f"There is no unique mapping from {name!r} to a value in {attr!r}."
659 )
660 newmap.update(dict.fromkeys(inverted[key], value))
661 newmap.update({key: attribute[key] for key in unused_keys})
AttributeError: cf_xarray can't wrap attribute 'dims' because there are multiple values for 'vertical'. There is no unique mapping from 'vertical' to a value in 'dims'.
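The root cause is that cf_xarray resolves CF names (like "time" or "vertical") to concrete dataset dimension names, and refuses to wrap `dims` when one CF name matches more than one dimension. A minimal stdlib sketch of that resolution logic follows; the dictionary and the `resolve` helper are made-up illustrations, not cf_xarray's actual internals:

```python
# Hedged sketch of the ambiguity check behind the AttributeError above.
# cf_xarray maps CF names to dimension names; the names here are
# hypothetical, chosen only to illustrate the failure mode.
cf_to_dims = {
    "time": ["forecast_reference_time"],
    "vertical": ["depth", "pressure_level"],  # two candidates: ambiguous
}

def resolve(cf_name: str) -> str:
    """Return the unique dimension for a CF name, or raise as cf_xarray does."""
    candidates = cf_to_dims.get(cf_name, [])
    if len(candidates) != 1:
        raise AttributeError(
            f"There is no unique mapping from {cf_name!r} to a value in 'dims'."
        )
    return candidates[0]

print(resolve("time"))  # forecast_reference_time
```

So the check `if "time" not in ds.cf.dims` in `_preprocess` fails not because of "time" itself, but because building the full mapping hits two candidate "vertical" dimensions in one of the downloaded files.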
Hi @johtoblan,
I'm on it; it looks like a bug indeed, but also an easy fix. I'm currently also working on another couple of improvements for the cache, so be aware that later today I might have to clean up the cache on the VM. But I will re-populate it with the request you sent in the first message.
I'll send you a message here when I start cleaning up the cache and when I'm done.
Hi @johtoblan,
I tried a couple of years and I couldn't reproduce the issue. Now I'm trying with all the years you had in your request; it looks like the problem originates from one or more years that are not consistent with each other.
I was hoping to find the data already cached on WP3, but that's not the case. It's taking a long time to download. Did you run that exact request? If not, could you please send me the request you've been running?
(Still, I think it's a bug the way our code is failing)
Hi @malmans2, the original request used leadtime = "2", I see, but all the other parameters described above were the same. Our experience has also been that the error is a bit hard to reproduce.
Hi, I'm a little confused. What is the request that returned this error?
Here is the traceback for the same request with split_all = False
If I run the code you have in the first comment it looks like the download has not been cached, and I can't reproduce the error right away.
Hi
I hope it is OK that I answer this, Johannes? Johannes and I looked at this together earlier today. Did you use all 24 years from 1993 to 2016 with split_all=True? Usually I get the error in the first comment after a while: the first 5-6 downloads in the request work fine, and when it comes to e.g. 5/24 I get the error above. When this error occurs I have been stuck in the queue for a while. When I use split_all=False I get the AttributeError, except when I am using leadtime_month="1". leadtime_month="1" seems to work fine.
The request that gave the error in the first comment was:
import warnings
from c3s_eqc_automatic_quality_control import diagnostics, download, plot
warnings.filterwarnings("ignore")
import cartopy.crs as ccrs
import matplotlib.pyplot as plt
import numpy as np
from matplotlib import cm
centre = "cmcc" # Data centre
system = "35" # Model version
variable_long = "2m_temperature" # Variable name in download request
variable_short = "t2m" # Variable name after download
reanalysis_month = "02" # Must correspond with forecast month + leadtime
reanalysis_year = [str(year) for year in range(1993, 2017)] # Reanalysis years
hindcast_year = [str(year) for year in range(1993, 2017)] # Hindcast years
hindcast_month = "02" # Model start month
leadtime = "1" # Leadtime month
area_name = "global" # Name of area
area_coordinates = [89.5,-179.5,-89.5,179.5] # Area [maxlat,minlon,minlat,maxlon]
collection_idh = "seasonal-monthly-single-levels"
requesth = {
"format": "grib",
"originating_centre": centre,
"system": system,
"variable": variable_long,
"product_type": "monthly_mean",
"year": hindcast_year,
"leadtime_month": leadtime,
"month": hindcast_month,
}
dsh = download.download_and_transform(
collection_idh,
requesth,
split_all=True,
)
Hi there,
The CDS queue is very long right now. I've been able to download a dozen datasets using the code in the first comment, so far so good. I somehow need to reproduce the error in order to debug it. But I'm on it!
I've added a few improvements to the cache (things like requests dict order or type don't matter anymore), and a couple of more informative errors, so at least we'll have a better clue of what's going on. I'm going to wipe the cache and update the environments on the VM tonight, hopefully from now on things will get a bit smoother.
(BTW, if you are all using the same credentials, I suggest that everyone starts using their personal cdsapirc. Maybe it will help the queueing time. See the top cell here: https://github.com/bopen/c3s-eqc-toolbox-template/blob/main/notebooks/01-Application_Template_Overview.ipynb)
~~@annemo1976 sorry I didn't realise it right away, but in the request you shared you can not use split_all. split_all splits all parameters that are iterables (but not strings) into single requests. area_coordinates must be a list though; I'm actually surprised the CDS allowed the request. So in your case you need to pass chunks explicitly, for example chunks={"reanalysis_year": 1, "hindcast_year": 1}.~~
~~Anyways, I'm clearing the cache in a bit and I will run the request overnight. Hopefully we'll wake up with those requests cached.~~
Oops, nevermind: area_coordinates was not actually used in the request.
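For reference, the splitting behaviour described above (every non-string iterable in a request expanded into single-value requests) can be sketched with the standard library. `expand_request` below is a hypothetical stand-in for illustration, not the toolbox's actual `split_request`:

```python
import itertools

def expand_request(request: dict) -> list[dict]:
    # Treat every non-string iterable value as a parameter to split,
    # and emit the cartesian product of single-value requests.
    keys, value_lists = [], []
    for key, value in request.items():
        if hasattr(value, "__iter__") and not isinstance(value, str):
            keys.append(key)
            value_lists.append(list(value))
    out = []
    for combo in itertools.product(*value_lists):
        single = dict(request)
        single.update(zip(keys, combo))
        out.append(single)
    return out

requests = expand_request(
    {"year": ["1993", "1994"], "month": "02", "leadtime_month": "1"}
)
# -> 2 single-year requests; month and leadtime_month pass through unchanged
```

This is also why a list-valued area parameter would be a problem under split_all=True: a list is an iterable, so it would be split into single coordinates rather than sent whole.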
Thank you for looking into it.
After 6 hours I am on 11/24 and it is still running; I have never got this far before :-) Is it possible to download the hindcast data in only one request, as I did last week? Last week I could use split_all=False without getting the AttributeError, get all hindcast data in one request, and it took only 15-30 min.
Now I also understand why area only works when I use split_all=False :-)
Hi @johtoblan and @annemo1976,
Good and baddish news.
Let's start with the good one: I ran a script overnight to populate using concurrent calls and you now have some data available. Here is how you can open the dataset:
from c3s_eqc_automatic_quality_control import download
year_start = 1993
year_stop = 2016
collection_id = "seasonal-monthly-single-levels"
request = {
"year": [str(year) for year in range(year_start, year_stop + 1)],
"originating_centre": "cmcc",
"system": "35",
"variable": "2m_temperature",
"product_type": "monthly_mean",
"month": [f"{month:02d}" for month in range(1, 12 + 1)],
"leadtime_month": ["1"],
"format": "grib",
}
xr_open_mfdataset_kwargs = {
"concat_dim": "forecast_reference_time",
"combine": "nested",
"parallel": True,
}
ds = download.download_and_transform(
collection_id,
request,
chunks={"year": 1, "leadtime_month": 1},
**xr_open_mfdataset_kwargs,
)
So you'll notice that in your case I had to pass a few arguments to xarray. Besides parallel, which just parses the metadata a little faster, concat_dim and combine are needed because the downloaded datasets are a little weird (though I don't know much about this dataset, maybe it's all good). What's happening is that there's a leadtime dimension. Each timestamp has only one leadtime, but because it's not always the same one, the dimension ends up with size 4. Things get even more complicated because all years have 3 unique leadtimes, so the actual dimension in the raw data is 3.
Let's see if this explains better what I mean:
ds.sizes
Frozen({'realization': 40, 'leadtime': 4, 'latitude': 180, 'longitude': 360, 'forecast_reference_time': 288})
# Compute where values are all nans
da = ds["t2m"].isnull().all(set(ds.dims) - {"forecast_reference_time", "leadtime"}).compute()
da.attrs["long_name"] = "0: valid values; 1: all NaNs"
# 3 leadtime out of 4 are always nan
set(da.sum("leadtime").values)
{3}
da.plot(row="leadtime", marker=".", ls="none")
This is very inefficient, as 75% of the dataset is NaNs. To me, leadtime should just be a 1D coordinate with dimension forecast_reference_time.
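The NaN padding comes from the alignment step: concatenating per-chunk files whose leadtime values differ produces the union of all leadtimes, with the positions a file doesn't cover filled in. A stdlib sketch of the effect, with plain dicts standing in for the downloaded files (values are illustrative):

```python
NAN = float("nan")

def outer_concat(files: list[dict]) -> list[list[float]]:
    # Build the union of every leadtime seen in any file; slots a file
    # doesn't cover are filled with NaN, as an outer join would do.
    all_leadtimes = sorted({lt for f in files for lt in f})
    return [[f.get(lt, NAN) for lt in all_leadtimes] for f in files]

# Four monthly files, each valid at a single leadtime (in days).
rows = outer_concat([{28: 250.1}, {31: 251.2}, {30: 249.8}, {29: 250.5}])
# Each row now has 4 slots but only 1 valid value: 75% padding.
```

This mirrors the numbers above: a leadtime dimension of size 4 in which 3 out of 4 positions per timestamp are always NaN.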
Do you know if the issue only shows up with this chunking (yearly data with just 1 leadtime)? In your case we should optimise the chunking as much as possible, so everyone will use the same cached data.
Was this request just a test? What kind of data do you need for your use case (more years? more leadtimes? more variables?...)
Let me know!
Hi Mattia
Unfortunately the different number of days in leadtime makes it more difficult. When downloading directly from the API, it is possible to use the following command when opening the dataset in xarray:
ds = xr.open_dataset(filename, engine='cfgrib', backend_kwargs=dict(time_dims=('forecastMonth', 'time')))
This converts the leadtime from days to forecastMonth (more info here: https://ecmwf-projects.github.io/copernicus-training-c3s/sf-anomalies.html). Could this be a possibility for the toolbox also? If not, I understand that I need to split the download into multiple requests. Splitting it up is only a problem because it takes a very long time in the queue, of course, and we will bring this up as an issue at the meeting in Rome. The code you provided above works fine, so thank you very much :-) I will also start to use chunks. The request we sent was part of a notebook to compare the uncertainty of a forecast to climatology (where the data centre, model version, forecast_year, month, leadtime, variable, and area are given at the top of the notebook and can easily be changed). Here is the input to the notebook:
centre = "cmcc" # Data centre
system = "35" # Model version
variable_long = "2m_temperature" # Variable name in download request
forecast_year = "2023" # Forecast year
hindcast_year = [str(year) for year in range(1993, 2017)] # Hindcast years
month = "01" # Model start month
leadtime = "1" # Leadtime month
area_name = "global" # Name of area
area_coordinates = [89.5,-179.5,-89.5,179.5] # Area [maxlat,minlon,minlat,maxlon]
The notebook only uses data from one month and one leadtime, but needs 24 years of data to calculate the mean uncertainty from the hindcast years 1993-2016.
Best regards Anne-Mette
Good news! I think the kwargs you provided are working as expected. So here is the final code to deal with this dataset:
from c3s_eqc_automatic_quality_control import download
year_start = 1993
year_stop = 2016
collection_id = "seasonal-monthly-single-levels"
request = {
"year": [str(year) for year in range(year_start, year_stop + 1)],
"originating_centre": "cmcc",
"system": "35",
"variable": "2m_temperature",
"product_type": "monthly_mean",
"month": [f"{month:02d}" for month in range(1, 12 + 1)],
"leadtime_month": ["1"],
"format": "grib",
}
ds = download.download_and_transform(
collection_id,
request,
chunks={"year": 1, "leadtime_month": 1},
backend_kwargs={"time_dims": ('forecastMonth', 'time')},
)
print(ds.dims)
Frozen({'realization': 40, 'forecast_reference_time': 288, 'latitude': 180, 'longitude': 360})
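What `time_dims=("forecastMonth", "time")` buys: instead of indexing fields by leadtime in days (which varies with month length and produced the ragged dimension above), each field is indexed by an integer forecast month, which is uniform across start dates. A stdlib sketch of that conversion, under my reading of the C3S convention that the start month counts as month 1 (the helper name is hypothetical):

```python
from datetime import date

def forecast_month(reference: date, valid: date) -> int:
    # Whole calendar months from the forecast start to the valid time,
    # counting the start month itself as 1.
    return (valid.year - reference.year) * 12 + (valid.month - reference.month) + 1

forecast_month(date(1993, 2, 1), date(1993, 2, 1))  # month the run starts: 1
forecast_month(date(1993, 2, 1), date(1993, 3, 1))  # one month out: 2
```

Because this index never depends on whether the intervening months have 28, 30, or 31 days, every per-year file shares the same coordinate values and the NaN-padded outer join disappears.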
Could this be a possibility for the toolbox also? If not, I understand that I need to split the download into multiple requests. Splitting it up is only a problem because it takes a very long time in the queue, of course, and we will bring this up as an issue at the meeting in Rome.
Yes, let's talk about it next week. But here is the general idea: if we coordinate well, the downside of long downloading times should be negligible once things are more stable. If everyone in WP3 uses the same chunking, they will find most of the data they need already cached, and it's just a matter of downloading new data as it becomes available. This is why I'd suggest only using the chunking in the code above.
If you need more/different data, just send me the request and I will run the scripts. As I briefly mentioned, I now have scripts to quickly populate the cache (also useful in case we need to clear it). In the future, ECMWF might also assign us a priority user.
Hi Mattia
Very nice that we can use backend_kwargs in the toolbox as well :-) Using the code above I can also use area, which interpolates the data to a correct grid. Chunking and caching of data seems like a good option; we can talk more about this next week in Rome. Today downloading data from the CDS is much faster: what took forever yesterday takes only a couple of minutes today. Thank you very much for looking into this!! See you next week in Rome :-)
Anne-Mette
Hi WP3ers, I'm closing this as it looks like it's now fixed, and it makes it easier for us to track progress. Feel free to open new issues though!
We have been using split_all = False earlier, to bulk download and prevent the earlier error mentioned in issue #14
This works for the 0th request (welcome, sent request, queued, and finally downloaded), but after the 1st request is queued we get an error with the exception report shown above.
We also have problems specifying area with split_all = True, but we do not have an error message for that right now, due to the queue