bopen / c3s-eqc-toolbox-template

CADS Toolbox template application

GRUAN UQ #1 - download data issue #59

Closed: virginiaciardini closed this issue 11 months ago

virginiaciardini commented 1 year ago

Notebook description

For my analysis I need to download all the available data for 3 GRUAN stations (LIN, NYA and TEN), which I identify in my request through the "area" field, to calculate the lapse rate tropopause at different latitudes. Code and text are still under development, but I need support.

Notebook link or upload

C3S_520_Quality_assessment_Template_gruan_uq1.zip

Anything else we need to know?

Firstly, I did some tests locally on my machine; now I would like to run it on the VM. First issue I faced: when downloading data for a single year, my routine works well, but for the entire data range (e.g. 2006-2015) it doesn't. Secondly, I don't understand exactly how to use "download.download_and_transform" for my purpose. Could you please have a look at the code and give me some advice?

I'm new to Python and I'm working on my first Jupyter Notebook. Before continuing with the analysis, I would like to receive your feedback to optimize the code and fix any issues. Thanks a lot, Virginia

Environment

malmans2 commented 1 year ago

Hi @virginiaciardini,

I don't think I'll be able to work on this today, so I'll probably look at this next week. I'll send you a snippet or a template to show you how to use our software with your dataset!

virginiaciardini commented 1 year ago

Hi @malmans2, ok, thanks!

malmans2 commented 1 year ago

Hi @virginiaciardini,

The template is ready. You can find it here: https://github.com/bopen/c3s-eqc-toolbox-template/blob/main/notebooks/wp5/tropopause.ipynb

You can just change start/stop in the cell at the top, and it should work with the time period you'd like to analyse.

This is what's happening under the hood:

  1. We download data using monthly chunks. Raw data are cached and stored on the VM, so as long as you don't change any parameter of the request (e.g., area or variables), you will only download the data once (I already downloaded and cached most of the data on the VM).
  2. We apply a transformation function to each chunk. In this case, it's compute_tropopause_altitude (see the call sketch right after this list).
  3. We store and cache each transformed chunk in NetCDF (you don't really see this, it's just to let you know what's happening).
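
For reference, the call pattern looks roughly like this. It is only a minimal sketch: the collection_id string and the request keys below are placeholders, and the authoritative values are the ones in the template.

from c3s_eqc_automatic_quality_control import download

# Placeholder dataset name and request keys; the real ones are in the template.
collection_id = "insitu-observations-gruan-reference-network"
request = {
    "format": "csv-lev.zip",
    "variable": ["air_temperature", "altitude"],
    # plus the year/month keys covering the start/stop period set at the top
}

# Each monthly chunk is downloaded, transformed with compute_tropopause_altitude
# (defined in the notebook) and cached; re-running with an identical request
# only reads the cache.
ds = download.download_and_transform(
    collection_id,
    request,
    chunks={"year": 1, "month": 1},
    transform_func=compute_tropopause_altitude,
)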

A few comments:

  1. I'm more familiar with xarray, so I used ds to make the plot. If you prefer to use pandas, you can just do this: df = ds.to_pandas()
  2. I simplified the function that you used to compute the tropopause altitude. It was working OK, but it didn't look very "pythonic". Please make sure that it's working as expected. I only implemented the algorithm to find a single altitude, let me know if you want to add the option to find more levels. If it works OK, we can implement it in our software, so you'll be able to import it directly from c3s_eqc_automatic_quality_control.diagnostics without having to define it in the notebook.
  3. Your data is not too big. I suggest applying transform functions to the whole dataset and excluding useless stations only when you do the analysis/visualization. That way you don't have to re-download data every time (i.e., that's why I got rid of the area parameter).

I've already cached the global tropopause altitude from 2006-07 to 2017-12. So if you run the template changing start/stop only, it should be very quick.

Here are the results I get using the time period in your notebook:

start = "2016-02"
stop = "2016-02"

[figure: tropopause altitude results for 2016-02]

Using the same area as well, I can reproduce your figure:

start = "2016-02"
stop = "2016-02"

request = {
    "area": [55, 10, 50, 15],
    "format": "csv-lev.zip",
    "variable": ["air_temperature", "altitude"],
}

[figure: reproduction of the original figure for 2016-02 with the area subset]

virginiaciardini commented 1 year ago

Hi @malmans2, thanks, I'll test it and I'll let you know if everything is clear to me. Best, Virginia

virginiaciardini commented 1 year ago

Hi @malmans2, I tested the template. Firstly, as you suggested, I applied the transform functions to the whole dataset (I changed start and stop), but then I didn't find how to exclude useless stations. Trying something, I used the "area" as well (as you showed above), but I received this error message:

ValueError                                Traceback (most recent call last)
Cell In[10], line 1
----> 1 ds = download.download_and_transform(
      2     collection_id,
      3     requests,
      4     chunks={"year": 1, "month": 1},
      5     transform_func=compute_tropopause_altitude,
      6 )

File /data/common/mambaforge/envs/wp5/lib/python3.10/site-packages/c3s_eqc_automatic_quality_control/download.py:545, in download_and_transform(collection_id, requests, chunks, split_all, transform_func, transform_func_kwargs, transform_chunks, n_jobs, invalidate_cache, cached_open_mfdataset_kwargs, **open_mfdataset_kwargs)
    540             cacholote.delete(
    541                 func.func, *func.args, request_list=[request], **func.keywords
    542             )
    543         with cacholote.config.set(return_cache_entry=True):
    544             sources.append(
--> 545                 func(request_list=[request]).result["args"][0]["href"]
    546             )
    547     ds = xr.open_mfdataset(sources, **cached_open_mfdataset_kwargs)
    548 else:
    549     # Cache final dataset transformed

File /data/common/mambaforge/envs/wp5/lib/python3.10/site-packages/cacholote/cache.py:86, in cacheable.<locals>.wrapper(*args, **kwargs)
     83             warnings.warn(str(ex), UserWarning)
     84             clean._delete_cache_entry(session, cache_entry)
---> 86 result = func(*args, **kwargs)
     87 cache_entry = database.CacheEntry(
     88     key=hexdigest,
     89     expiration=settings.expiration,
     90     tag=settings.tag,
     91 )
     92 try:

File /data/common/mambaforge/envs/wp5/lib/python3.10/site-packages/c3s_eqc_automatic_quality_control/download.py:434, in _download_and_transform_requests(collection_id, request_list, transform_func, transform_func_kwargs, **open_mfdataset_kwargs)
    431     ds = xr.open_mfdataset(sources, **open_mfdataset_kwargs)
    433 if transform_func is not None:
--> 434     ds = transform_func(ds, **transform_func_kwargs)
    435     if not isinstance(ds, xr.Dataset):
    436         raise TypeError(
    437             f"`transform_func` must return a xr.Dataset, while it returned {type(ds)}"
    438         )

Cell In[5], line 53, in compute_tropopause_altitude(ds)
     51 def compute_tropopause_altitude(ds):
     52     dataarrays = []
---> 53     for report_id, ds_id in ds.groupby(ds["report_id"]):
     54         coords = {"report_id": ("time", [report_id])}
     55         for var, da_coord in ds_id.data_vars.items():

File /data/common/mambaforge/envs/wp5/lib/python3.10/site-packages/xarray/core/dataset.py:9031, in Dataset.groupby(self, group, squeeze, restore_coord_dims)
   9023 from xarray.core.groupby import (
   9024     DatasetGroupBy,
   9025     ResolvedUniqueGrouper,
   9026     UniqueGrouper,
   9027     _validate_groupby_squeeze,
   9028 )
   9030 _validate_groupby_squeeze(squeeze)
-> 9031 rgrouper = ResolvedUniqueGrouper(UniqueGrouper(), group, self)
   9033 return DatasetGroupBy(
   9034     self,
   9035     (rgrouper,),
   9036     squeeze=squeeze,
   9037     restore_coord_dims=restore_coord_dims,
   9038 )

File <string>:6, in __init__(self, grouper, group, obj)

File /data/common/mambaforge/envs/wp5/lib/python3.10/site-packages/xarray/core/groupby.py:335, in ResolvedGrouper.__post_init__(self)
    334 def __post_init__(self) -> None:
--> 335     self.group: T_Group = _resolve_group(self.obj, self.group)
    337     (
    338         self.group1d,
    339         self.stacked_obj,
    340         self.stacked_dim,
    341         self.inserted_dims,
    342     ) = _ensure_1d(group=self.group, obj=self.obj)

File /data/common/mambaforge/envs/wp5/lib/python3.10/site-packages/xarray/core/groupby.py:641, in _resolve_group(obj, group)
    638         newgroup = group
    640 if newgroup.size == 0:
--> 641     raise ValueError(f"{newgroup.name} must not be empty")
    643 return newgroup

ValueError: report_id must not be empty

Could you give me some suggestions? Thanks, Virginia

malmans2 commented 1 year ago

Hi @virginiaciardini,

I added a stations parameter in the latest template, to show you how I would do that: https://github.com/bopen/c3s-eqc-toolbox-template/blob/main/notebooks/wp5/tropopause.ipynb

Looks like the area approach is not working very well because there are months with no data at all. We could easily fix it, but I think the other approach is better (i.e., compute and cache tropopause for the whole dataset, then filter it).
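
In other words, once the tropopause has been computed and cached for all stations, you can subset the result afterwards. A minimal sketch of that idea (assuming the station codes below match the values actually stored in the dataset's station_name variable):

stations = ["LIN", "NYA", "TEN"]  # the three GRUAN stations you mentioned
ds_subset = ds.where(ds["station_name"].isin(stations), drop=True)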

virginiaciardini commented 1 year ago

Hi @malmans2, I tested the "filter stations", thanks a lot.

malmans2 commented 1 year ago

You're welcome! Do you need to add more analyses to this notebook or can we close this issue?

virginiaciardini commented 1 year ago

Hi, now I need to add monthly and seasonal means of the tropopause altitude; I tried to write the routine and then I'll ask for your support to verify it, or for your help if I have problems. Thanks, Virginia


malmans2 commented 1 year ago

OK, the VM is experiencing problems at the moment. Please don't use it until further notice (it should be quick).

virginiaciardini commented 1 year ago

I see. I'll wait for your OK. Thanks

malmans2 commented 1 year ago

The VM is back in business. It was rebooted, so you'll have to re-do the procedure and run jupyter_server. You'll notice that the interface has changed: we're now using JupyterLab instead of Jupyter Notebook, as the latter will eventually be deprecated.

virginiaciardini commented 1 year ago

Hi @malmans2, I modified the JN and again I need your support; after several attempts there's something that I cannot solve.

  1. 2nd figure: I tried to format the x-axis ticks as in the plot above (2006, 2007, … 2020).
  2. I'd like to add another figure, similar to the 2nd figure, but resampling on "time.season"; I tried but it does not work.
  3. I'd like to save LRT into a txt file and then download/copy it onto my local machine.

Could you please have a look at the code and give me some advice? Before continuing and finalizing the JN, I would like to receive your feedback to optimize the code and fix any issues. Thanks C3S_520_Quality_assessment_Template_gruan_uq1.zip

virginiaciardini commented 1 year ago

Hi @malmans2, I'm trying some functions from the statsmodels module, but I receive the following message: ModuleNotFoundError: No module named 'statsmodels'. Could you help me? Thanks,

malmans2 commented 1 year ago

Hi @virginiaciardini,

I've been out of the office for a couple of weeks, so I have a few issues in the backlog and haven't looked at your updated notebook yet.

statsmodels is not part of the Python standard library, so it needs to be installed. Do you want me to install it on the VM?

virginiaciardini commented 1 year ago

Hi @malmans2, thanks. If it is possible, yes, I do.

malmans2 commented 1 year ago

OK, there are a few people using the VM right now. I'll do it overnight to make sure we don't break their environments.

You'll find it installed tomorrow morning, make sure you restart the kernel before importing it.

virginiaciardini commented 1 year ago

thanks

malmans2 commented 1 year ago

Hi @virginiaciardini, sorry again for the delay.

2nd figure: I tried to format the xaxis ticks as plot above (2006, 2007, …2020)

# Monthly means and standard deviations per station, plotted with error bars
for station, da in ds["tropopause"].groupby("station_name"):
    da_resampled = da.resample(time="M")
    da_mean = da_resampled.mean().to_pandas()
    da_std = da_resampled.std().to_pandas()
    da_mean.plot(yerr=da_std, marker=".", label=station)

I’d like to add another figure, similarly to the 2nd figure, but resampling on "time.season"; I tried but it does not work;

Assuming you want DJF, MAM, JJA, SON, I think you can substitute da_resampled with the following (I never used it though, so please make sure that it's correct):

da_resampled = da.resample(time="QS-DEC")
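
If instead you want a single climatological value per season (all DJF months averaged together, and so on), the usual xarray pattern is a groupby on the season coordinate. A minimal sketch, which I haven't run against this dataset:

da_seasonal_mean = da.groupby("time.season").mean()  # one value each for DJF, MAM, JJA, SON
da_seasonal_std = da.groupby("time.season").std()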

I'd like to save LRT into a txt file and then download/copy it onto my local machine

ds.to_pandas().to_csv("my_file.csv")
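
To copy the file to your local machine afterwards, any standard transfer over ssh works. The exact command depends on how you connect to the VM, so the line below is only a generic illustration with placeholder user/host/path, not our specific setup:

scp <your_user>@<vm_host>:/path/to/my_file.csv .
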
virginiaciardini commented 1 year ago

Hi @malmans2, thanks, I'm following your instructions and working on the VM, but I have some problems with the server connection. I was running my JN and this error message appeared: "Server Connection Error. A connection to the Jupyter server could not be established. JupyterLab will continue trying to reconnect. Check your network connection or Jupyter server configuration." My connection is OK; do you know if there are any limitations today? Thanks, Virginia

malmans2 commented 1 year ago

Looks OK now, but maybe there was a hiccup before. Can you try to close and re-open the ssh tunnel? For example, from your local machine, do this to close all ssh tunnels: pkill ssh

Then re-do the usual procedure to work with jupyter (log into the VM, log into your user, go to your directory, run jupyter_server, follow the instructions).
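
For reference, the connection relies on a standard ssh port forward from your local machine. The exact command and the port to use are the ones printed by jupyter_server, so the line below is only a generic illustration with placeholder values:

ssh -N -L <local_port>:localhost:<remote_port> <your_user>@<vm_host>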

virginiaciardini commented 1 year ago

Thanks, now it works

virginiaciardini commented 1 year ago

Hi @malmans2, I'm trying to open my JN. I connect to the VM but I cannot open the JN in the browser. Do you know what the problem could be? Thanks, Virginia

malmans2 commented 1 year ago

You just have to re-do this: https://github.com/bopen/c3s-eqc-toolbox-template/issues/59#issuecomment-1643552234

We had to close the tunnels overnight. Keep in mind that the port assigned to you might have changed, so make sure you copy & paste the new commands printed by jupyter_server.

virginiaciardini commented 1 year ago

Hi @malmans2, I updated my JN (enclosed); I would like to receive your feedback to optimize the code and fix any issues. In figure 3 I tried to use time as the x-axis array but I didn't succeed, so I used the number of records (i.e., number of months). Could you please help me fix it? Thanks, Virginia C3S_520_Quality_assessment_Template_gruan_uq1_v2.zip

malmans2 commented 1 year ago

Hi @virginiaciardini,

I've updated the template: https://github.com/bopen/c3s-eqc-toolbox-template/blob/main/notebooks/wp5/tropopause.ipynb Here is the template executed: https://gist.github.com/malmans2/8fc0093b534d38b8820c5497d6892c57

If I'm understanding correctly how seasonal_decompose works, I think you are supposed to feed it regularly sampled data. Therefore, you need to interpolate missing months rather than dropping them. This is why the results are slightly different compared to your version.
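
A minimal sketch of that idea, assuming da is the monthly tropopause series for a single station and an additive model with a 12-month period (both of these are illustrative choices, not necessarily what the template uses):

from statsmodels.tsa.seasonal import seasonal_decompose

# Monthly means as a pandas Series on a regular monthly index;
# interpolate the months with no data instead of dropping them.
monthly = da.resample(time="M").mean().to_pandas()
monthly = monthly.asfreq("M").interpolate()

result = seasonal_decompose(monthly, model="additive", period=12)
result.plot()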

Please let me know if everything works OK.

virginiaciardini commented 1 year ago

Hi @malmans2, thanks a lot. I've been out of the office, so I'm looking at your update now; I'll let you know if everything works OK. Thanks!

malmans2 commented 11 months ago

Hi @virginiaciardini,

Was this template OK? Can we close this issue?

virginiaciardini commented 11 months ago

Yes, you can. Thanks Virginia