bopen / c3s-eqc-toolbox-template

CADS Toolbox template application
Apache License 2.0

How to handle large arrays #43

Closed. sandrocalmanti closed this issue 1 year ago

sandrocalmanti commented 1 year ago

Describe the solution you'd like

In WP5 I'm using the attached notebook to show temperature anomalies on pressure levels.

The notebook works correctly when selecting a limited domain (for example 10S-10N) and a limited number of years, but I get into trouble when handling larger datasets.

Ideally I would like to compute the average vertical profile of temperature anomaly for the entire globe (-90:90, -180:180) and for the full time series, from 1940 to 2022. In this case, the file is 25GB (I'm using the monthly averaged reanalysis).

I guess others may have already had this problem, but I couldn't find any past issue on the subject. I expect to hit similar issues in our work for WP3 on seasonal forecasts.

How do I handle large arrays in general? Is it with the download_and_transform function?

my_ipynb.zip

malmans2 commented 1 year ago

Hi @sandrocalmanti,

Yes, it's the download_and_transform function. The chunks argument lets you split the request into several smaller requests (e.g., one request per year, or one per month, ...). Furthermore, the transform_func is applied to each chunk and cached separately (particularly useful for data reduction, since much smaller data is saved on disk). Finally, we use dask under the hood, a library that allows out-of-memory computations.
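
Roughly, a call looks like this (a sketch from memory, so double-check the exact signature against the templates; the tiny request and the chunking choice below are only for illustration):

import xarray as xr

from c3s_eqc_automatic_quality_control import download

collection_id = "reanalysis-era5-pressure-levels-monthly-means"
request = {
    "format": "netcdf",
    "product_type": "monthly_averaged_reanalysis",
    "variable": "temperature",
    "pressure_level": ["500"],
    "year": ["2021", "2022"],
    "month": [f"{month:02d}" for month in range(1, 13)],
    "time": "00:00",
}

def reduce_chunk(ds: xr.Dataset) -> xr.Dataset:
    # Data reduction applied to each chunk before caching:
    # only the (much smaller) reduced result is written to disk.
    return ds.mean(("longitude", "latitude"))

# chunks={"year": 1} turns the request into one CDS request per year;
# each chunk is downloaded, reduced and cached separately, and the result
# comes back as a dask-backed xarray object.
ds = download.download_and_transform(
    collection_id,
    request,
    chunks={"year": 1},
    transform_func=reduce_chunk,
)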

sandrocalmanti commented 1 year ago

Great, thank you Mattia,

is there any example I can use to understand how download_and_transform works, or could you describe how it would be applied to this simple kernel so that it works for large datasets as it does for small ones?

import numpy as np
import xarray as xr

ds = xr.open_dataset('./DATA/ERA5_ta_plev_monthly_1973-1999.nc')
t = ds['t']

# Weight temperature values by latitude before averaging
weights = np.cos(np.deg2rad(ds.latitude))
t_weighted = t.weighted(weights)

# Compute monthly global average
t_ave = t_weighted.mean(["longitude", "latitude"]).transpose("level", "time")
# Compute climatology
t_ave_time = t_ave.mean(["time"])
# Compute anomalies
t_ave_anom = t_ave - t_ave_time
malmans2 commented 1 year ago

You can combine download.download_and_transform and diagnostics.spatial_weighted_mean. They're both used in many templates, and climatologies are produced in a few WP4 templates.

I can do a template specific for your use case. Is this the full dataset you need?

    'reanalysis-era5-pressure-levels-monthly-means',
    {
        'format': 'netcdf',
        'product_type': 'monthly_averaged_reanalysis',
        'variable': 'temperature',
        'pressure_level': [
            '1', '5', '20',
            '70', '150', '225',
            '350', '500', '650',
            '775', '850', '925',
            '1000',
        ],
        'year': [
            '1940', '1941', '1942',
            '1943', '1944', '1945',
            '1946', '1947', '1948',
            '1949', '1950', '1951',
            '1952', '1953', '1954',
            '1955', '1956', '1957',
            '1958', '1959', '1960',
            '1961', '1962', '1963',
            '1964', '1965', '1966',
            '1967', '1968', '1969',
            '1970', '1971', '1972',
            '1973', '1974', '1975',
            '1976', '1977', '1978',
            '1979', '1980', '1981',
            '1982', '1983', '1984',
            '1985', '1986', '1987',
            '1988', '1989', '1990',
            '1991', '1992', '1993',
            '1994', '1995', '1996',
            '1997', '1998', '1999',
            '2000', '2001', '2002',
            '2003', '2004', '2005',
            '2006', '2007', '2008',
            '2009', '2010', '2011',
            '2012', '2013', '2014',
            '2015', '2016', '2017',
            '2018', '2019', '2020',
            '2021', '2022',
        ],
        'month': [
            '01', '02', '03',
            '04', '05', '06',
            '07', '08', '09',
            '10', '11', '12',
        ],
        'time': '00:00',
        'area': [90, -180, -90, 180],
    },
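
If so, the building blocks above combine roughly like this (a sketch only, not the final template; the output variable name "t" is an assumption):

from c3s_eqc_automatic_quality_control import diagnostics, download

collection_id = "reanalysis-era5-pressure-levels-monthly-means"
request = {
    "format": "netcdf",
    "product_type": "monthly_averaged_reanalysis",
    "variable": "temperature",
    "pressure_level": [
        "1", "5", "20", "70", "150", "225", "350",
        "500", "650", "775", "850", "925", "1000",
    ],
    "year": [str(year) for year in range(1940, 2023)],
    "month": [f"{month:02d}" for month in range(1, 13)],
    "time": "00:00",
    "area": [90, -180, -90, 180],
}

# One CDS request per year; each yearly chunk is reduced to its spatial
# weighted mean and cached, so only small vertical profiles are stored.
ds = download.download_and_transform(
    collection_id,
    request,
    chunks={"year": 1},
    transform_func=diagnostics.spatial_weighted_mean,
)

# Climatology and anomalies, as in the original kernel
# ("t" is the expected temperature variable name in the output).
t_ave = ds["t"].transpose("level", "time")
t_ave_anom = t_ave - t_ave.mean("time")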
sandrocalmanti commented 1 year ago

Thank you Mattia,

you're right about the templates, I'll have a look at them. Meanwhile, yes, that's the full dataset I need.

S.

malmans2 commented 1 year ago

OK, I'll send you the link when it's ready. In the meantime, this is pretty much what you are looking for: https://github.com/bopen/c3s-eqc-toolbox-template/blob/main/notebooks/renalysis/02-Application_Template_Global_Timeseries_Pressure_Levels.ipynb

malmans2 commented 1 year ago

Hi @sandrocalmanti,

Your template is ready and available here. The template is a small example (2022-present); you just have to change the variable start to produce a longer timeseries. The first time you run it, it will take some time to download all the data, but the spatially weighted fields are then cached and you can focus on the analysis.

The template produces this figure (this period is already cached on WP5): [figure]

sandrocalmanti commented 1 year ago

Thank you Mattia, looks great.

I'll try later.

sandrocalmanti commented 1 year ago

my_ipynb.zip

Dear @malmans2

I have updated your template with two edits (see the attached notebook above), in case you want to update this in the WP5 templates.

Cheers

S.

malmans2 commented 1 year ago

Great! Looks like caching is also working pretty well; I've been able to play with your notebook on the VM.

I've updated the template. I only changed a few things in the last cell to make the Hovmöller diagram with xarray (there are a couple of arguments you could find useful, such as robust=True).
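
The idea is roughly the following (not the exact cell; it assumes the t_ave_anom DataArray computed earlier in this thread, with "time" and "level" dimensions):

import matplotlib.pyplot as plt

# Hovmöller diagram of the anomaly: time on the x axis, pressure level on the y axis.
# robust=True sets the colour limits from the 2nd and 98th percentiles, so a few
# extreme values do not wash out the rest of the field; yincrease=False puts the
# surface (1000 hPa) at the bottom of the plot.
t_ave_anom.plot(x="time", y="level", yincrease=False, robust=True)
plt.show()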

I'm closing this, but feel free to re-open in the future!