Open almarbo opened 3 weeks ago
Hi @almarbo,
It looks like you haven't been using the VM. More importantly, you're running an outdated version of the EQC software. The latest version now relies on earthkit-data, the official ECMWF library, whereas you're still using emohawk, which has been deprecated by ECMWF.
When we migrated to the new VM a few months ago, we made sure that all existing notebooks were re-cached using earthkit. I recommend either updating your environment and clearing the cache, or switching to the VM.
I glanced through the notebook, and the overall structure looks good as it follows the other templates. Is there anything specific you’d like me to review? Do you want me to cache the notebook on the VM?
Hi @malmans2,
Thank you very much for the quick response. I have been working on our own machine because we may want to experiment with extended regions or periods in the future, which could require additional resources. I appreciate the tips and will update my environment accordingly. Another reason I haven't been using the VM is that I have been ill, and I believe I can only access the VM from my office computer (the one from which I shared the public SSH key).
It would be great if you could cache the notebook on the VM. Additionally, could you review the caching functions I am using for calculating the rolling variances (.rolling)? I would also appreciate any insights on the "time" handling within those functions. I have a feeling I might not be working as efficiently as possible.
It would also be nice if you could review the time series (last section), since it is quite new.
OK, I started downloading data on the VM. I'll look into this tomorrow if we have enough data.
I would also appreciate any insights on the "time" handling within those functions.
Let's start with this: is your goal to use 360-day calendars? If yes, we can just use this function: https://docs.xarray.dev/en/stable/generated/xarray.DataArray.convert_calendar.html
Let me know the parameters you'd like to use (i.e., calendar and align_on).
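For reference, here is a minimal sketch of that conversion, using the same tutorial dataset as the snippets below (the calendar and align_on values are just an assumption until you confirm them):
import xarray as xr

ds = xr.tutorial.open_dataset("air_temperature")
# Convert everything to a 360_day calendar up front so all models share the same time axis
ds_360 = ds.convert_calendar("360_day", align_on="year", use_cftime=True)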
My goal is to calculate the window width for each model, as each has a different calendar length. Once I calculate the rolling variances, I select the last day of each year (season), since I only need one value per year (in other words, the step between one rolling-variance value and the next is one year; however, I couldn't find a step parameter to control this directly). For models that follow a 360-day calendar, this day falls on 30 August, while for the others it's 31 August. This creates an issue when concatenating, as it leads to twice as many time steps as I would like. That's why, at the end, I change 31 August to 30 August to account for this.
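To illustrate the mismatch with a toy example (just a sketch, not the notebook's code): building a JJA time axis in a 360-day and in a standard calendar shows that the last day of the season differs, which is what duplicates the time steps when concatenating:
import xarray as xr

# JJA has 90 days in a 360-day calendar and 92 days in a standard calendar
time_360 = xr.cftime_range("2000-06-01", periods=90, freq="D", calendar="360_day")
time_std = xr.cftime_range("2000-06-01", periods=92, freq="D", calendar="standard")
print(time_360[-1])  # 2000-08-30 in the 360_day calendar
print(time_std[-1])  # 2000-08-31 in the standard calendar
# Converting all models to a common calendar (or renaming 31 August to 30 August,
# as I do at the end) keeps the concatenated yearly time axis aligned.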
OK, let's chat about it later.
I don't fully understand why you can change the calendar in various places (e.g., remove Feb 29th or convert to proleptic_gregorian), but you can't change it at the very beginning so everything is aligned.
@almarbo, this is the snippet I showed you during the meeting:
import xarray as xr
ds = xr.tutorial.open_dataset("air_temperature")
ds_var = (
    ds.rolling(time=365 * 4)
    .construct("window_dim")
    .groupby("time.year")
    .last()
    .var("window_dim")
)
Is that the operation you need to perform?
And here is another test:
import xarray as xr
ds = xr.tutorial.open_dataset("air_temperature")
ds_var_mattia = (
    ds.rolling(time=365 * 4)
    .construct("window_dim")
    .groupby("time.year")
    .last()
    .var("window_dim", keep_attrs=True)
)
ds_var_albert = (
    ds.rolling(time=365 * 4)
    .var()
    .groupby("time.year")
    .last()
)
xr.testing.assert_allclose(ds_var_mattia, ds_var_albert)
Hi @malmans2,
I haven’t had time to test this yet but will do so ASAP.
For now, I’ve improved and updated everything and incorporated the adjustments we discussed earlier (time issues, unit conversions, reducing lines using loops, etc.).
I’ve also included the coefficient of variation calculation: the square root of the variance divided by the mean (always using moving windows):
## Apply rolling variance over the calculated window size and calculate the coefficient of variation
rolling_var = ds.rolling(time=days_in_window, min_periods=days_in_window, center=False).var()
rolling_mean = ds.rolling(time=days_in_window, min_periods=days_in_window, center=False).mean()
rolling_CV = (rolling_var ** (1 / 2)) / rolling_mean
Thanks again, and enjoy the long weekend!
For now, I’ve improved and updated everything and incorporated the adjustments we discussed earlier (time issues, unit conversions, reducing lines using loops, etc.).
Did you send me the new version?
Oh, sorry, I didn't realise that I hadn't updated the notebook here. I have now updated it; thanks for noticing.
Hi @almarbo,
After revising your functions, I was able to run the notebook on the VM successfully.
Your version significantly increases memory usage and the kernel dies, so I can't really test it. For example, I noticed that the last plot in your notebook starts from 2005 rather than 2004.
Here are the functions I changed:
def compute_rolling_variance(ds, window_years, timeseries, resample_reduction):
    (da,) = ds.data_vars.values()
    # Window length in days (360-day years for the annual series, 90-day seasons otherwise)
    days_in_window = (360 if timeseries == "annual" else 90) * window_years
    # Last timestamp of each year, used to relabel the yearly output
    time = ds["time"].groupby("time.year").last()
    grouped = (
        da.rolling(time=days_in_window)
        .construct("window_dim")
        .groupby("time.year")
        .last()
        .assign_coords(time=time)
        .swap_dims(year="time")
        .drop_vars("year")
    )
    var = grouped.var("window_dim", skipna=False)
    coeff = (var ** (1 / 2)) / grouped.mean("window_dim")
    # Convert units of the variance only; the coefficient of variation is dimensionless
    var *= (1000 if resample_reduction else 86400) ** 2
    ds = xr.merge([var, coeff.rename("coefficient_of_variation")])
    return ds.expand_dims(variance=[window_years])
def compute_variance(
    ds,
    timeseries,
    year_start,
    year_stop,
    variance_values,
    resample_reduction=None,
    request_grid_out=None,
    **regrid_kwargs,
):
    # Apply resample reduction if specified
    if resample_reduction:
        resampled = ds.resample(time="1D")
        ds = getattr(resampled, resample_reduction)(keep_attrs=True)
    # Select the time series
    ds = select_timeseries(ds, timeseries, year_start, year_stop)
    ds = ds.convert_calendar("360_day", align_on="year", use_cftime=True)
    # Compute variances
    datasets = [
        compute_rolling_variance(
            ds,
            window_years=variance,
            timeseries=timeseries,
            resample_reduction=resample_reduction,
        )
        for variance in variance_values
    ]
    ds = xr.concat(datasets, "variance")
    # Handle grid output if requested
    if request_grid_out:
        grid_out = get_grid_out(request_grid_out, method=regrid_kwargs["method"])
        ds = diagnostics.regrid(ds, grid_out=grid_out, **regrid_kwargs)
    return ds.convert_calendar("proleptic_gregorian", align_on="date", use_cftime=False)
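For reference, here is a hedged usage sketch (the argument values are illustrative examples, not necessarily the cached settings):
ds_var = compute_variance(
    ds,  # daily precipitation dataset of a single model
    timeseries="JJA",  # assumed label; whatever select_timeseries expects
    year_start=2004,
    year_stop=2014,
    variance_values=[2, 3],  # rolling-window lengths in years
    resample_reduction=None,  # e.g. "mean" to resample sub-daily data to daily first
)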
Here is the notebook executed: https://gist.github.com/malmans2/22f2c270b226053e2281270f14b7581e
There are a couple of things I don't fully understand (see the questions below), but I only focused on optimising the new code.
Could you please test it? If everything works, let me know once you've finalised the notebook if you need me to cache the whole timeseries on the VM.
Hi @malmans2,
Since I’m using 2- and 3-year windows, variance calculation requires at least 2 years for the 2-year window and 3 years for the 3-year window. In 2004, there’s only one year of data available, so the 2-year variance can’t be computed. Similarly, in 2005, there are only two years, which is insufficient for the 3-year variance. Consequently, the first plot (Fig. 7a) starts in 2005, while Fig. 7b begins in 2006.
For example, I noticed that the last plot in your notebook starts from 2005 rather than 2004.
The initial idea was to allow either conservative or bilinear interpolation, so I tried parametrising the interpolation method. However, when using the parametrised approach, I encountered issues with some models in model_datasets returning NaNs. Strangely, without parametrising, this problem disappeared. I haven't identified why this happens yet, as I intended to test it further, but I forgot to mention this issue to you.
Why do you parametrise the interpolation method but then hardcode it to bilinear in the code?
The variance has units, while the coefficient of variation does not, as it’s a dimensionless ratio (the square root of the variance divided by the mean). When calculating the coefficient of variation, units cancel out naturally, so no further conversion is needed. This is why only the units for variance are adjusted, not for the coefficient of variation.
Why do you convert the variance units without also adjusting the coefficients accordingly?
For example, I noticed that the last plot in your notebook starts from 2005 rather than 2004.
OK, I see. Because we use construct now, you might have to explicitly remove those years. Can you implement it or do you need me to look into it?
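If it helps, one possible way to do it (just a sketch; ds_var and window_years are assumed names here, not necessarily what the notebook uses):
# Keep only the years for which the rolling window can be fully populated
first_valid_year = int(ds_var["time"].dt.year.min()) + window_years - 1
ds_var = ds_var.where(ds_var["time"].dt.year >= first_valid_year, drop=True)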
Why do you parametrise the interpolation method but then hardcode it to bilinear in the code?
I see, it's just std/mean.
Why do you parametrise the interpolation method but then hardcode it to bilinear in the code?
I tried the parameterised interpolation method with bilinear yesterday and didn’t encounter any issues. So, the problem might only occur with the conservative interpolation. What you’re doing now isn’t ideal: you’re caching a function with the conservative interpolation method, but then using bilinear. Please try again, and let me know if you experience any issues.
Actually, the first point might be quite easy. There are no NaNs in the original dataset, right? If that's the case, we can just do this:
var = grouped.var("window_dim", skipna=False)
Indeed, that works as expected. Here is the latest version: https://gist.github.com/malmans2/22f2c270b226053e2281270f14b7581e
I also restored the parametrisation of the interpolation, it works fine with bilinear.
I just tried conservative and I don't see any issue, but maybe I'm overlooking something. I think you have everything you need to finalise your notebook.
Hi @malmans2,
I've been testing, and some aspects of the output seem strange. I’m seeing NaNs in certain interpolated datasets (but not for all models or for every latitude and longitude). The number of models displaying NaNs also depends on the interpolation method.
Here’s an example of what I mean:
With bilinear interpolation:
ds_interpolated.isel(variance=0,latitude=3,longitude=3).pr.values
array([[ nan, nan, nan, nan,
nan, nan, nan, nan,
nan, nan, nan],
[ nan, 1.48577905e+00, 1.36083412e+00, 1.08807281e-01,
3.28417659e-01, 3.76656085e-01, 3.55924129e+00, 3.81952238e+00,
9.70415711e-01, 1.00730753e+00, 4.21003699e-01],
[ nan, nan, nan, nan,
nan, nan, nan, nan,
nan, nan, nan],
[ nan, 2.50745702e+00, 3.43969371e-03, 1.58406961e+00,
1.78741062e+00, 8.41390550e-01, 8.22588563e-01, 1.99909782e+00,
3.04484844e-01, 2.79019605e-02, 3.49222729e-03],
[ nan, nan, nan, nan,
nan, nan, nan, nan,
nan, nan, nan],
[ nan, 5.02978325e-01, 5.02960443e-01, 6.16614008e-04,
6.65174040e-04, 1.43549478e-04, 2.03579810e-04, 6.53892348e-05,
2.39252537e-01, 2.39732191e-01, 8.81395754e-05],
[ nan, 2.79195714e+00, 1.11923683e+00, 1.87317148e-01,
2.73526609e-01, 1.34938931e+00, 5.44377714e-02, 2.19866753e-01,
1.14071131e+00, 1.13717651e+00, 5.12892865e-02],
[ nan, nan, nan, nan,
nan, nan, nan, nan,
nan, nan, nan],
[ nan, nan, nan, nan,
nan, nan, nan, nan,
nan, nan, nan],
[ nan, nan, nan, nan,
nan, nan, nan, nan,
nan, nan, nan],
[ nan, nan, nan, nan,
nan, nan, nan, nan,
nan, nan, nan],
[ nan, nan, nan, nan,
nan, nan, nan, nan,
nan, nan, nan],
[ nan, 6.49854660e-01, 5.87625384e-01, 2.75233746e-01,
7.17305779e-01, 1.70369816e+00, 1.79918849e+00, 1.23120618e+00,
1.07741749e+00, 2.40537190e+00, 5.01404762e-01],
[ nan, nan, nan, nan,
nan, nan, nan, nan,
nan, nan, nan],
[ nan, nan, nan, nan,
nan, nan, nan, nan,
nan, nan, nan],
[ nan, nan, nan, nan,
nan, nan, nan, nan,
nan, nan, nan]])
With conservative interpolation:
ds_interpolated.isel(variance=0,latitude=3,longitude=3).pr.values
array([[ nan, 1.30008769e+00, 2.57578313e-01, 3.82155299e-01,
2.74047434e-01, 9.88518912e-03, 2.64352083e+00, 2.63979268e+00,
3.79363485e-02, 1.05725668e-01, 2.44640529e-01],
[ nan, 5.43833733e-01, 5.25143802e-01, 1.04788408e-01,
1.47605419e-01, 1.53546199e-01, 4.65563631e+00, 4.78511286e+00,
2.69868761e-01, 5.15363932e-01, 4.58529919e-01],
[ nan, 1.87336933e-02, 7.21187368e-02, 6.20284975e-01,
6.13317847e-01, 1.31843731e-01, 3.85855651e-03, 6.07993156e-02,
7.67587423e-02, 2.60598026e-02, 1.93415617e-03],
[ nan, 2.31074333e+00, 2.25445884e-03, 1.53524625e+00,
1.72679913e+00, 8.12479436e-01, 8.24728251e-01, 1.91649163e+00,
2.16389090e-01, 1.92284733e-02, 2.63395743e-03],
[ nan, 1.28151095e-02, 1.27671622e-02, 3.14679360e+00,
3.28719354e+00, 1.87367558e-01, 3.00795659e-02, 3.14509292e-04,
1.11408126e+00, 6.10319555e-01, 1.68544865e+01],
[ nan, 4.62359756e-01, 4.62344140e-01, 1.95181783e-04,
2.19962385e-04, 5.83723631e-05, 1.93975793e-04, 8.65517577e-05,
1.62397370e-01, 1.62802055e-01, 8.52702215e-05],
[ nan, 2.34785438e+00, 8.20793211e-01, 1.86102048e-01,
2.55484730e-01, 1.25649166e+00, 5.41776232e-02, 1.69797778e-01,
8.91150653e-01, 8.61929059e-01, 2.73648407e-02],
[ nan, 1.86399981e-01, 2.32090592e-01, 3.38725716e-01,
2.93454528e-01, 1.95196182e-01, 1.71690449e-01, 8.24425891e-02,
2.87002828e-02, 3.17514271e-01, 3.88155967e-01],
[ nan, 4.28708017e-01, 1.84207693e-01, 2.20691875e-01,
2.77562469e-01, 3.49615157e-01, 3.00025254e-01, 1.49142861e-01,
5.55958934e-02, 3.77535187e-02, 5.44080853e-01],
[ nan, 6.51404258e-05, 1.06267067e-05, 1.08483185e-01,
1.31084071e-01, 6.12673530e-03, 2.86407169e-05, 3.47949835e-05,
1.08073877e-04, 3.51361283e-04, 5.34706576e-04],
[ nan, 2.60324210e-01, 7.60723501e-02, 1.89131409e-01,
9.84028876e-02, 1.49680808e-01, 1.69381708e-01, 5.09373099e-02,
2.79927440e-02, 1.22439079e-01, 1.56387925e-01],
[ nan, nan, nan, nan,
nan, nan, nan, nan,
nan, nan, nan],
[ nan, 5.58600128e-01, 5.56065738e-01, 3.46008599e-01,
3.79052073e-01, 7.08569229e-01, 3.79988253e-01, 3.75964731e-01,
2.58975639e-03, 1.50425577e+00, 3.48897845e-01],
[ nan, 3.52193737e+00, 2.40626359e+00, 1.34040967e-01,
1.32518172e-01, 1.39978873e-02, 1.44275429e-03, 7.04050530e-04,
4.87617683e-04, 1.32263839e+00, 1.32290649e+00],
[ nan, 2.02681409e-03, 2.03517085e-03, 2.01257476e-04,
2.83534233e-04, 6.94739702e-01, 1.70598502e+00, 2.76057728e-02,
1.83888156e-02, 3.33575320e-02, 8.18527854e-02],
[ nan, 2.64513004e-03, 1.00604992e-03, 2.63605452e-05,
5.96900424e-03, 5.96777303e-03, 2.99748429e-03, 8.16623196e-02,
7.91179016e-02, 8.49343371e-03, 9.30850208e-03]])
I have not found NaNs in the non-interpolated data, so that makes me think it may have to do with the interpolation?
Hi @almarbo,
I believe we explored this issue when working with the previous templates. The NaNs likely occur because some of the cutouts are smaller than the grid you're interpolating to. To avoid this, you can add a buffer around the cutout area or adjust the interpolation settings (e.g., extend the cutout area by 1 degree in each direction, or remove 1 degree after interpolating).
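For the second option, a rough sketch (assuming ascending latitude/longitude coordinates on ds_interpolated):
# Trim a 1-degree margin after regridding so edge cells without enough source neighbours are excluded
lat_min = float(ds_interpolated["latitude"].min())
lat_max = float(ds_interpolated["latitude"].max())
lon_min = float(ds_interpolated["longitude"].min())
lon_max = float(ds_interpolated["longitude"].max())
ds_trimmed = ds_interpolated.sel(
    latitude=slice(lat_min + 1, lat_max - 1),
    longitude=slice(lon_min + 1, lon_max - 1),
)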
Try this, which I think makes it clear what's going on:
ds_interpolated["pr"].isel(variance=0).isnull().sum("time").plot(col="model", col_wrap=4)
This also explains why different interpolations result in different NaNs, as each interpolation method has its own requirements regarding neighbouring points.
Also, in case it's not clear, the sum of NaNs is not zero because we did not drop those years where we don't have enough data for the rolling reduction. If you need to drop them, this is the syntax: .dropna("time", how="all")
Thanks @malmans2,
I totally forgot about this. In fact, when we plot, we set the extent so that we don't show the edges.
I will continue working on our local machine until we receive the second round of feedback. The methods outlined here should, in principle, be final, but as we discussed in our meeting, feedback can sometimes depend on the results and may require slight adjustments to the methodology, period, area, or variance windows to better address the user question. I will let you know in case we need further help. Thanks a lot!
Notebook description
This notebook downloads daily precipitation data from CMIP6 and calculates rolling variances over 10- and 30-year windows for the historical period from 1940 to 2014, using a JJA time aggregation. It displays the spatial patterns of the average 10- and 30-year variances globally (or over India if processing is too intensive). Finally, it presents a time series of the 10- and 30-year variances for spatially averaged values over India. The notebook's structure closely follows that of this assessment, though simplified, as it only includes CMIP6 precipitation data without index calculations, just rolling variances over 10- and 30-year windows.
I have prepared an example to enable us to iterate and refine it before posting the issue on the ECMWF GitHub. This example covers the period from 2004 to 2014, with maps displayed for the region of India (ideally, the final version will be global if feasible). The rolling variance windows are set to 2 and 3 years, given that the period considered in this example is only 10 years. Please let me know if anything is unclear, or feel free to arrange a meeting if needed.
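Regarding the spatially averaged time series mentioned above, here is a hedged sketch of one common way to do the averaging (variable names are assumptions, and whether area weighting is applied in the notebook is not confirmed):
import numpy as np

# Weight by the cosine of latitude before averaging over the region
weights = np.cos(np.deg2rad(ds_var["latitude"]))
ts = ds_var.weighted(weights).mean(("latitude", "longitude"))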
Notebook link or upload
example_cmip6_pr_variability.zip
Anything else we need to know?
No response
Environment