Hi @sandrocalmanti,
I glanced through the description, very clear! I have a few doubts here and there, but I think I got the overall goal.
As I have time to dedicate to this project this week, we could go the other way around if that works with you. In other words, I could implement each step one by one, checking with you regularly (i.e., as soon as I'm done implementing a step, I'll share here the implementation and I'll update the template). If we go this way, I think we can start using real data right away (although we are going to start with small samples).
If that works with you, I'll start implementing step 1. Otherwise, no problem at all, I will review your notebook as usual (and I'll think about how to mock point 7)
In the mean time, a couple of questions for ERA5:
Assuming temperature as the variable of interest, the computation of hit-rates entails the following steps:
- Compute mean monthly temperature anomalies $t_{ERA5}(a,m,y)$ from the reanalysis over SREX areas, where $m=1,\dots,12$ (month) and $y=1996,\dots,2013$ (year in hindcast period), and $a=1,\dots,8$ indicates the SREX area,
How is the anomaly defined? I guess the anomaly relative to the global mean? If yes, an unweighted global mean? So something like this?
ds - ds.mean(("latitude", "longitude"))
- For each month, compute the values associated with the 33rd and 66th percentiles, respectively $P33_{ERA5}(m)$ and $P66_{ERA5}(m)$, of the corresponding monthly distribution. Each area $a$ is characterized by a different $P33$ and $P66$
For each month, and for each area as well? I.e., should it be $P\{33,66\}_{ERA5}(a,m)$?
- Assign each month in the record to the corresponding tercile by comparing each value with the corresponding $P33_{ERA5}(m)$ and $P66_{ERA5}(m)$. We obtain, for each area $a$, the corresponding array of tercile occupation $TR_{ERA5}(a,m,y)$.
OK
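Something like this, I guess (a toy sketch with random data; the -1/0/+1 coding of the terciles is just my assumption for now):
import numpy as np
import xarray as xr

# Toy monthly anomalies with dims (year, month), standing in for one SREX area
anom = xr.DataArray(
    np.random.randn(18, 12),
    dims=("year", "month"),
    coords={"year": range(1996, 2014), "month": range(1, 13)},
)
# Per-month tercile thresholds (P33 and P66)
quantiles = anom.quantile([1 / 3, 2 / 3], dim="year")
p33 = quantiles.sel(quantile=1 / 3, drop=True)
p66 = quantiles.sel(quantile=2 / 3, drop=True)
# Tercile occupation: -1 = below normal, 0 = normal, +1 = above normal
tercile_occupation = xr.where(anom < p33, -1, xr.where(anom > p66, 1, 0))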
Hi @sandrocalmanti,
I had some spare time this morning so I tried an implementation. Here it is: https://gist.github.com/malmans2/a789959236e0bbb1783800f47e21a90a
The function only covers ERA5 right now, it will probably change a little to make it more general for seasonal forecast.
Ciao @malmans2,
first: good for me. I'm also dedicating this week to the notebook. Fine to work in close iteration.
On the anomalies of ERA5: the anomalies would be computed with respect to the local climatology, i.e. time mean,
ds - ds.mean(("time"))
Then, the anomaly is averaged over the SREX area, so there is a time series of monthly anomalies averaged over each SREX area.
On the percentiles: yes sorry, my mistake, it's $P\{33,66\}_{ERA5}(a,m)$.
I'll check your code.
On the anomalies of ERA5: the anomalies would be computed with respect to the local climatology, i.e. time mean,
Ohh ok, I need to change that in the notebook I just sent you. This means that the climatology is affected by the time period of the analysis. I don't fully understand how this works for the seasonal forecast anomalies in the CDS. Do they re-compute the anomalies whenever new data is available? If yes, we might have to think carefully about how to handle it. For example, we would have to invalidate the cache frequently, and it might be easier to compute anomalies ourselves
ds - ds.mean(("time"))
As we are working with monthly data, do you want to weight the time mean (i.e., account for the number of days in each month)?
Yes, I would assume that the time mean accounts for the number of days.
I updated the notebook. The climatology is time weighted, whereas the spatial mean is NOT weighted (as we did for the other wp3 notebook).
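For reference, the time weighting is basically this (toy data, just to show what "time weighted" means here):
import numpy as np
import pandas as pd
import xarray as xr

# Toy monthly series standing in for the real data
time = pd.date_range("1996-01-01", "2013-12-01", freq="MS")
ds = xr.Dataset({"t2m": ("time", np.random.randn(time.size))}, coords={"time": time})
# Weight the time mean by the number of days in each month
weights = ds["time"].dt.days_in_month
climatology = ds.weighted(weights).mean("time")
anomaly = ds - climatology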
Thank you @malmans2
In reviewing the other notebooks I have introduced weighting in the spatial mean. Numerically it doesn't make a big difference because the sub-regions are small. However, it is more correct to apply spatial weighting.
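To be explicit, by spatial weighting I mean the usual cos(latitude) weights, e.g. (toy grid, just to illustrate):
import numpy as np
import xarray as xr

# Toy field on a small lat/lon grid, standing in for the masked SREX-area data
ds = xr.Dataset(
    {"t2m": (("latitude", "longitude"), np.random.randn(5, 5))},
    coords={"latitude": np.linspace(30, 50, 5), "longitude": np.linspace(0, 20, 5)},
)
# cos(latitude) weights for the spatial mean
weights = np.cos(np.deg2rad(ds["latitude"]))
spatial_mean = ds.weighted(weights).mean(("latitude", "longitude"))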
Reviewed the last version and the output looks good. More recent years occupy mostly the tercile "above normal", as it should be.
Quick questions, for me to better understand the code.
What is the purpose of chunk(year=-1) here?
quantiles = ds.chunk(year=-1).quantile([1 / 3, 2 / 3], "year")
S.
Dask, which is used under the hood to scale the computations, does not allow computing quantiles along chunked dimensions. Try this, and you'll get a very informative error:
import xarray as xr
import numpy as np
da = xr.DataArray(np.random.randn(10, 10)).chunk(1)
da.quantile(1/2, "dim_0") # ValueError
You can make it work removing the chunks along the dim_0 dimension:
da.chunk(dim_0=-1).quantile(1/2, "dim_0")
On the anomalies of ERA5: the anomalies would be computed with respect to the local climatology, i.e. time mean,
Ohh ok, I need to change that in the notebook I just sent you. This means that the climatology is affected by the time period of the analysis. I don't fully understand how this works for the seasonal forecast anomalies in the CDS. Do they re-compute the anomalies whenever new data is available? If yes, we might have to think carefully about how to handle it. For example, we would have to invalidate the cache frequently, and it might be easier to compute anomalies ourselves
I've re-checked the data and realized we need to compute the anomalies ourselves.
Just forgot that anomalies are available only for the "forecast" period, i.e. 2017 onwards. Instead, this analysis would cover the 'hindcast' period from 1993 to 2016.
We need to use the same data as in the bias analysis (monthly) means, and we need to compute the anomalies with like:
ds - ds.mean(("time", "realization"))
OK. I'm on it.
The step on seasonal forecast is a little tricky. We are using gribs, and it looks like there's a memory leak.
We did not experience the issue before because we were using transform_chunks=True (it performs the computation on each chunk and saves the result as netcdf).
We can't do it here because we need the whole timeseries for the anomalies.
I'll work on it in the afternoon. Worst case scenario, we need to convert the grib files to netcdf or download netcdfs.
Now that I see some results, I'm thinking it might be necessary to remove the linear trend before computing the anomalies. Otherwise the hit-rate is inflated by the fact that the forecast captures the trend, which is only marginally useful.
I will check in more detail how people handle this in the literature, but let's first continue the procedure until the end. For end users it might even be useful to have both pieces of information: what is the hit-rate in a climate change perspective (i.e. retaining the trend) and what is the hit-rate with respect to the anomalies that occurred in recent years (i.e. removing the trend).
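For when we get there, the kind of detrending I have in mind would look roughly like this in xarray (a sketch only, on toy data; we can refine the details later):
import numpy as np
import pandas as pd
import xarray as xr

# Toy yearly series with an artificial linear trend, standing in for the area-averaged anomalies
time = pd.date_range("1996-01-01", periods=18, freq="YS")
da = xr.DataArray(np.random.randn(18) + 0.05 * np.arange(18), coords={"time": time}, dims="time")
# Fit a degree-1 polynomial along time and subtract it
coeffs = da.polyfit("time", deg=1)
trend = xr.polyval(da["time"], coeffs.polyfit_coefficients)
detrended = da - trend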
Looking at the function compute_tercile_occupation(ds, region)
with the idea in mind of removing the trend, I notice that it might be wise to invert the two steps. Instead of:
we can do
It's all linear so they can be swapped. If we introduce the trend removal it would be
This would save a lot of trend computation.
I guess we may at some point introduce a keyword compute_tercile_occupation(ds, region, remove_trend=False) for removing the trend.
As said, I would get to the end of the process first, and then we take care of the trend.
Sounds good, and good point about masking first (I forgot to do that when I realised we need to subtract the mean over time). Unfortunately masking first is not going to solve the memory issue, I'm now trying to download seasonal forecast in netcdf format.
Good news, I think I found a pretty good way to make the conversion.
Question about point 4:
- Repeat step 1 for each seasonal forecast model where $t_{SF}(c,r,a,rm,m,y)$ similar to point 1, but in this case the temperature anomaly depends on the originating centre $c$, on the realization $r$ (ensemble member), and on the reference month $rm$ in which the forecast is initialized. For this step we use the pre-computed monthly anomalies from the CDS.
The original field that we get has dimension "starting_time" rather than "reference_month". Would you like to generate the forecast_month dimension taking the mean? I.e., something like this:
ds = ds.groupby("starting_time.month").mean()
ds = ds.rename(month="reference_month")
One question about the climatology as well:
We need to use the same data as in the bias analysis (monthly) means, and we need to compute the anomalies with like:
ds - ds.mean(("time", "realization"))
For seasonal forecast, ds also has the reference_month dimension. Would you like to take the average along that dimension as well? I.e., pick between:
ds - ds.mean(("time", "realization"))
ds - ds.mean(("time", "realization", "reference_month"))
Hi @malmans2,
exactly, we don't need to average over reference time. Let's try to recap what happens with the time dimension to make sure the workflow is clear.
We have three interlinked parameters dealing with time:
For seasonal forecasts, the climatology is a function of both the starting_time and the valid_time (or, equivalently, starting_time and lead_time). For example, all the forecasts issued in January would produce 6 climatologies:
The climatology would normally be an array of length=12. For seasonal forecasts it will be an array of dim=(12,6), i.e. 12 starting_times, each with 6 associated lead_times. However, for the visualization strategy that we use at the end of the process (see the bias assessment), we may want to use arrays of dim=(12,12) where some of the elements are just NaNs.
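In other words, the valid month attached to each (starting month, lead) pair comes from a simple shift, something like this sketch (here lead=1 means the starting month itself):
import numpy as np
import xarray as xr

# Valid month for each (starting month, lead) pair; lead=1 is the starting month itself
starting_month = xr.DataArray(np.arange(1, 13), dims="starting_month")
lead = xr.DataArray(np.arange(1, 7), dims="lead")
valid_month = (starting_month + lead - 2) % 12 + 1  # broadcasts to shape (12, 6)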
===
Accordingly, each forecast produces 6x25 (25 realizations) different anomalies. For example, consider the forecast issued in January 2016; it produces:
The array of anomalies would have dimension (12,6,ny,25): 12 starting_times, 6 lead_times, ny years in the hindcast period and 25 members.
===
Similarly for the tercile occupation which has dim=(12,6,ny) after we use the 25 members to extract the most populated tercile. As described for the climatology, we may want to use arrays of dim=(12,12,ny) where some of the elements are just nans.
This last tercile occupation array from seasonal forecasts should be compared with the corresponding ERA5 where the time dimension of ERA5 must be aligned with the corresponding valid_time = starting_time + lead_time in the forecast.
After counting the number of years in which there is a match between ERA5 and SF, and dividing by ny, we get the array of hit-rates of dim=(12,12): 12 starting months x 12 valid months, of which some will be void. Hit-rates are visualized with the same method as the bias assessment.
Hope this helps.
Hi @sandrocalmanti,
I need to leave now, I'll read and go through your last comment tomorrow.
I made good progress though. I found what was causing memory issues (we need to deal with time dimensions ourselves rather than relying on cfgrib's verifying_time).
Check out the latest version of the notebook (although I need to take a good look at it with fresh eyes).
Ciao @malmans2,
I'm reading through the code and will ask questions as they come along.
What happens here?
for shift in set(ds["leadtime_month"].values):
    shifted = ds.indexes["valid_time"].shift(shift - 1, "MS")
    ds["valid_time"] = ds["valid_time"].where(
        ds["leadtime_month"] != shift, shifted
    )
Hi @sandrocalmanti,
I just pushed a few changes, it should be more clear now. Check out the latest version.
The shifting step now belongs to a separate function.
def reindex_seasonal_forecast(ds):
    # Stack starting_time and leadtime_month
    ds = ds.rename(forecast_reference_time="starting_time")
    ds = ds.stack(
        time=("starting_time", "leadtime_month"),
        create_index=False,
    )
    # Shift valid_time
    ds = ds.set_index(time="starting_time")
    valid_time = ds.indexes["time"]
    for shift in set(ds["leadtime_month"].values):
        shifted = ds.indexes["time"].shift(shift - 1, "MS")
        valid_time = valid_time.where(ds["leadtime_month"] != shift, shifted)
    # Reindex: valid_time and starting_month
    coords = {
        "valid_time": ("time", valid_time),
        "starting_month": ("time", ds["time"].dt.month.data),
    }
    ds = ds.assign_coords(coords)
    ds = ds.set_index({"time": tuple(coords)}).unstack("time")
    return ds
We start from a dataset with dimensions (..., forecast_reference_time, leadtime_month), and we end up with a dataset with dimensions (..., valid_time, starting_month). I think forecast_reference_time is the starting_time, whereas leadtime_month is the shift that we need to apply in order to get valid_time. However, leadtime_month=1 means starting_time=valid_time, so the equation would be valid_time = starting_time + (leadtime_month - 1). Does it sound right?
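A quick sanity check of that relation with pandas (toy values only):
import pandas as pd

# valid_time = starting_time shifted by (leadtime_month - 1) month-starts
starting_time = pd.DatetimeIndex(["2016-01-01"])
leadtime_month = 3
valid_time = starting_time.shift(leadtime_month - 1, freq="MS")  # -> 2016-03-01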
I also fixed the anomaly, which now has the dimension starting_month.
Plots are still quick and dirty, but a little improved to deal with categorical data.
Thank you @malmans2, I'm checking the update
This might be a little confusing:
valid_time = valid_time.where(ds["leadtime_month"] != shift, shifted)
It's the same as this:
valid_time = xr.where(ds["leadtime_month"] == shift, shifted, valid_time)
I added the hit-rate as well, and a plot similar to the one we made for the other template. Here is the latest version of the notebook
Did you want the hit rate in %? It's just a count right now.
Great!
Thank you @malmans2
I think it reflects what I wrote yesterday in words.
The equation for the shifting is correct.
The results in the debug plots are also in line with expectations: as for the reanalysis, the tercile occupation for the seasonal forecast shows a trend towards higher-than-normal in recent years, which again suggests the trend removal might be necessary.
Can you explain how this works? I guess this uses some of the characteristics of xarray, right?
hit_rate = (ds_seasonal == ds_reanalysis).sum("valid_year")
And yes, the hit-rate should be in %.
Great job, didn't expect it could be done so fast!
xarray broadcasts the dimensions of the objects compared.
mask = (ds_seasonal == ds_reanalysis)
creates boolean masks for each dataarray in the datasets. Under the hood, xarray broadcasts ds_reanalysis to ds_seasonal, as it is missing the starting_month dimension.
mask.sum("valid_year")
reduces the "valid_year" dimension, summing the boolean values (True is 1, False is 0).
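If it helps, here's the same mechanism on tiny dummy arrays (made-up sizes and values):
import numpy as np
import xarray as xr

# Dummy tercile codes: the seasonal array has an extra starting_month dimension
ds_seasonal = xr.DataArray(
    np.random.choice([-1, 0, 1], size=(12, 24)), dims=("starting_month", "valid_year")
)
ds_reanalysis = xr.DataArray(np.random.choice([-1, 0, 1], size=24), dims="valid_year")
mask = ds_seasonal == ds_reanalysis  # ds_reanalysis is broadcast along starting_month
hit_count = mask.sum("valid_year")  # True counts as 1, False as 0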
Hit-rate is now in percentage, I'll take a look at the trend removal in the afternoon!
Hi @sandrocalmanti,
I added the detrending step (there's a parameter in the top cell that allows you to switch it on/off). Let me know if this is what you had in mind and the implementation is correct.
Here is the latest version of the notebook
Looks great!
And I mean it: great.
Two comments:
SPATIAL MEAN As noted yesterday, I'm wondering if we can move the spatial mean before the detrending. The detrending is now applied to all grid points in the area but we are in fact measuring the hit-rates on the spatial mean. When we apply this to the multimodel and to all areas it may save some time to do the spatial mean first. Is it sufficient to move the spatial mean just after the masking?
SHIFTING
The debug plot on the tercile occupation of the seasonal forecast illustrates well how the individual panels work: some of the forecasts for the valid_months of January to May are associated with the forecast starting in the preceding year. As an example, see the transition between 2009 and 2010: forecasts start with an anomaly "above normal" and remain "above normal" for the entire season.
Similar to the transition 2011-2012, when the entire winter is below normal in the forecast.
Now have a look at the hit rates.
There is a (weak) tendency to get lower hit-rates for longer lead times. It's not a regular signal, there are exceptions (e.g. the forecast starting in April), but it works more or less as expected: longer lead time implies a lower hit rate.
Something strange happens for the forecasts starting after August. In these cases the hit-rate is reported on the top-left of the figure, and January always gets the lowest hit-rate, regardless of the lead time.
It may well be something peculiar to this area and to this model, something that will not happen when using other data, but for now it is worth noting that this is not what I would expect as a result.
Yes! Let's move the spatial mean right after the masking, I think I forgot to move it yesterday. Also, you said you want to use weighted spatial means, right?
We are using the exact same data as your other notebook, so I'd start to cache the entire dataset if that works for you. That way we can check whether the weird behaviour occurs in other centres/areas as well.
Yes, let's have a look at the entire dataset.
I've checked the WMO recommendations on the use of hit-rates for assessing seasonal forecasts and there might be a few minor adjustments to make. But first let's look at the big picture.
Sounds good. The job is running, I'll let you know when it's done.
All set, here it is. I added a simple debug-plot at the end to show a specific variable/centre/region, but we can easily customize it to show various panels depending on your needs.
Let's touch base tomorrow. Buona serata!
Grazie @malmans2
I'm thinking the debug-plots will in fact be part of the assessment in the final notebook because they describe well some distinctive aspects of how seasonal forecasts work. I also see that the issue I noted yesterday on the low hit-rates in January disappears in other areas. Good.
===
The final touch to this nice notebook is one small adjustment I would make to the hit-rate count in order to follow the WMO recommendation.
Instead of counting all tercile matches, I would exclude those where ERA5 and the forecast match on normal conditions, i.e. if the model makes a hit on conditions close to climatology, that's not necessarily an interesting hit.
I guess this can be done here
hit_rate = (ds_seasonal == ds_reanalysis).sum("valid_year")
by subsetting those cases where ds_reanalysis != 0.
How would you best implement this adjustment?
Yes, we can just do this (we don't even need to invalidate the cache).
hit_rate = ((ds_seasonal == ds_reanalysis) & (ds_reanalysis != 0)).sum("valid_year")
In a meeting now, I will change it as soon as it's done and I will remove the WIP status from the notebook/template. Do you want me to cache the other variables as well?
@sandrocalmanti actually, one more question, how do you want to compute the percentage? Do you want to count "normal" values in the denominator or not? Option 1 (hit-rates will be lower):
hit_rate = ((ds_seasonal == ds_reanalysis) & (ds_reanalysis != 0)).sum("valid_year")
hit_rate *= 100 / ds_seasonal.notnull().sum("valid_year")
Option 2 (hit-rates will be higher):
masked_ds_seasonal = ds_seasonal.where(ds_seasonal.isin([-1, 1]))
hit_rate = (masked_ds_seasonal == ds_reanalysis).sum("valid_year")
hit_rate *= 100 / masked_ds_seasonal.notnull().sum("valid_year")
@sandrocalmanti actually, one more question, how do you want to compute the percentage? Do you want to count "normal" values in the denominator or not?
This
Option 2 (hit-rates will be higher):
masked_ds_seasonal = ds_seasonal.where(ds_seasonal.isin([-1, 1]))
hit_rate = (masked_ds_seasonal == ds_reanalysis).sum("valid_year")
hit_rate *= 100 / masked_ds_seasonal.notnull().sum("valid_year")
In a meeting now, I will change it as soon as it's done and I will remove the WIP status from the notebook/template. Do you want me to cache the other variables as well?
Yes, we need to look at other variables, which implies that we need to transform cumulated variables as done for the bias assessment.
S.
Ohh ok, I thought units didn't matter here as we are using percentiles and returning unitless fields.
So, we need to convert precipitation to mm/month right after the weighted spatial mean step, correct?
Wait, you're right.
The conversion is pointless here because the cumulation is already done, it's just a different unit. We don't need to convert.
OK, I think everything is implemented.
I'm now caching all other variables.
Hi @sandrocalmanti,
Everything is cached, I updated the notebooks. Let me know if everything is in good shape.
Hi @malmans2,
looks great and I think I have everything I need to complete this part of the quality assessment. Thank you for dedicating your time to this notebook during the last week. Great job indeed.
You may close this issue if you want and I can come back and re-open just in case it is really needed.
Grazie!
S.
Perfect, always fun and interesting working with you!
Hi @sandrocalmanti,
This template is also ready:
Dear @malmans2
I'm moving forward with the finalization of this notebook after the discussion in Lisbon, and I would like to compare detrended and non-detrended data. Is it possible for you to cache the data by using
# Detrend timeseries
detrend = False
The corresponding data with detrend=True are already cached, and I will proceed with the analysis after the new data are available.
You don't need to modify the code for now, since I'm not sure yet about where the analysis will get us.
Just caching the data will be more than enough for now.
I did a test on a few samples and I'm getting a lot of error messages on unit conversion. Is it something we should worry about?
OK, I will run the notebook as it is but with detrend=False
I did a test on a few samples and I'm getting a lot of error messages on unit conversion. Is it something we should worry about?
I guess you meant warnings? Short answer, don't worry about it. Long answer, these are warning messages thrown by the harmonisation software, developed and maintained by ECMWF. I already reported this issue. They will disappear when the data is cached.
It's flagged as "Error" but it's more appropriately a "Warning" since the process doesn't stop.
Traceback:
Unable to convert from 'Unit('1')' to 'Unit('months')'.
Error while converting 1 to months for forecastMonth.
Units will not be converted.
yup, looks familiar. Don't worry about it
Notebook description
Dear @malmans2,
we agreed with ECMWF on preparing a notebook focused on the hit-rate of seasonal forecasts using a similar strategy as for the assessment of the bias.
Before I start working on the notebook I would like to share with you the overall plan, because it seems to me non-trivial compared to other notebooks we have prepared together. I'd like to be sure we are on the same page about where to go and, most importantly, that I don't make big mistakes in planning the code that would land us in situations that are more difficult to manage.
So here's the plan, and I have a couple of questions for you at the end.
INTRODUCTION
The general idea is simple: to compute the hit-rates, the monthly climatology of the selected ECV (e.g. temperature) is divided into terciles (below normal, normal, above normal) and we count the number of events when the most populated tercile derived from the distribution of the seasonal forecasts corresponds to the observed tercile of the target month.
Seasonal forecasts are assessed against ERA5
PROCESS ERA5
Assuming temperature as the variable of interest, the computation of hit-rates entails the following steps:
- Compute mean monthly temperature anomalies $t_{ERA5}(a,m,y)$ from the reanalysis over SREX areas, where $m=1,\dots,12$ (month) and $y=1996,\dots,2013$ (year in hindcast period), and $a=1,\dots,8$ indicates the SREX area,
- For each month and area, compute the values associated with the 33rd and 66th percentiles, respectively $P33_{ERA5}(a,m)$ and $P66_{ERA5}(a,m)$, of the corresponding monthly distribution,
- Assign each month in the record to the corresponding tercile by comparing each value with the corresponding $P33_{ERA5}(a,m)$ and $P66_{ERA5}(a,m)$. We obtain, for each area $a$, the corresponding array of tercile occupation $TR_{ERA5}(a,m,y)$. For example, the codification of tercile occupation can be: -1 (below normal), 0 (normal), +1 (above normal).
PROCESS SEASONAL FORECASTS
- Repeat step 1 for each seasonal forecast model where $t_{SF}(c,r,a,rm,m,y)$ similar to point 1, but in this case the temperature anomaly depends on the originating centre $c$, on the realization $r$ (ensemble member), and on the reference month $rm$ in which the forecast is initialized. For this step we use the pre-computed monthly anomalies from the CDS.
- Repeat step 2 for each originating centre and for each month in order to obtain the position of the terciles $P33_{SF}(c,a,rm,m)$ and $P66_{SF}(c,a,rm,m)$. In this case the underlying distribution for the computation of the terciles includes all years in the hindcasts, as well as all ensemble members. The $P33$ and $P66$ thresholds maintain their dependence on the reference months so that the following steps account for potential model drift.
- For each year $y$ in the hindcast and for each reference time $rm$, compute the most populated tercile. This is done by counting the number of realizations $r$ in $t_{SF}(c,r,a,rm,m,y)$ (see step 4) inside each interval delimited by $P33_{SF}(c,a,rm,m)$ and $P66_{SF}(c,a,rm,m)$. The result is an array of tercile occupation $TR_{SF}(c,a,rm,m,y)$
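One possible way to code this counting step with xarray (just a sketch on toy data; the 25-member setup and the -1/0/+1 tercile codes are illustrative):
import numpy as np
import xarray as xr

# Toy tercile codes per realization: dims (realization, year), values in {-1, 0, +1}
terciles = xr.DataArray(
    np.random.choice([-1, 0, 1], size=(25, 18)), dims=("realization", "year")
)
# Count realizations falling in each tercile, then pick the most populated one per year
tercile_values = xr.DataArray([-1, 0, 1], dims="tercile", name="tercile")
counts = xr.concat(
    [(terciles == v).sum("realization") for v in (-1, 0, 1)], dim=tercile_values
)
most_populated = counts.idxmax("tercile")  # -1, 0 or +1 for each year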
COMPUTE HIT-RATES
Hit-rates $H(c,a,rm,m)$ are computed by counting the number of years $y$ in which the tercile occupied by the reanalysis corresponds to the most populated tercile in the seasonal forecasts:
$TR_{ERA5}(a,m,y) = TR_{SF}(c,a,rm,m,y)$.
Hit-rates $H$ depend on the target month $m$, on the reference month $rm$ in which forecasts are initialized, on the selected SREX area $a$, and on the originating centre $c$.
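In symbols (with $n_y$ the number of hindcast years and $\mathbf{1}[\cdot]$ equal to 1 when the terciles match, 0 otherwise):

$H(c,a,rm,m) = \sum_{y=1}^{n_y} \mathbf{1}\left[\, TR_{SF}(c,a,rm,m,y) = TR_{ERA5}(a,m,y) \,\right]$

which can then be divided by $n_y$ to express the hit-rate as a fraction of years.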
A WORKPLAN
Here's a workplan and a couple of questions for you.
First of all: is the description of the procedure clear enough? Does it make sense to you?
Second: I would proceed by implementing steps 1 to 6 based on fake random arrays instead of the actual data. Can you think of a simple way to do step 7 without looping over all elements in the array? Do you agree with working on fake data to structure the algorithm?
Let me know your thoughts on the above.
Notebook link or upload
No response
Anything else we need to know?
No response
Environment