konradmayer opened this issue 1 year ago
- mlr3: Dictionary of Performance Measures
- A hint at spatiotemporal resampling: https://mlr3spatiotempcv.mlr-org.com/
As noted there (https://ml4physicalsciences.github.io/2019/files/NeurIPS_ML4PS_2019_75.pdf) and in other literature, power spectral density (PSD) may be better suited than MSE or PSNR (peak signal-to-noise ratio) for evaluating high-resolution features.
The Spanish team's verification script can be found at https://github.com/ECMWFCode4Earth/DeepR/tree/main/deepr/validation/netcdf
and computes well-established skill scores for individual coordinates:
https://github.com/ECMWFCode4Earth/DeepR/blob/main/deepr/validation/netcdf/metrics.py#L50-L56
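For reference, a generic per-coordinate skill score of the form `1 - score(model)/score(reference)` only takes a few lines; this is a minimal NumPy sketch, not the actual code in the linked metrics.py (the choice of RMSE and the toy data are assumptions):

```python
import numpy as np

def rmse(pred, obs, axis=0):
    """Root-mean-square error along the time axis."""
    return np.sqrt(np.mean((pred - obs) ** 2, axis=axis))

def skill_score(pred, ref, obs, axis=0):
    """Generic skill score: 1 - rmse(pred) / rmse(ref).

    1 = perfect, 0 = no improvement over the reference,
    negative = worse than the reference.
    """
    return 1.0 - rmse(pred, obs, axis=axis) / rmse(ref, obs, axis=axis)

# toy data with shape (time, y, x)
rng = np.random.default_rng(0)
obs = rng.normal(size=(100, 4, 4))
ref = obs + rng.normal(scale=1.0, size=obs.shape)   # coarse baseline
pred = obs + rng.normal(scale=0.5, size=obs.shape)  # downscaled model

ss = skill_score(pred, ref, obs)  # one score per grid cell
print(ss.shape)  # (4, 4)
```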
I did some testing on radially averaged PSD using the R package {radialpsd}.
For lead time 12, with a radially averaged 2D Fourier transform per timestep (912 in total), the first plots look as follows (the line is the mean over all timesteps, the shaded area is the min-max range):
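For anyone without R at hand, the radial averaging itself is straightforward; here is a minimal NumPy sketch (not the {radialpsd} implementation, and its binning details may differ):

```python
import numpy as np

def radial_psd(field):
    """Radially averaged power spectral density of a 2D field.

    Returns (wavenumber, mean power), with wavenumber in integer
    radial frequency bins (cycles per domain diagonal).
    """
    ny, nx = field.shape
    # 2D FFT power, zero frequency shifted to the centre
    power = np.abs(np.fft.fftshift(np.fft.fft2(field))) ** 2
    # radial distance (in frequency bins) from the centre
    ky, kx = np.indices((ny, nx))
    r = np.hypot(ky - ny // 2, kx - nx // 2).astype(int)
    # average power within integer radial bins
    sums = np.bincount(r.ravel(), weights=power.ravel())
    counts = np.bincount(r.ravel())
    return np.arange(len(sums)), sums / counts

rng = np.random.default_rng(1)
k, psd = radial_psd(rng.normal(size=(64, 64)))
```

Per timestep this would be called on the 2D field and the resulting curves averaged afterwards, as in the plots.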
I am not familiar with the score, but I guess a skill score is mainly needed when the differences between the methods are small; if they are big enough, I don't think it is necessary, and similar for scaling and normalizing. Regarding the lead times: maybe we can have a summarizing score across the lead times, show a lead-time-wise graphic, and use the power spectrum at just one or two lead times (night or day). Just for my understanding: does PSD penalize a bias?
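Regarding the bias question: since the FFT of `field + c` differs from the FFT of `field` only in the zero-frequency component, a constant bias shows up only at wavenumber zero and leaves the rest of the spectrum untouched. A quick check on toy data:

```python
import numpy as np

rng = np.random.default_rng(2)
field = rng.normal(size=(32, 32))
biased = field + 3.0  # constant bias

p0 = np.abs(np.fft.fft2(field)) ** 2
p1 = np.abs(np.fft.fft2(biased)) ** 2

# only the DC (zero-frequency) component differs; all other
# wavenumbers are identical, so a constant bias is invisible
# in the non-zero part of the spectrum
print(np.allclose(p0.flat[1:], p1.flat[1:]))  # True
print(p0[0, 0] != p1[0, 0])                   # True: DC power differs
```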
It's probably also better not to log-transform the wavenumber, for easier interpretation:
Here, for comparison, is the plot with a logarithmic x axis.
(the last two plots were without scaling and normalization; the one in https://github.com/ECMWFCode4Earth/tesserugged/issues/6#issuecomment-1695269604 was with both)
@r3xth0r, do you think that comparing variograms between CERRA and the models is a useful addition/alternative to PSD?
Pushed my first tests on PSD with https://github.com/ECMWFCode4Earth/tesserugged/commit/011f6635e2214f9b6fa165a802b7d94ffe308306 to its own experimental branch - any suggestions and ideas are very welcome.
Just did some tests with variography. I aggregated the time steps (to seasons), as it's computationally much more expensive than PSD. Here's a test output for lead_time 12.
This is for testing only, but we would interpret from this that spatial variability is underestimated by the downscaled data (samos in this case) in summer, but overestimated in the other seasons compared to CERRA. In general, the shape of the variograms is more or less reproduced.
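For reproducibility, an empirical semivariogram of the kind behind these plots can be sketched in a few lines; this is a minimal NumPy version with toy coordinates and values, and the estimator/binning details of the actual analysis may well differ:

```python
import numpy as np

def empirical_variogram(coords, values, bins):
    """Empirical semivariogram: gamma(h) = half the mean squared
    difference over all pairs whose separation falls in each lag bin."""
    n = len(values)
    i, j = np.triu_indices(n, k=1)                    # all unique pairs
    dist = np.linalg.norm(coords[i] - coords[j], axis=1)
    sqdiff = 0.5 * (values[i] - values[j]) ** 2
    which = np.digitize(dist, bins)
    gamma = np.array([
        sqdiff[which == b].mean() if np.any(which == b) else np.nan
        for b in range(1, len(bins))
    ])
    centers = 0.5 * (bins[:-1] + bins[1:])
    return centers, gamma

# toy field on a regular 10x10 grid
rng = np.random.default_rng(3)
x, y = np.meshgrid(np.arange(10.0), np.arange(10.0))
coords = np.column_stack([x.ravel(), y.ravel()])
values = np.sin(x.ravel() / 3.0) + 0.1 * rng.normal(size=100)
h, gamma = empirical_variogram(coords, values, bins=np.linspace(0.5, 8.5, 9))
```

The brute-force pairing is O(n²), which is consistent with variography being much more expensive than PSD on full grids.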
I think this analysis (same for PSD) is generally only valid for projected coordinates, as otherwise distance is not uniform across space - which is not the case for this plot or the ones above! (added this point to the list in https://github.com/ECMWFCode4Earth/tesserugged/issues/6#issuecomment-1695277450; @r3xth0r, any thoughts on this?)
Here the unit of x is degrees; in the PSD plots above, the wavenumber is diagonal_domain_size(in px)^-1.
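To make the two x axes comparable in physical units: a radial wavenumber k in cycles per domain diagonal corresponds to a wavelength of diagonal/k, which can then be scaled by the grid spacing. A small sketch (the domain size and grid spacing below are placeholders, not the actual values of our grids):

```python
import math

nx, ny = 320, 320    # hypothetical domain size in pixels
deg_per_px = 0.05    # hypothetical grid spacing in degrees per pixel
diag_px = math.hypot(nx, ny)

def wavelength_deg(k):
    """Wavelength in degrees for radial wavenumber k = cycles / diagonal."""
    return diag_px / k * deg_per_px

# lower wavenumbers correspond to larger spatial scales
print(wavelength_deg(1) > wavelength_deg(10))  # True
```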
However, aren't we especially interested in distances < the ERA5 pixel size (0.25), which are not covered at all by the above variograms (first bin at 0.54)? Is it even reasonable to derive a variogram with small enough bins to learn something about the low-distance variability we are mainly interested in?
(1) CRS: you are right. The effect of using geographic instead of projected coordinates might be negligible for small AOIs, but could be considerable on a continental scale.
(2) I am somewhat unsure about the added value of using variograms here. We would probably need to consider anisotropy to some extent, but this might not be straightforward, as it does not occur consistently across the whole area. It's probably sufficient to stick to PSD.
(3) A 3D FT would indeed probably be better suited, but I doubt that the additional effort of a manual implementation is really worth it.
Here is the PSD as in https://github.com/ECMWFCode4Earth/tesserugged/issues/6#issuecomment-1695269604, but stratified by season:
Inspired by yesterday's meeting (thanks, @mc4117), I added PSD for (bilinearly interpolated) ERA5 to this analysis - this is what it looks like for individual timesteps:
We can clearly see that the power spectrum of the downscaled field (samos) closely follows the PS of the CERRA data, while ERA5 shows bigger differences.
This is also the case when averaging over all timesteps:
To get an idea of the variation of these power spectra, here are the median, IQR, and 0.05-0.95 range instead of the mean:
Alternatively - mean by season:
All plots above are for lead time 12.
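If we ever want to condense "how closely a spectrum follows CERRA" into a single number per lead time, a log-spectral distance would be one option. A hedged sketch with synthetic power-law spectra (not our data):

```python
import numpy as np

def log_spectral_distance(psd_a, psd_b):
    """RMS difference of the power spectra in dB (10*log10);
    0 means identical spectra, larger means bigger mismatch."""
    la, lb = 10 * np.log10(psd_a), 10 * np.log10(psd_b)
    return np.sqrt(np.mean((la - lb) ** 2))

# toy spectra: near-match vs. damped high wavenumbers
k = np.arange(1, 33)
cerra = k ** -2.0
samos = 1.1 * k ** -2.0              # nearly matching spectrum
era5 = k ** -2.0 * np.exp(-k / 8)    # damped small scales

print(log_spectral_distance(samos, cerra)
      < log_spectral_distance(era5, cerra))  # True
```

This would pair naturally with a lead-time-wise summary graphic, as suggested earlier in the thread.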
Thanks! It's great to see the comparison
issue for brainstorming and material collection