konradmayer opened this issue 1 year ago
- mlr3: Dictionary of Performance Measures
- A hint at spatiotemporal resampling: https://mlr3spatiotempcv.mlr-org.com/
As noted there (https://ml4physicalsciences.github.io/2019/files/NeurIPS_ML4PS_2019_75.pdf) and in other literature, power spectral density (PSD) may be better suited than MSE or PSNR (peak signal-to-noise ratio) for evaluating high-resolution features.
The Spanish team's verification script can be found at https://github.com/ECMWFCode4Earth/DeepR/tree/main/deepr/validation/netcdf
and computes well-established skill scores for individual coordinates:
https://github.com/ECMWFCode4Earth/DeepR/blob/main/deepr/validation/netcdf/metrics.py#L50-L56
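For reference, a generic per-coordinate skill score of the form `1 - score(model)/score(reference)` only takes a few lines; this is a minimal NumPy sketch, not the actual code in the linked metrics.py (the choice of RMSE and the toy data are assumptions):

```python
import numpy as np

def rmse(pred, obs, axis=0):
    """Root-mean-square error along the time axis."""
    return np.sqrt(np.mean((pred - obs) ** 2, axis=axis))

def skill_score(pred, ref, obs, axis=0):
    """Generic skill score: 1 - rmse(pred) / rmse(ref).

    1 = perfect, 0 = no improvement over the reference,
    negative = worse than the reference.
    """
    return 1.0 - rmse(pred, obs, axis=axis) / rmse(ref, obs, axis=axis)

# toy data with shape (time, y, x)
rng = np.random.default_rng(0)
obs = rng.normal(size=(100, 4, 4))
ref = obs + rng.normal(scale=1.0, size=obs.shape)   # coarse baseline
pred = obs + rng.normal(scale=0.5, size=obs.shape)  # downscaled model

ss = skill_score(pred, ref, obs)  # one score per grid cell
print(ss.shape)  # (4, 4)
```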
I did some testing on radially averaged PSD using the R package {radialpsd}.
For lead time 12, with a radially averaged 2D Fourier transform per timestep (912 in total), the first plots look as follows (the line is the mean over all timesteps, the shaded area is the min-max range):
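For anyone without R at hand, the radial averaging itself is straightforward; here is a minimal NumPy sketch (not the {radialpsd} implementation, and its binning details may differ):

```python
import numpy as np

def radial_psd(field):
    """Radially averaged power spectral density of a 2D field.

    Returns (wavenumber, mean power), with wavenumber in integer
    radial frequency bins (cycles per domain diagonal).
    """
    ny, nx = field.shape
    # 2D FFT power, zero frequency shifted to the centre
    power = np.abs(np.fft.fftshift(np.fft.fft2(field))) ** 2
    # radial distance (in frequency bins) from the centre
    ky, kx = np.indices((ny, nx))
    r = np.hypot(ky - ny // 2, kx - nx // 2).astype(int)
    # average power within integer radial bins
    sums = np.bincount(r.ravel(), weights=power.ravel())
    counts = np.bincount(r.ravel())
    return np.arange(len(sums)), sums / counts

rng = np.random.default_rng(1)
k, psd = radial_psd(rng.normal(size=(64, 64)))
```

Per timestep this would be called on the 2D field and the resulting curves averaged afterwards, as in the plots.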
I am not familiar with the score, but I guess a skill score is mainly needed when the differences between the methods are small; if they are big enough, I don't think it is necessary, and similar for scaling and normalizing. Regarding the lead times: maybe we can have a summarizing score across the lead times, show a lead-time-wise graphic, and use the power spectrum at just one or two lead times (night or day). Just for my understanding: does PSD penalize a bias?
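Regarding the bias question: since the FFT of `field + c` differs from the FFT of `field` only in the zero-frequency component, a constant bias shows up only at wavenumber zero and leaves the rest of the spectrum untouched. A quick check on toy data:

```python
import numpy as np

rng = np.random.default_rng(2)
field = rng.normal(size=(32, 32))
biased = field + 3.0  # constant bias

p0 = np.abs(np.fft.fft2(field)) ** 2
p1 = np.abs(np.fft.fft2(biased)) ** 2

# only the DC (zero-frequency) component differs; all other
# wavenumbers are identical, so a constant bias is invisible
# in the non-zero part of the spectrum
print(np.allclose(p0.flat[1:], p1.flat[1:]))  # True
print(p0[0, 0] != p1[0, 0])                   # True: DC power differs
```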
It's probably also better not to log-transform the wavenumber, for easier interpretation:
Here, for comparison, is the plot with a logarithmic x axis.
(the last two plots were without scaling and normalization; the one in https://github.com/ECMWFCode4Earth/tesserugged/issues/6#issuecomment-1695269604 was with both)
@r3xth0r, do you think that comparing variograms between CERRA and the models is a useful addition/alternative to PSD?
Pushed my first tests on PSD with https://github.com/ECMWFCode4Earth/tesserugged/commit/011f6635e2214f9b6fa165a802b7d94ffe308306 to its own experimental branch - any suggestions and ideas are very welcome.
Just did some tests with variography. I aggregated the time steps (to seasons), as it's computationally much more expensive than PSD. Here's a test output for lead_time 12.
This is for testing only, but we would interpret from this that spatial variability is underestimated by the downscaled data (samos in this case) in summer, but overestimated in the other seasons compared to CERRA. In general, the shape of the variograms is more or less reproduced.
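For reproducibility, an empirical semivariogram of the kind behind these plots can be sketched in a few lines; this is a minimal NumPy version with toy coordinates and values, and the estimator/binning details of the actual analysis may well differ:

```python
import numpy as np

def empirical_variogram(coords, values, bins):
    """Empirical semivariogram: gamma(h) = half the mean squared
    difference over all pairs whose separation falls in each lag bin."""
    n = len(values)
    i, j = np.triu_indices(n, k=1)                    # all unique pairs
    dist = np.linalg.norm(coords[i] - coords[j], axis=1)
    sqdiff = 0.5 * (values[i] - values[j]) ** 2
    which = np.digitize(dist, bins)
    gamma = np.array([
        sqdiff[which == b].mean() if np.any(which == b) else np.nan
        for b in range(1, len(bins))
    ])
    centers = 0.5 * (bins[:-1] + bins[1:])
    return centers, gamma

# toy field on a regular 10x10 grid
rng = np.random.default_rng(3)
x, y = np.meshgrid(np.arange(10.0), np.arange(10.0))
coords = np.column_stack([x.ravel(), y.ravel()])
values = np.sin(x.ravel() / 3.0) + 0.1 * rng.normal(size=100)
h, gamma = empirical_variogram(coords, values, bins=np.linspace(0.5, 8.5, 9))
```

The brute-force pairing is O(n²), which is consistent with variography being much more expensive than PSD on full grids.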
I think this analysis (same for PSD) is generally only valid for projected coordinates, as otherwise distance is not uniform across space - which is not the case for this plot or the ones above! (added this point to the list in https://github.com/ECMWFCode4Earth/tesserugged/issues/6#issuecomment-1695277450; @r3xth0r, any thoughts on this?)
Here the unit of x is degrees; in the PSD plots above, the wavenumber is diagonal_domain_size(in px)^-1.
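To make the two x axes comparable in physical units: a radial wavenumber k in cycles per domain diagonal corresponds to a wavelength of diagonal/k, which can then be scaled by the grid spacing. A small sketch (the domain size and grid spacing below are placeholders, not the actual values of our grids):

```python
import math

nx, ny = 320, 320    # hypothetical domain size in pixels
deg_per_px = 0.05    # hypothetical grid spacing in degrees per pixel
diag_px = math.hypot(nx, ny)

def wavelength_deg(k):
    """Wavelength in degrees for radial wavenumber k = cycles / diagonal."""
    return diag_px / k * deg_per_px

# lower wavenumbers correspond to larger spatial scales
print(wavelength_deg(1) > wavelength_deg(10))  # True
```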
However, aren't we especially interested in distances < the ERA5 pixel size (0.25), which are not covered at all by the above variograms (first bin at 0.54)? Is it even reasonable to derive a variogram with small enough bins to learn something about the low-distance variability we are mainly interested in?
(1) CRS: you are right. The effect of using geographic instead of projected coordinates might be negligible for small AOIs, but could be considerable on a continental scale.
(2) I am somewhat unsure about the added value of using variograms here. We would probably need to consider anisotropy to some extent, but this might not be straightforward, as it does not occur consistently across the whole area. It's probably sufficient to stick to PSD.
(3) A 3D FT would indeed probably be better suited, but I doubt that the additional effort of a manual implementation is really worth it.
Here is the PSD as in https://github.com/ECMWFCode4Earth/tesserugged/issues/6#issuecomment-1695269604, but stratified by season:
Inspired by yesterday's meeting (thanks, @mc4117), I added PSD for (bilinearly interpolated) ERA5 to this analysis - this is what it looks like for individual timesteps:
We can clearly see that the power spectrum of the downscaled field (samos) closely follows the PS of the CERRA data, while ERA5 shows bigger differences.
This is also the case when averaging over all timesteps:
To get an idea of the variation of these power spectra, here are the median, IQR, and 0.05-0.95 range instead of the mean:
Alternatively - mean by season:
All plots above are for lead time 12.
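If we ever want to condense "how closely a spectrum follows CERRA" into a single number per lead time, a log-spectral distance would be one option. A hedged sketch with synthetic power-law spectra (not our data):

```python
import numpy as np

def log_spectral_distance(psd_a, psd_b):
    """RMS difference of the power spectra in dB (10*log10);
    0 means identical spectra, larger means bigger mismatch."""
    la, lb = 10 * np.log10(psd_a), 10 * np.log10(psd_b)
    return np.sqrt(np.mean((la - lb) ** 2))

# toy spectra: near-match vs. damped high wavenumbers
k = np.arange(1, 33)
cerra = k ** -2.0
samos = 1.1 * k ** -2.0              # nearly matching spectrum
era5 = k ** -2.0 * np.exp(-k / 8)    # damped small scales

print(log_spectral_distance(samos, cerra)
      < log_spectral_distance(era5, cerra))  # True
```

This would pair naturally with a lead-time-wise summary graphic, as suggested earlier in the thread.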
Thanks! It's great to see the comparison
issue for brainstorming and material collection