New preprocessor: Distance metrics dataset vs. reference

ESMValGroup / ESMValCore

ESMValCore: A community tool for pre-processing data from Earth system models in CMIP and running analysis scripts.

https://www.esmvaltool.org

Apache License 2.0

42 stars 38 forks source link

New preprocessor: Distance metrics dataset vs. reference #2266

Closed schlunma closed 5 months ago

schlunma commented 10 months ago

Is your feature request related to a problem? Please describe.

To facilitate the comparison of different models against each other or against obersvations, it would be really helpful to have a preprocessor that calculates distance metrics of two datasets. Example for distance metrics can be the RMSE or the Pearson correlation coefficient.

I am thinking of the following signature for the new preprocessor:

def distance_metric(
    products: set[PreprocessorFile] | Iterable[Cube],
    metric: MetricType,
    reference: Optional[Cube] = None,
    coords: Iterable[Coord] | Iterable[str] | None = None,
    keep_reference_dataset: bool = True,
    **kwargs,
) -> set[PreprocessorFile] | CubeList:

When used within a recipe, reference datasets should be marked with reference_for_metric: true. This signature also allows an easy usage of the preprocessor outside of ESMValTool

Would you be able to help out?

Yes!

@axel-lauer @hb326 @LisaBock

valeriupredoi commented 10 months ago

yes I think @bouweandela plopped a Pearson test in an example recipe once. And I remember I barked at the poor guy that we should probably look at a Kolmogorov-Smirnoff distance plot better than a Pearson one :grin:

schlunma commented 9 months ago

Draft PR is available here: #2299

bouweandela commented 8 months ago

Would reference instead of ref_cube work as a name for the reference cube/dataset? Avoiding abbreviations makes things easier to understand for newcomers (who, e.g. when using just a recipe, may not even know what a cube is).

schlunma commented 8 months ago

Would reference instead of ref_cube work as a name for the reference cube/dataset? Avoiding abbreviations makes things easier to understand for newcomers (who, e.g. when using just a recipe, may not even know what a cube is).

I actually like having the expected type of the variable in its name (here: cube). Also, this argument is not necessary in the recipe, but only if you want to use that function outside of ESMValTool in a Python script or notebook. Users that do that should be (more) experienced Python users, so I would expect them to be capable of reading API docs, which explain the variable in detail. Thus, would reference_cube instead of ref_cube work for you?

bouweandela commented 8 months ago

But what if at some point in the future we would like to make it possible that the reference could be something else too? Using the type in the name was quite popular at some point (just Google Hungarian notation), not it's not considered good practice anymore, mostly because it makes it difficult to change code over time.

schlunma commented 8 months ago

Well, then it seems that almost all of our preprocessor functions use that bad practice (most use the argument cube). In general, I agree that Hungarian notation is problematic, but calling a variable ref_cube can IMHO hardly be considered as following the Hungarian notation principles.

Anyway, will change this.

bouweandela commented 8 months ago

Well, then it seems that almost all of our preprocessor functions use that bad practice (most use the argument cube)

I know, I couldn't think of anything better at the time.

schlunma commented 8 months ago

Renamed in https://github.com/ESMValGroup/ESMValCore/pull/2299/commits/58dd3c53bffad63fad037c02c3a9df8f28114a2d. I also renamed the corresponding argument of the bias preprocessor in https://github.com/ESMValGroup/ESMValCore/pull/2299/commits/b36af0153ad6d790658c6aadf0c3cc17afa99374. Since this hasn't been released yet, we don't need a deprecation cycle.