Unidata / MetPy

MetPy is a collection of tools in Python for reading, visualizing and performing calculations with weather data.
https://unidata.github.io/MetPy/
BSD 3-Clause "New" or "Revised" License

Using MetPy to split up testing/training/validation xarray datasets for Machine Learning #3579

Open ThomasMGeo opened 1 month ago

ThomasMGeo commented 1 month ago

What should we add?

Creating testing/training/validation datasets is a key step in machine learning workflows. Usually for Climate/Weather ML analysis, we split these datasets on a time dimension.

Scikit-learn has a function that does this for 2D arrays and pandas DataFrames (train_test_split), but it can't split xarray datasets.

Improvements on the scikit-learn implementation:

  1. Built for xarray datasets
  2. Can create a validation dataset (a third split) in a single call instead of two
  3. Can split datasets up in a useful way for time series analysis (do not split up datasets randomly for time series analysis!)
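To make the three points above concrete, here is a minimal sketch of a chronological three-way split. The function name split_by_time, its parameters, and the toy variable t2m are all hypothetical; this is not the code from the gist and not an existing MetPy or xarray function. It assumes the dataset has a dimension named "time" and cuts it into contiguous blocks rather than sampling randomly.

```python
import numpy as np
import pandas as pd
import xarray as xr


def split_by_time(ds, train_frac=0.7, val_frac=0.15, dim="time"):
    """Split `ds` into contiguous train/validation/test blocks along `dim`.

    Hypothetical helper for illustration only (not part of MetPy or xarray).
    Whatever is left after the train and validation fractions becomes the
    test block.
    """
    if train_frac <= 0 or val_frac < 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must be positive and leave room for a test block")

    # Sort first so the blocks are chronological even if the input is not.
    ds = ds.sortby(dim)
    n = ds.sizes[dim]
    train_end = int(round(n * train_frac))
    val_end = train_end + int(round(n * val_frac))

    train = ds.isel({dim: slice(0, train_end)})
    val = ds.isel({dim: slice(train_end, val_end)})
    test = ds.isel({dim: slice(val_end, None)})
    return train, val, test


# Toy dataset: 100 daily values of a made-up variable.
times = pd.date_range("2000-01-01", periods=100, freq="D")
ds = xr.Dataset({"t2m": ("time", np.arange(100.0))}, coords={"time": times})

train, val, test = split_by_time(ds)
print(train.sizes["time"], val.sizes["time"], test.sizes["time"])
```

Because the blocks are contiguous, every training time step precedes every validation time step, which avoids leaking future information into training.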

Big questions:

  1. Where should this go?
  2. Can we use MetPy's xarray accessor (xr.Dataset.metpy.parse_cf()) in a smart way to pull the time dimension automagically? This might not be required anyway.

ThomasMGeo commented 1 month ago

@anacmontoya and I have been working on a function/notebook that might be a good starting point for this work.

anacmontoya commented 1 month ago

https://gist.github.com/anacmontoya/35156d81fec1fe790b67916d2339d793

Here's the code!

sethmcg commented 1 month ago

1) As a completely naive user, I would expect to find this functionality in the Xarray integration section.

2) That seems like it should be easy. If the data is CF-compliant, you can look first for the coordinate with an axis attribute of "T", then for one with standard_name "time".
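The lookup order described above could be sketched roughly as follows. The helper name find_time_coord and the toy coordinate names are hypothetical, and this only inspects the axis and standard_name attributes; it makes no claim about how MetPy's own CF parsing is implemented.

```python
import numpy as np
import pandas as pd
import xarray as xr


def find_time_coord(ds):
    """Return the name of the CF time coordinate in `ds`, or None.

    Hypothetical helper for illustration only.
    """
    # First preference: a coordinate whose `axis` attribute is "T".
    for name, coord in ds.coords.items():
        if coord.attrs.get("axis") == "T":
            return name
    # Fallback: a coordinate whose `standard_name` attribute is "time".
    for name, coord in ds.coords.items():
        if coord.attrs.get("standard_name") == "time":
            return name
    return None


# Toy dataset whose time coordinate is not literally called "time".
times = pd.date_range("2000-01-01", periods=4, freq="D")
ds = xr.Dataset(
    {"t2m": (("valid_time", "lat"), np.zeros((4, 2)))},
    coords={
        "valid_time": ("valid_time", times, {"standard_name": "time"}),
        "lat": ("lat", [10.0, 20.0], {"axis": "Y"}),
    },
)

time_name = find_time_coord(ds)
print(time_name)
```

Returning None instead of raising leaves it to the caller to decide whether to fall back to a user-supplied dimension name.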

sethmcg commented 1 month ago

As for what to add, a few things jump out at me from a climate modeling perspective:

I don't think MetPy needs to fully support all of these, but it would be good to have a way of specifying the splits that could accommodate them.

ThomasMGeo commented 1 month ago

All great points!

When this function was first conceived, I was mainly trying to match the scikit-learn interface/output of train_test_split, but for xarray.

I think most of your requests are straightforward enough using .isel (you could argue the same for the proposed function :) ).

I do like the idea of adding even/odd year, or more advanced sampling that is not as easily done.

sethmcg commented 1 month ago

I think a lot of it could be handled by just allowing the user to specify a list of elements for each split instead of the boundaries. What would be really keen is if those lists could contain just years, instead of the full set of datetimes within each year.
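A rough sketch of what "lists of years per split" might look like is below. The function name split_by_years and the shape of the splits argument are assumptions for illustration, not a proposed MetPy API. It also covers the even/odd-year idea mentioned earlier, since any list of years can be passed.

```python
import numpy as np
import pandas as pd
import xarray as xr


def split_by_years(ds, splits, dim="time"):
    """Select one sub-dataset per split from lists of calendar years.

    Hypothetical helper for illustration only. `splits` maps a split name
    to an iterable of years, e.g. {"train": [2000, 2002], "test": [2003]}.
    """
    years = ds[dim].dt.year
    result = {}
    for name, wanted in splits.items():
        # Boolean mask over the time axis: True where the year is requested.
        mask = years.isin(list(wanted)).values
        result[name] = ds.isel({dim: mask})
    return result


# Toy dataset: four full years of daily data for a made-up variable.
times = pd.date_range("2000-01-01", "2003-12-31", freq="D")
ds = xr.Dataset({"t2m": ("time", np.arange(times.size, dtype=float))},
                coords={"time": times})

parts = split_by_years(ds, {"train": [2000, 2002], "val": [2001], "test": [2003]})
print({name: part.sizes["time"] for name, part in parts.items()})
```

The caller only lists years; the full set of datetimes within each year is picked up by the mask.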

The next step beyond that would then be to allow the user to change the date when the year begins/ends, so that you could use water years or winters or whatever depending on what you're studying...
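One way to think about a shifted year boundary is as a relabeling step: each date gets a "season year", and the splits are then defined on those labels. The sketch below uses only the standard library; the function name season_year is hypothetical, and labeling each period by the calendar year in which it ends (the usual US water-year convention) is an assumption that other fields may not share.

```python
from datetime import date


def season_year(d, start_month=10):
    """Label a date with the year of the 12-month period it falls in.

    Hypothetical helper for illustration only. Periods start on the first
    day of `start_month` and are labeled by the calendar year in which they
    end, so with start_month=10 the period 2019-10-01..2020-09-30 is 2020.
    """
    if not 1 <= start_month <= 12:
        raise ValueError("start_month must be between 1 and 12")
    if start_month == 1:
        # Ordinary calendar year: nothing to shift.
        return d.year
    return d.year + 1 if d.month >= start_month else d.year


# Water year (starts in October) and a December-start winter year.
print(season_year(date(2019, 10, 1)))                  # water year
print(season_year(date(2019, 12, 15), start_month=12)) # winter year
```

With xarray, the same arithmetic could be applied to the year and month fields of the time coordinate (ds.time.dt.year and ds.time.dt.month) to produce labels for a year-list split.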