Using MetPy to split up testing/training/validation xarray datasets for Machine Learning

ThomasMGeo commented 1 month ago

What should we add?

Creating testing/training/validation datasets is a key step in machine learning workflows. Usually for Climate/Weather ML analysis, we split these datasets on a time dimension.

Scikit-learn's train_test_split does this for 2D arrays / pandas DataFrames, but it can't split xarray datasets.

Improvements on the scikit-learn implementation:

Built for xarray datasets
Can create a validation dataset (a third dataset) in a single call instead of calling the splitter twice
Can split datasets up in a useful way for time series analysis (do not split up datasets randomly for time series analysis!)
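
As a rough illustration of the three points above, here is a minimal sketch of a chronological three-way split. The function name `split_by_time`, its `fractions` and `dim` parameters, and the toy variable `t2m` are all hypothetical; this is not the code from the gist and not an existing MetPy or xarray API.

```python
import numpy as np
import pandas as pd
import xarray as xr


def split_by_time(ds, fractions=(0.7, 0.15, 0.15), dim="time"):
    """Split ds into contiguous train/validation/test blocks along dim.

    Hypothetical helper: blocks are taken in chronological order, never shuffled.
    """
    if not np.isclose(sum(fractions), 1.0):
        raise ValueError("fractions must sum to 1")
    ds = ds.sortby(dim)  # make sure the blocks follow time order
    n = ds.sizes[dim]
    train_end = int(round(n * fractions[0]))
    val_end = train_end + int(round(n * fractions[1]))
    train = ds.isel({dim: slice(0, train_end)})
    val = ds.isel({dim: slice(train_end, val_end)})
    test = ds.isel({dim: slice(val_end, None)})  # whatever remains
    return train, val, test


# Toy dataset: 100 daily values of a made-up variable.
times = pd.date_range("2000-01-01", periods=100, freq="D")
ds = xr.Dataset({"t2m": ("time", np.arange(100.0))}, coords={"time": times})
train, val, test = split_by_time(ds)
```

Because the blocks are contiguous, every training time step precedes every validation time step, which in turn precedes every test time step.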

Big questions:

Where should this go?
Can we use MetPy's xarray accessor (Dataset.metpy.parse_cf()) in a smart way to pull the time dimension automagically? This might not be required anyway.

Reference

No response

ThomasMGeo commented 1 month ago

@anacmontoya and I have been working on a function/notebook that might be a good starting point for this work.

anacmontoya commented 1 month ago

https://gist.github.com/anacmontoya/35156d81fec1fe790b67916d2339d793

Here's the code!

sethmcg commented 1 month ago

1) As a completely naive user, I would expect to find this functionality in the Xarray integration section.

2) That seems like it should be easy. If the data is CF-compliant, you can look first for the coordinate with an axis attribute of "T", then for one with standard_name "time".
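
The lookup order described in 2) can be sketched with plain xarray. The helper name `find_time_coord` is hypothetical, and the sketch assumes the CF attributes are stored in each coordinate's `attrs`; it returns `None` rather than guessing when nothing is tagged.

```python
import pandas as pd
import xarray as xr


def find_time_coord(ds):
    """Return the name of the CF time coordinate, or None if none is tagged.

    Hypothetical helper: checks axis == "T" first, then standard_name == "time".
    """
    for name, coord in ds.coords.items():
        if str(coord.attrs.get("axis", "")).upper() == "T":
            return name
    for name, coord in ds.coords.items():
        if coord.attrs.get("standard_name") == "time":
            return name
    return None


# Two toy datasets whose time coordinate is not literally called "time".
times = pd.date_range("2020-01-01", periods=3, freq="D")
by_axis = xr.Dataset(coords={"valid": ("valid", times, {"axis": "T"})})
by_name = xr.Dataset(coords={"t": ("t", times, {"standard_name": "time"})})
```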

sethmcg commented 1 month ago

As for what to add, a few things jump out at me from a climate modeling perspective:

For the sake of later analysis, we usually split things based on dates, not proportions. So I'd like the ability to specify the splits as, e.g., training = 1980-2012, validate = 2013-2017, test = 2018-2022.
Or 1979-10-01 through 2011-09-30, etc. if you're using water years. So while sometimes you need to specify a full datetime for the split point, it would be nice to be able to just give the year (or year+month) and have it automatically promote from year to year+month to date to date+time as needed.
Climate models often use non-standard calendars, so datetimes should be handled as cftime objects, rather than np.datetime64 objects.
Another possibility is that sometimes you want to do things like train on even years and validate on odd years, and then hold out some other subset for testing, like a chunk at the end, or maybe years divisible by 5.
Or you might want to split randomly by year, but ensure that each split has a decent sampling of the different ENSO phases. (I.e., conditioning the splits based on some external factor.)

I don't think MetPy needs to fully support all of these, but it would be good to have a way of specifying the splits that could accommodate them.
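
For the date-based case, xarray's label-based selection already does some of the year to year+month promotion through partial date strings. Below is a minimal sketch, assuming a pandas-backed (standard calendar) time index; the function name `split_by_dates` and the `periods` mapping are hypothetical. xarray documents similar partial-string indexing for cftime-backed indexes, but that path is not exercised here.

```python
import numpy as np
import pandas as pd
import xarray as xr


def split_by_dates(ds, periods, dim="time"):
    """Select one named subset per (start, stop) pair; both ends are inclusive.

    Hypothetical helper: start/stop may be "YYYY", "YYYY-MM", or full dates.
    """
    return {
        name: ds.sel({dim: slice(start, stop)})
        for name, (start, stop) in periods.items()
    }


# Toy monthly dataset covering 1980 through 2022.
times = pd.date_range("1980-01-01", "2022-12-31", freq="MS")
ds = xr.Dataset({"pr": ("time", np.zeros(times.size))}, coords={"time": times})

splits = split_by_dates(
    ds,
    {"train": ("1980", "2012"), "validate": ("2013", "2017"), "test": ("2018", "2022")},
)

# Year+month strings work the same way, e.g. a single water year.
water_year = split_by_dates(ds, {"wy1981": ("1980-10", "1981-09")})["wy1981"]
```

A bare year as the stop value covers the whole of that year, so `("1980", "2012")` runs through December 2012.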

ThomasMGeo commented 1 month ago

All great points!

When this function was first written, I was mainly trying to match the scikit-learn interface/output of train_test_split, but for xarray.

I think most of your requests are straightforward enough using .isel (you could argue the same for the proposed function :) ).

I do like the idea of adding even/odd year, or more advanced sampling that is not as easily done.
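
One possible shape for the even/odd-year idea is a set of rules applied to the calendar year, where the first matching rule wins. The name `split_by_year_rule` and the `rules` mapping are hypothetical, and the sketch assumes the time coordinate supports xarray's `.dt.year` accessor.

```python
import numpy as np
import pandas as pd
import xarray as xr


def split_by_year_rule(ds, rules, dim="time"):
    """Assign each time step to the first split whose rule matches its year.

    Hypothetical helper: rules maps split name -> function(year array) -> bool array.
    """
    years = ds[dim].dt.year.values
    unassigned = np.ones(years.shape, dtype=bool)
    splits = {}
    for name, rule in rules.items():
        mask = unassigned & rule(years)
        splits[name] = ds.isel({dim: np.flatnonzero(mask)})
        unassigned &= ~mask  # later rules cannot reuse these time steps
    return splits


# Toy dataset: one value per year, 2000 through 2019.
times = pd.date_range("2000-01-01", periods=20, freq="YS")
ds = xr.Dataset({"x": ("time", np.arange(20))}, coords={"time": times})

splits = split_by_year_rule(
    ds,
    {
        "test": lambda y: y % 5 == 0,      # hold out years divisible by 5
        "train": lambda y: y % 2 == 0,     # remaining even years
        "validate": lambda y: y % 2 == 1,  # remaining odd years
    },
)
```

Listing the test rule first keeps the held-out years out of both other splits.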

sethmcg commented 1 month ago

I think a lot of it could be handled by just allowing the user to specify a list of elements for each split instead of the boundaries. What would be really keen is if those lists could contain just years, instead of the full set of datetimes within each year.

The next step beyond that would then be to allow the user to change the date when the year begins/ends, so that you could use water years or winters or whatever depending on what you're studying...
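
A minimal sketch of both ideas together: splits given as lists of years, plus a configurable month in which the year starts. The names `label_years`, `split_by_year_lists`, and `start_month` are hypothetical. The sketch assumes a shifted year is labelled by the calendar year in which it ends, so October 1979 belongs to water year 1980.

```python
import numpy as np
import pandas as pd
import xarray as xr


def label_years(time, start_month=1):
    """Label each time step with its (possibly shifted) year.

    With start_month=10, Oct-Dec are counted toward the following year.
    """
    year = time.dt.year
    if start_month == 1:
        return year  # plain calendar years
    return year + (time.dt.month >= start_month)


def split_by_year_lists(ds, year_lists, dim="time", start_month=1):
    """Build one subset per list of year labels (hypothetical helper)."""
    labels = label_years(ds[dim], start_month).values
    return {
        name: ds.isel({dim: np.flatnonzero(np.isin(labels, list(years)))})
        for name, years in year_lists.items()
    }


# Toy monthly dataset spanning water years 1980 through 1983.
times = pd.date_range("1979-10-01", periods=48, freq="MS")
ds = xr.Dataset({"q": ("time", np.arange(48))}, coords={"time": times})

splits = split_by_year_lists(
    ds,
    {"train": [1980, 1981], "validate": [1982], "test": [1983]},
    start_month=10,
)
```

Setting `start_month=12` would group each December with the following January and February, which is one way to keep winters together.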

Unidata / MetPy

Using MetPy to split up testing/training/validation xarray datasets for Machine Learning #3579
