Open ThomasMGeo opened 1 month ago
@anacmontoya and I have been working on a function/notebook that might be a good starting point for this work.
https://gist.github.com/anacmontoya/35156d81fec1fe790b67916d2339d793
Here's the code!
1) As a completely naive user, I would expect to find this functionality in the Xarray integration section.
2) That seems like it should be easy. If the data is CF-compliant, you can look first for the coordinate with an axis attribute of "T", then for one with standard_name "time".
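For what it's worth, the CF lookup described in point 2 could be sketched roughly like this. `find_time_coord` is a hypothetical helper name, not an existing MetPy or xarray function, and the toy dataset (with a time coordinate deliberately named `valid`) is made up for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr


def find_time_coord(ds):
    """Return the name of the time coordinate based on CF attributes.

    Hypothetical helper: looks for axis == "T" first, then falls back
    to standard_name == "time".
    """
    for name, coord in ds.coords.items():
        if coord.attrs.get("axis") == "T":
            return name
    for name, coord in ds.coords.items():
        if coord.attrs.get("standard_name") == "time":
            return name
    raise ValueError("no coordinate with axis='T' or standard_name='time'")


# Toy dataset whose time coordinate is not literally called "time"
times = pd.date_range("2000-01-01", periods=4, freq="D")
ds = xr.Dataset(
    {"t2m": (("valid", "x"), np.zeros((4, 2)))},
    coords={
        "valid": ("valid", times, {"standard_name": "time"}),
        "x": ("x", [0.0, 1.0], {"axis": "X"}),
    },
)

time_name = find_time_coord(ds)
print(time_name)
```

Raising an error when neither attribute is present is just one option; falling back to a user-supplied dimension name would work as well.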
As for what to add, a few things jump out at me from a climate modeling perspective:
I don't think MetPy needs to fully support all of these, but it would be good to have a way of specifying the splits that could accommodate them.
All great points!
When I first wrote this function, I was mainly trying to match the interface and output of scikit-learn's train_test_split, but for xarray.
I think most of your requests are straightforward enough using .isel (though you could argue the same for the proposed function :) ).
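As a rough illustration of the .isel route, here is a minimal chronological train/validation/test split. The 60/20/20 proportions and the toy `da` array are assumptions chosen for the example, not anything from the proposed function.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Toy series: ten daily time steps
times = pd.date_range("2000-01-01", periods=10, freq="D")
da = xr.DataArray(np.arange(10.0), coords={"time": times}, dims="time")

# Integer arithmetic avoids floating-point surprises in the boundaries
n = da.sizes["time"]
n_train = n * 6 // 10
n_val = n * 2 // 10

# Contiguous blocks in time order: train, then validation, then test
train = da.isel(time=slice(0, n_train))
val = da.isel(time=slice(n_train, n_train + n_val))
test = da.isel(time=slice(n_train + n_val, None))

print(train.sizes["time"], val.sizes["time"], test.sizes["time"])
```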
I do like the idea of adding even/odd year, or more advanced sampling that is not as easily done.
I think a lot of it could be handled by just allowing the user to specify a list of elements for each split instead of the boundaries. What would be really keen is if those lists could contain just years, instead of the full set of datetimes within each year.
The next step beyond that would then be to allow the user to change the date when the year begins/ends, so that you could use water years or winters or whatever depending on what you're studying...
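A minimal sketch of both ideas together, assuming a hypothetical `split_by_years` helper (not an existing MetPy function). The `year_start_month` parameter is also made up here, and shifted years are labeled by the calendar year in which they end, which is the usual US water-year convention.

```python
import numpy as np
import pandas as pd
import xarray as xr


def split_by_years(obj, splits, time_name="time", year_start_month=1):
    """Split obj into named subsets given a list of years for each subset.

    Hypothetical helper. If year_start_month > 1, each "year" starts in
    that month and is labeled by the calendar year in which it ends.
    """
    time = obj[time_name]
    year = time.dt.year
    if year_start_month > 1:
        # Months at or after the start month belong to the next labeled year
        year = year + (time.dt.month >= year_start_month)
    return {
        name: obj.isel({time_name: year.isin(years).values})
        for name, years in splits.items()
    }


# Four calendar years of daily data (2000 is a leap year)
times = pd.date_range("2000-01-01", "2003-12-31", freq="D")
da = xr.DataArray(np.arange(times.size), coords={"time": times}, dims="time")

# Calendar years: only the years are listed, not every datetime
calendar = split_by_years(da, {"train": [2000, 2001], "val": [2002], "test": [2003]})

# Water years starting 1 October
water = split_by_years(da, {"train": [2001, 2002], "test": [2003]}, year_start_month=10)

print({k: v.sizes["time"] for k, v in calendar.items()})
print({k: v.sizes["time"] for k, v in water.items()})
```

Time steps that fall outside every listed year (here, January through September 2000 in the water-year case) are simply left out of all splits. Whether that should warn or raise is an open design choice.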
What should we add?
Creating testing/training/validation datasets is a key step in machine learning workflows. Usually for Climate/Weather ML analysis, we split these datasets on a time dimension.
Scikit-learn has a function that does this for 2D arrays and pandas DataFrames (train_test_split), but it can't split xarray datasets.
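To make the request concrete, here is a minimal sketch of what an xarray analogue might look like. `train_test_split_xr` and its parameters are hypothetical, loosely modeled on the scikit-learn call. It is not the code from the linked gist.

```python
import numpy as np
import pandas as pd
import xarray as xr


def train_test_split_xr(obj, dim="time", test_size=0.25, shuffle=False, seed=None):
    """Split a Dataset or DataArray into (train, test) along one dimension.

    Hypothetical sketch. Without shuffling, the test set is the last
    block along dim, which suits time-ordered data.
    """
    n = obj.sizes[dim]
    n_test = int(np.ceil(test_size * n))
    idx = np.arange(n)
    if shuffle:
        idx = np.random.default_rng(seed).permutation(n)
    # Sort so each subset keeps its original order along dim
    train_idx = np.sort(idx[: n - n_test])
    test_idx = np.sort(idx[n - n_test:])
    return obj.isel({dim: train_idx}), obj.isel({dim: test_idx})


# Toy dataset: 8 daily time steps on a 3-point grid
times = pd.date_range("2000-01-01", periods=8, freq="D")
ds = xr.Dataset(
    {"t2m": (("time", "x"), np.arange(24.0).reshape(8, 3))},
    coords={"time": times, "x": [0, 1, 2]},
)

train, test = train_test_split_xr(ds, dim="time", test_size=0.25)
print(train.sizes["time"], test.sizes["time"])
```

Passing `shuffle=True` samples time steps at random, which can leak information between neighboring times, so a chronological default seems like the safer choice for weather and climate data.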
Improvements on the scikit-learn implementation:
Big questions:
Reference
No response