databrickslabs / tempo

API for manipulating time series on top of Apache Spark: lagged time values, rolling statistics (mean, avg, sum, count, etc), AS OF joins, downsampling, and interpolation

https://pypi.org/project/dbl-tempo

Timeseries split #414

Closed: tnixon closed this 2 weeks ago

tnixon commented 3 weeks ago

Changes

Created a subclass of the PySpark ML CrossValidator (see https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.tuning.CrossValidator.html#pyspark.ml.tuning.CrossValidator) that replicates the time series split strategy of scikit-learn's TimeSeriesSplit (see https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html). In that strategy each fold trains on earlier observations and validates on the observations that immediately follow, so a model is never evaluated on data older than its training set.
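For reference, here is a minimal pure-Python sketch of the expanding-window split scheme that scikit-learn documents for TimeSeriesSplit (default settings, no gap and no maximum training size). It is not the code in this PR: the function name `time_series_split_indices` is hypothetical, and it works on row positions of an already time-ordered dataset, not on a Spark DataFrame.

```python
from typing import Iterator, List, Tuple


def time_series_split_indices(
    n_samples: int, n_splits: int = 5
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for an expanding-window split.

    Hypothetical helper for illustration only; assumes the rows are
    already sorted by time, oldest first.
    """
    if n_splits < 1:
        raise ValueError("n_splits must be at least 1")

    # Every test fold has the same size; leftover rows go to the first train fold.
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("too few samples for the requested number of splits")

    first_test_start = n_samples - n_splits * test_size
    for test_start in range(first_test_start, n_samples, test_size):
        train = list(range(0, test_start))  # everything before the test fold
        test = list(range(test_start, test_start + test_size))
        yield train, test


folds = list(time_series_split_indices(n_samples=6, n_splits=5))
for train, test in folds:
    print("train:", train, "test:", test)
```

Unlike the random k-fold partitioning of a standard cross-validator, the training window here only ever grows forward in time, and each test fold sits strictly after it.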

Linked issues

Resolves #409

Functionality

[ ] added relevant user documentation
[X] added a new Class method
[ ] modified existing Class method: ...
[ ] added a new function
[ ] modified existing function: ...
[ ] added a new test
[ ] modified existing test: ...
[X] added a new example
[ ] modified existing example: ...
[ ] added a new utility
[ ] modified existing utility: ...

Tests

I created an example notebook that uses the new cross-validator to train a GBT regression model on some sample data. The notebook is included as a reference example.

[X] manually tested
[ ] added unit tests
[X] added integration tests
[ ] verified on staging environment (screenshot attached)

tnixon commented 3 weeks ago

Yeah... I guess that would be a good idea 🙄 😄