databrickslabs / tempo

API for manipulating time series on top of Apache Spark: lagged time values, rolling statistics (mean, avg, sum, count, etc), AS OF joins, downsampling, and interpolation
https://pypi.org/project/dbl-tempo

Timeseries split #414

Closed tnixon closed 2 weeks ago

tnixon commented 3 weeks ago

Changes

Created a subclass of the PySpark ML CrossValidator (see https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.tuning.CrossValidator.html#pyspark.ml.tuning.CrossValidator) that replicates the time series split strategy of scikit-learn's TimeSeriesSplit (see https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html): each fold trains on earlier observations and validates on the block that immediately follows, so no future data leaks into training.
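For readers unfamiliar with the split strategy, here is a minimal sketch of the expanding-window index logic that scikit-learn's TimeSeriesSplit uses by default. It is plain Python and is not the code in this PR: the function name `time_series_split` and its parameters are hypothetical, and it works on row positions of an already time-ordered dataset rather than on a Spark DataFrame.

```python
from typing import Iterator, List, Tuple


def time_series_split(
    n_samples: int, n_splits: int = 5, gap: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) pairs over time-ordered rows.

    Hypothetical helper for illustration only. It mirrors the default
    behaviour of scikit-learn's TimeSeriesSplit: equally sized test
    blocks at the end of the data and a training window that grows
    with each fold.
    """
    n_folds = n_splits + 1
    if n_splits < 1:
        raise ValueError("n_splits must be at least 1")
    if n_folds > n_samples:
        raise ValueError("not enough samples for the requested number of splits")

    # Every test block has the same size; leftover rows go to the first train set.
    test_size = n_samples // n_folds
    first_test_start = n_samples - n_splits * test_size

    for test_start in range(first_test_start, n_samples, test_size):
        # Optionally leave `gap` rows unused between train and test.
        train_end = test_start - gap
        if train_end <= 0:
            raise ValueError("gap is too large for the requested splits")
        train = list(range(train_end))
        test = list(range(test_start, test_start + test_size))
        yield train, test


if __name__ == "__main__":
    for train, test in time_series_split(6, n_splits=3):
        print("train:", train, "test:", test)
    # train: [0, 1, 2] test: [3]
    # train: [0, 1, 2, 3] test: [4]
    # train: [0, 1, 2, 3, 4] test: [5]
```

Unlike the random k-fold assignment that the stock CrossValidator performs, every training window here ends before its test block begins, which is the property a time-series-aware cross-validator needs to preserve.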

Linked issues

Resolves #409

Functionality

Tests

I created an example notebook that uses the new cross-validator to train a GBT regression model on some sample data. The notebook is included as a reference example.

tnixon commented 3 weeks ago

Yeah... I guess that would be a good idea 🙄 😄