DataHaskell / dh-core

Functional data science

Cross validation layer #42


o1lo01ol1o commented 5 years ago

Looking over the Dataloader code, I immediately thought about bringing in a private dataset to play with some Haskell code. That made me wonder whether anyone has thought about adding a cross-validation layer on top of it. Some canonical datasets come with predefined splits (test, train, validation), but for others one would need to define them.

It would be nice if there were some code that could partition a given dataset according to k-fold and leave-p-out schemes. For time-series datasets, you would also have to make sure that the partitions respect the temporal ordering.
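To make this concrete, here is a minimal sketch (not actual dh-core code; the module and function names are made up) of index-level splitters that operate on a Vector Int. The leave-p-out version is simplified to a rolling window rather than the full combinatorial scheme:

module CrossVal where

import           Data.Vector (Vector)
import qualified Data.Vector as V

-- Partition an index vector into k (train, test) pairs, one per fold;
-- the last fold absorbs any remainder.
kFolds :: Int -> Vector Int -> [(Vector Int, Vector Int)]
kFolds k ixs =
  [ (V.take lo ixs V.++ V.drop hi ixs, V.slice lo (hi - lo) ixs)
  | f <- [0 .. k - 1]
  , let lo = f * foldSize
        hi = if f == k - 1 then n else lo + foldSize
  ]
  where
    n        = V.length ixs
    foldSize = n `div` k

-- Rolling-window stand-in for leave-p-out: every contiguous block of p
-- indices becomes a test set once.
leavePOut :: Int -> Vector Int -> [(Vector Int, Vector Int)]
leavePOut p ixs =
  [ (V.take i ixs V.++ V.drop (i + p) ixs, V.slice i p ixs)
  | i <- [0 .. V.length ixs - p]
  ]

-- Time-series split: train strictly precedes test, so the temporal
-- ordering is respected.
timeSeriesSplit :: Int -> Vector Int -> (Vector Int, Vector Int)
timeSeriesSplit cut ixs = (V.take cut ixs, V.drop cut ixs)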

stites commented 5 years ago

Yeah! Cross-validation is an excellent next step. When working on #22, I was trying to get a rough lay of the land and didn't want to overcomplicate the PR. Toy CV benchmarks like MNIST and the CIFARs come pre-split into test and train sets, so I opted to avoid the scope creep.

I was hoping that all of the partitioning schemes would operate on Vector Int indices that get passed into Dataloaders. The idea was that, given a Dataset, someone could write a function:

splits
  :: Vector Int     -- ^ dataset's index
  -> testspec       -- ^ TBD
  -> trainspec      -- ^ TBD
  -> (Vector Int, Vector Int)  -- ^ a test and train split of the indexes

These Vector Int splits could then be passed into a Dataloader's shuffle field, which just uses Data.Vector.backpermute under the hood.
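For illustration, a rough sketch of that wiring (holdout is a hypothetical split helper, and selecting rows by index is just a backpermute, as it would be inside the Dataloader):

import           Data.Vector (Vector)
import qualified Data.Vector as V

-- Hypothetical hold-out split over a dataset's index vector.
holdout :: Double -> Vector Int -> (Vector Int, Vector Int)
holdout ratio ixs = V.splitAt cut ixs
  where cut = floor (ratio * fromIntegral (V.length ixs))

-- Selecting the rows for one split is a backpermute over the index vector.
selectRows :: Vector a -> Vector Int -> Vector a
selectRows = V.backpermute

demo :: (Vector Char, Vector Char)
demo =
  let rows          = V.fromList "abcdefghij"                 -- toy "dataset"
      (train, test) = holdout 0.8 (V.enumFromN 0 (V.length rows))
  in (selectRows rows train, selectRows rows test)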

I didn't have time to follow up on this, but I was also thinking it might be nice to refactor Datasets to expose a unified streaming API, and to have the Dataloader alone handle transforms and shuffling (which might change the API a smidge).
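Purely as an illustration of that split of responsibilities (none of these types are dh-core's actual API), the shape could be something like:

import           Data.Vector (Vector)
import qualified Data.Vector as V

-- A Dataset only knows how to produce its rows; a real version would
-- stream them rather than load everything into memory.
data Dataset a = Dataset
  { rowCount   :: Int
  , streamRows :: IO (Vector a)
  }

-- The Dataloader layers index selection (a shuffle or a CV split) and a
-- per-row transform on top of whatever the Dataset yields.
data Dataloader a b = Dataloader
  { shuffle   :: Vector Int   -- e.g. one half of a (train, test) split
  , transform :: a -> b
  }

runLoader :: Dataloader a b -> Dataset a -> IO (Vector b)
runLoader dl ds = do
  rows <- streamRows ds
  pure (V.map (transform dl) (V.backpermute rows (shuffle dl)))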