Open o1lo01ol1o opened 5 years ago
Yeah! Cross-validation is an excellent next step. When working on #22, I was trying to get a rough lay of the land and didn't want to overcomplicate the PR. Toy CV benchmarks like MNIST and the CIFARs come pre-split into test and train sets, so I opted not to add scope creep.
I was hoping that all of the partitionings would operate on `Vector Int`s and be passed into `Dataloader`s. The idea was that, given a `Dataset`, someone could write a function:
```haskell
splits
  :: Vector Int                 -- ^ dataset's index
  -> testspec                   -- ^ TBD
  -> trainspec                  -- ^ TBD
  -> (Vector Int, Vector Int)   -- ^ a test and train split of the indexes
```
And then these `Vector Int` splits could be passed into a `Dataloader`'s `shuffle` field, which just uses `Data.Vector.backpermute` under the hood (here).
I didn't have time to follow up on this, but I was also thinking that it might be nice to refactor `Dataset`s to have a unified streaming API and only have the `Dataloader` handle transforms and shuffling (which might change the API a smidge).
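One possible shape for that refactor, purely as a sketch (the `Dataset` class, `Sample` family, and `VecDataset` instance here are all hypothetical, not the library's current API):

```haskell
{-# LANGUAGE TypeFamilies #-}
import qualified Data.Vector as V
import Data.Vector (Vector)

-- Hypothetical sketch: a unified streaming API for datasets, so the
-- Dataloader alone handles shuffling and transforms.
class Dataset d where
  type Sample d
  size   :: d -> Int
  stream :: d -> Vector Int -> [Sample d]  -- samples in the given index order

-- A toy in-memory dataset showing the shape of an instance.
newtype VecDataset a = VecDataset (Vector a)

instance Dataset (VecDataset a) where
  type Sample (VecDataset a) = a
  size   (VecDataset v)     = V.length v
  stream (VecDataset v) idx = V.toList (V.backpermute v idx)
```

Under this design, shuffling is just the choice of index vector passed to `stream`.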
Looking over the `Dataloader` code, I immediately thought about integrating a private dataset to play with some Haskell code. This made me wonder if anyone has thought about adding a cross-validation layer on top of it. For some canonical datasets there are predefined splits (test, train, validation), but for others one would need to define these. It would be nice if there were some code that could partition given data according to `k-fold` and `leave-p-out` schemes. In the case of time-series datasets, you'd have to make sure that the partitions respect the temporal ordering.
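For the time-series case, a standard technique (not from this thread) is forward chaining: each fold trains on all indexes up to a cutoff and tests on the block that follows, so training data always precedes test data in time. A minimal sketch, assuming the index vector is already time-ordered and using a hypothetical `forwardChain` helper:

```haskell
import qualified Data.Vector as V
import Data.Vector (Vector)

-- Hypothetical sketch: forward-chaining (train, test) splits for a
-- time-ordered index. With k folds the vector is cut into k+1 blocks;
-- fold i trains on blocks 0..i and tests on block i+1, so no test
-- index ever precedes a training index.
forwardChain :: Int -> Vector Int -> [(Vector Int, Vector Int)]
forwardChain k idx = [ splitAt' i | i <- [1 .. k] ]
  where
    n     = V.length idx
    block = n `div` (k + 1)
    splitAt' i =
      let cut  = i * block
          test = if i == k then V.drop cut idx
                           else V.slice cut block idx
      in (V.take cut idx, test)
```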