Closed jeblad closed 3 years ago
@jeblad we have on the roadmap the split functionality, but curious to learn more about salts and how would you see using filters. Can you show an example code?
How to implement it depends on whether the order of the chunks can change (it probably will), whether the content of the chunks can change (it probably won't), and whether the samples have ids themselves. (1) I would go for the simpler solution: take the provided salt and derive a new one from the commit id and chunk id. Then draw whether to include each specific sample from the given chunk. (This is a proper randomized selection; it is just forced to be the same selection because of the salt.)
That would be sufficient for most cases, unless samples can be added to chunks (which I understand you don't do). (2) If that is the case, I would compute a recalculated salt, hash each sample, and move all truthy (or falsy) results to the training set. That is pretty slow, but it will maintain existing inclusions. Recalculation is simply taking the integer inverse of the test fraction as a modulus, so a small fraction going to the test set makes the modulus operation hit zero with similar frequency. (This is not a proper randomized selection, but it will recreate the selection except for changes to the set.)
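A minimal sketch of both variants, assuming hypothetical `commit_id`/`chunk_id`/`sample_id` identifiers (any stable per-sample key would do) — this is an illustration of the salted-hash idea, not hub's actual implementation:

```python
import hashlib

def in_test_set(salt: str, commit_id: str, chunk_id: str, sample_id: str,
                test_fraction: float = 0.2) -> bool:
    """Variant (1): decide deterministically whether a sample lands in the test set."""
    # Derive a per-sample value from the salt and the (hypothetical) ids.
    key = f"{salt}:{commit_id}:{chunk_id}:{sample_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Map the first 8 digest bytes to a float in [0, 1).
    value = int.from_bytes(digest[:8], "big") / 2**64
    return value < test_fraction

def in_test_set_mod(salt: str, sample_id: str, test_fraction: float = 0.2) -> bool:
    """Variant (2): use the integer inverse of the fraction as a modulus.

    With test_fraction = 0.2 the modulus is 5, so the hash hits zero for
    roughly one sample in five. Each decision depends only on (salt, id),
    so appending new samples never disturbs existing assignments.
    """
    digest = hashlib.sha256(f"{salt}:{sample_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % round(1 / test_fraction) == 0
```

With the same salt, every invocation reproduces the same split; changing the salt reshuffles it.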
The important thing here is not to create an ideal distribution of samples for training and testing, but to recreate the same set on each invocation. Making a split with a guaranteed similar distribution is darn hard, and I'm not sure it is even possible outside very simple cases.
The problem is: imagine there is a hidden chaotic sequence of samples that has a particular meaning. When you draw samples from that set and, unknown to you, the pseudorandom sequence hits the chaotic sequence, your resulting outcome could be radically different. I'm not sure how this can be avoided.
In my opinion: recreating the sequence, yes; recreating the distribution, no.
gotcha! @AbhinavTuli what do you think?
@jeblad your solution looks interesting. A couple of points from my side:
```python
ds = Dataset("url", schema=my_schema, splits={"train": 0.8, "test": 0.2}, shape=(1000,))
ds_train = ds.get_split("train")  # only has the first 800 samples
```
This should cover a lot of use cases and data retrieval from consecutive chunks should be efficient.
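The consecutive-chunk behavior described above could be sketched like this — `contiguous_splits` is a hypothetical helper mirroring the proposed `splits={...}` argument, not an existing hub function:

```python
def contiguous_splits(n_samples: int, splits: dict) -> dict:
    """Assign each named split a consecutive index range.

    Consecutive ranges mean each split reads from adjacent chunks,
    which is what makes retrieval efficient in the proposal above.
    """
    ranges, start = {}, 0
    for name, fraction in splits.items():
        end = start + round(n_samples * fraction)
        ranges[name] = range(start, end)
        start = end
    return ranges
```

For example, `contiguous_splits(1000, {"train": 0.8, "test": 0.2})` gives `train` the first 800 indices and `test` the remaining 200.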
Hey @jeblad we will push this along - how often do you run into reproducibility issues while using hub? (Asking so we can figure out the right abstraction for the api.)
Closed due to little interest from community. We can reopen when necessary.
🚨🚨 Feature Request
In some cases you want to try out a specific learning algorithm and lock (or fix) the split into test and training sets, so it can be verified at a later time. Once the split is locked, all samples extracted for training and test should be exactly the same, given that the stored (committed) dataset is the same. If new items are added, those items should split the same way for everyone who downloads the set in this particular locked incarnation.
I believe the best way to implement this is to use a common salt, and that salt should make the split unambiguous.
Note that this can be implemented as a filter, but I believe it is best to make it part of the core functionality so it is done consistently for all use cases. It has to do with repeatability when someone checks someone else's published work.
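A filter-style sketch of the idea, assuming a hypothetical `split_dataset` helper and string sample ids — membership depends only on the shared salt and the id, so anyone applying the same salt to the same committed dataset recovers the identical split, and later additions fall deterministically into one side:

```python
import hashlib

def split_dataset(sample_ids, salt: str, test_fraction: float = 0.2):
    """Partition sample ids into (train, test) using only a shared salt."""
    train, test = [], []
    for sid in sample_ids:
        digest = hashlib.sha256(f"{salt}:{sid}".encode("utf-8")).digest()
        value = int.from_bytes(digest[:8], "big") / 2**64
        (test if value < test_fraction else train).append(sid)
    return train, test
```

Because each decision is independent of the rest of the set, growing the dataset only adds new ids to one side or the other; the existing assignments never move.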