activeloopai / deeplake

Database for AI. Store Vectors, Images, Texts, Videos, etc. Use with LLMs/LangChain. Store, query, version, & visualize any AI data. Stream data in real-time to PyTorch/TensorFlow. https://activeloop.ai
Mozilla Public License 2.0

[FEATURE] Lock split in test and training with a "salt" argument #569

Closed jeblad closed 3 years ago

jeblad commented 3 years ago

🚨🚨 Feature Request

In some cases you want to try out a specific learning algorithm and lock (or fix) the split into test and training sets, so it can be verified at a later time. Once the split is locked, all samples extracted for training and test should be exactly the same, given that the stored (committed) dataset is the same. If new items are added, those items should split the same way for everyone who downloads the set in this particular locked incarnation.

I believe the best way to implement this is to use a common salt, and that salt should determine the split unambiguously.

Note that this can be implemented as a filter, but I believe it is best done as part of the core functionality so it works consistently for all use cases. It comes down to repeatability when someone checks someone else's published work.

davidbuniat commented 3 years ago

@jeblad we have split functionality on the roadmap, but I'm curious to learn more about salts and how you would use filters. Can you show some example code?

jeblad commented 3 years ago

How to implement it depends on whether the order of the chunks will change (it probably will), whether the content of the chunks can change (it probably won't), and whether the samples have ids themselves. (1) I guess I would go for the simpler solution: use the provided salt and create a new one from the commit id and chunk id, then draw whether to include a specific sample from the given chunk. (This is a proper randomized selection; it is just forced to be the same selection every time due to the salt.)
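
A rough sketch of approach (1), assuming hypothetical `global_salt`, `commit_id` and `chunk_id` values and a known number of samples per chunk; none of these are existing hub/deeplake API objects:

```python
import hashlib
import random

def split_chunk(global_salt: str, commit_id: str, chunk_id: str,
                num_samples: int, test_fraction: float = 0.2):
    """Deterministically assign each sample index in a chunk to train or test.

    The per-chunk salt is derived from the global salt, the commit id and the
    chunk id, so the same committed dataset always yields the same split.
    """
    # Derive a reproducible seed for this chunk.
    per_chunk_salt = f"{global_salt}:{commit_id}:{chunk_id}".encode("utf-8")
    seed = int.from_bytes(hashlib.sha256(per_chunk_salt).digest()[:8], "big")
    rng = random.Random(seed)

    train, test = [], []
    for index in range(num_samples):
        # A proper randomized draw, forced to be identical on every
        # invocation because the seed depends only on the salt and the ids.
        (test if rng.random() < test_fraction else train).append(index)
    return train, test
```

Calling it twice with the same arguments, e.g. `split_chunk("my-salt", "commit-a1b2", "chunk-0007", 100)`, returns identical index lists on every invocation.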

That would be sufficient for most cases, unless samples can be added to chunks (which I understand you don't do). (2) If that is the case, I would derive a recalculated salt, hash each sample with it, and move everything truthy (or falsy) to the training set. That is pretty slow, but it will maintain existing inclusions. The recalculation is simply to take an integer inverse of the test fraction, so a small fraction going to the test set makes the modulus operation hit zero with similar frequency. (This is not a proper randomized selection, but it will recreate the selection except for changes to the set.)
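
A rough sketch of approach (2): hash each sample's content together with the salt and use the integer inverse of the test fraction as the modulus. Here `sample_bytes` is a hypothetical serialized sample, not an existing hub/deeplake call:

```python
import hashlib

def assign_sample(global_salt: str, sample_bytes: bytes,
                  test_fraction: float = 0.2) -> str:
    """Return "test" or "train" for one sample, independent of sample order.

    The modulus is the integer inverse of the test fraction, so roughly one
    sample in round(1 / test_fraction) hashes to zero and goes to the test set.
    """
    modulus = max(1, round(1.0 / test_fraction))
    digest = hashlib.sha256(global_salt.encode("utf-8") + sample_bytes).digest()
    value = int.from_bytes(digest[:8], "big")
    # Not a proper randomized selection, but stable: a sample's assignment
    # never changes unless the sample content or the salt changes.
    return "test" if value % modulus == 0 else "train"
```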

The important thing here is not to create an ideal distribution of samples for training and testing, but to recreate the same set on each invocation. Making a split with a guaranteed similar distribution is darn hard, and I'm not sure it is even possible except in very simple cases.

The problem is: imagine there is a hidden chaotic sequence of samples that has a particular meaning. When you draw samples from that set and, unknown to you, the pseudorandom sequence hits the chaotic sequence, your resulting outcome could be radically different. I'm not sure how this can be avoided.

In my opinion: Recreating the sequence – yes, recreating the distribution – no.

davidbuniat commented 3 years ago

gotcha! @AbhinavTuli what do you think?

AbhinavTuli commented 3 years ago

@jeblad your solution looks interesting. A few points from my side:

  1. Choosing certain samples from a chunk and discarding the rest could be a little wasteful, as the entire chunk would still be retrieved. While that is what happens after filtering, ideally we should store the output of the filter as a separate dataset itself if we want to train on it efficiently (storing it would put everything in consecutive chunks). Alternatively, a better way might be to simply take or discard all the samples in a given chunk (see the sketch after this list).
  2. Could you elaborate on why you feel that the order of chunks would change?
  3. Samples can also be added to chunks, by the way. For instance, if we're creating a chunk of 5 images we might assign only 3, then come back later and add the remaining 2 images to the same chunk.
  4. What are your thoughts on something simpler like:
    ds = Dataset("url", schema=my_schema, splits={"train": 0.8, "test": 0.2}, shape=(1000,))
    ds_train = ds.get_split("train") # only has the first 800 samples

    This should cover a lot of use cases, and data retrieval from consecutive chunks should be efficient.
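
As a rough sketch of the take-or-discard-whole-chunks alternative from point 1, reusing the salted-hash idea from above; `global_salt`, `commit_id` and `chunk_ids` are hypothetical placeholders, not existing hub objects:

```python
import hashlib

def split_chunks(global_salt: str, commit_id: str, chunk_ids,
                 test_fraction: float = 0.2):
    """Assign every chunk (and therefore all of its samples) to train or test."""
    train, test = [], []
    for chunk_id in chunk_ids:
        key = f"{global_salt}:{commit_id}:{chunk_id}".encode("utf-8")
        # First 8 digest bytes interpreted as a number in [0, 2**64).
        value = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
        # The whole chunk goes to one side, so nothing retrieved is discarded.
        (test if value / 2**64 < test_fraction else train).append(chunk_id)
    return train, test
```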

jeblad commented 3 years ago
  1. Don't take my rough sketch as any advice on how to store the down-filtered chunks! =)
  2. Most storage engines don't retain order for blobs unless you somehow force them to. I don't have enough knowledge about your solution on that point. Recreating a salt for a specific chunk seems to be a low-cost operation, so I would guess the cost of making the code a bit more robust is worth it.
  3. If you can add samples to chunks, then my second method (I've added numbers in the post above) is what would retain the sequence as closely as possible. All previous samples will be filtered the same way as before, recreating the sequence, but any new samples will be added to the sequence according to their processing.
  4. This does not recreate the sequence, just some sequence with (hopefully) the same properties?

mynameisvinn commented 3 years ago

Hey @jeblad, we will push this along - how often do you run into reproducibility issues while using hub? (Asking so we can figure out the right abstraction for the API.)

mynameisvinn commented 3 years ago

Closed due to little interest from the community. We can reopen when necessary.