eth-easl / modyn

Modyn is a research platform for training ML models on growing datasets.
MIT License

Allow users to specify whether to only shuffle partition order or data within partitions #460

Open MaxiBoether opened 3 months ago

MaxiBoether commented 3 months ago

In #456, we introduce a shuffle pipeline parameter that shuffles both the order of partitions within each worker and the data within each partition, for more randomness. However, shuffling within a partition requires fetching the entire partition before yielding anything to the data loader. We need to investigate the performance overhead of this when running Criteo and CLOC. Instead of a boolean that only chooses between shuffling as much as possible and not shuffling at all, we should let users pick between different shuffling variants.
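To make the variants concrete, here is a minimal sketch of how a non-boolean option could behave. It is not Modyn's actual code: `ShuffleMode`, `iter_samples`, and `fetch_partition` are hypothetical names, and a partition is modeled as a plain iterator of sample keys. The point is that only the full variant has to buffer a partition before yielding.

```python
import random
from enum import Enum
from typing import Callable, Iterator, Sequence


class ShuffleMode(Enum):  # hypothetical name, not an existing Modyn type
    NONE = "none"                        # keep partition order and sample order
    PARTITION_ORDER = "partition_order"  # shuffle only the order of partitions
    FULL = "full"                        # also shuffle samples within each partition


def iter_samples(
    partition_ids: Sequence[int],
    fetch_partition: Callable[[int], Iterator[int]],  # assumed: streams one partition
    mode: ShuffleMode,
    seed: int = 0,
) -> Iterator[int]:
    rng = random.Random(seed)
    order = list(partition_ids)
    if mode is not ShuffleMode.NONE:
        rng.shuffle(order)
    for pid in order:
        if mode is ShuffleMode.FULL:
            # The whole partition must be materialized before the first yield.
            buffered = list(fetch_partition(pid))
            rng.shuffle(buffered)
            yield from buffered
        else:
            # Early yield: samples are passed on as soon as they arrive.
            yield from fetch_partition(pid)
```

With `ShuffleMode.PARTITION_ORDER`, the first sample can reach the data loader as soon as storage returns it, which is the overhead difference we would want to measure on Criteo and CLOC.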

When implementing more lightweight shuffling, we could think about supporting file-level shuffling at the storage: we can still order by file ID, but then process the files in random order. For files that contain a single sample, this would have the same effect as buffering and shuffling the entire partition, while still allowing the early-yield logic.
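A rough sketch of the file-level idea, under the assumption that storage can read one file at a time. `iter_file_shuffled` and `read_file` are hypothetical names, not existing Modyn functions. Files are sorted by ID first so that the shuffle is reproducible for a given seed, then visited in random order, and each file's samples are yielded as soon as that file has been read.

```python
import random
from typing import Callable, Iterator, Sequence


def iter_file_shuffled(
    file_ids: Sequence[int],
    read_file: Callable[[int], Sequence[str]],  # assumed: returns the samples of one file
    seed: int = 0,
) -> Iterator[str]:
    # Deterministic base order by file ID, then a seeded shuffle of that order.
    order = sorted(file_ids)
    random.Random(seed).shuffle(order)
    for file_id in order:
        # Only the current file is held in memory, so the first samples
        # are yielded before the remaining files have been read.
        yield from read_file(file_id)
```

If every file holds exactly one sample, the output is a uniformly random permutation of the samples, which matches what buffering and shuffling the whole partition gives. With multi-sample files, the samples of one file stay adjacent and in their original order, so the randomness is coarser.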