Vectorrent closed this 1 month ago
For now, I can just set `reload_dataloaders_every_n_epochs=0`, and DataLoaders will only ever be loaded once, at the start of training. This is not a perfect solution, because exhausted StaticDatasets are not reshuffled after the first epoch, meaning that batches are always compared against the same data. We want to shuffle every epoch, so that batches are always compared against random batches.
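One possible way to get per-epoch shuffling without reloading the DataLoaders is to reshuffle the index order from a seed that changes with the epoch. The sketch below is only an illustration under assumptions: `EpochShuffledIndices` and `set_epoch` are hypothetical names, not something that exists in this repo, and the plain-Python loop stands in for the training loop.

```python
import random


class EpochShuffledIndices:
    """Hypothetical helper: yields a different index permutation per epoch.

    The object is created once; only the epoch number changes, so the
    DataLoader that consumes it would not need to be reloaded.
    """

    def __init__(self, size, seed=0):
        self.size = size
        self.seed = seed
        self.epoch = 0

    def set_epoch(self, epoch):
        # Call this at the start of every epoch.
        self.epoch = epoch

    def __iter__(self):
        # Seeding with (seed + epoch) makes each epoch's order reproducible.
        rng = random.Random(self.seed + self.epoch)
        order = list(range(self.size))
        rng.shuffle(order)
        return iter(order)

    def __len__(self):
        return self.size


sampler = EpochShuffledIndices(size=100, seed=42)
orders = []
for epoch in range(3):
    sampler.set_epoch(epoch)
    orders.append(list(sampler))
```

Each entry in `orders` is a full permutation of the indices, and repeating an epoch number reproduces the same order. Something like this could be passed as the `sampler` of a map-style DataLoader, but that wiring is untested here.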
Nevertheless, for now it's more important to ensure that we explore the large, supplemental StreamingDatasets completely.
This has been largely addressed. I moved all the disparate loaders into two unified DataModules: 1) `LocalDataset` and 2) `StreamingDataset`. We can extend those as needed.
Because `reload_dataloaders_every_n_epochs` is set, all DataLoaders are reloaded at every epoch. However, they are ALSO reloaded at each validation interval, which typically means they are reloaded (and reshuffled) far more often than once per epoch. While this might be okay with static datasets, it's really bad with dynamic/streaming datasets. Because dynamic datasets always start at the beginning of a shard, and because there are often just 1-5 shards per dataset, we land in a situation where we just keep resetting back to the exact same place at each epoch/val_interval. Worse, we end up comparing the same batches to each other.
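The reset behavior described above can be reproduced with a toy example. This is a minimal sketch, not the project's code: `ShardStream`, `ResumableShardStream`, and `first_batch` are hypothetical names, and small integer lists stand in for real shards. The resumable variant is just one conceivable mitigation (remembering how many samples were consumed), shown for contrast.

```python
from itertools import islice


class ShardStream:
    """Toy streaming dataset: every new iterator starts at shard 0."""

    def __init__(self, shards):
        self.shards = shards

    def __iter__(self):
        for shard in self.shards:
            yield from shard


class ResumableShardStream(ShardStream):
    """Hypothetical variant that remembers how many samples were consumed."""

    def __init__(self, shards):
        super().__init__(shards)
        self.consumed = 0

    def __iter__(self):
        # Skip what earlier iterators already yielded. Note that islice
        # still reads through the skipped samples; it only hides them.
        start = self.consumed
        for sample in islice(super().__iter__(), start, None):
            self.consumed += 1
            yield sample


def first_batch(stream, batch_size):
    # Simulates a reload: build a fresh iterator and take one batch.
    return list(islice(iter(stream), batch_size))


shards = [[0, 1, 2, 3], [4, 5, 6, 7]]

naive = ShardStream(shards)
naive_batches = [first_batch(naive, 2) for _ in range(3)]
# -> [[0, 1], [0, 1], [0, 1]]: every reload lands on the same samples

resumable = ResumableShardStream(shards)
resumed_batches = [first_batch(resumable, 2) for _ in range(3)]
# -> [[0, 1], [2, 3], [4, 5]]: each reload continues where the last stopped
```

With only a handful of shards, the naive stream never gets past the first few samples if reloads happen more often than a shard is exhausted, which matches the behavior seen at each epoch/val_interval.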