
DataLoader handling is very flawed #22

Closed Vectorrent closed 1 month ago

Vectorrent commented 7 months ago

Because `reload_dataloaders_every_n_epochs` is set, all DataLoaders are reloaded at every epoch. However, they are ALSO reloaded at each validation interval, which typically means they are reloaded (and reshuffled) far more often than once per epoch.

While this might be acceptable with static datasets, it's really bad with dynamic/streaming datasets. Because dynamic datasets always restart from the beginning of a shard, and because there are often just 1-5 shards per dataset, we keep resetting back to the exact same position at every epoch/validation interval. Worse, we end up comparing the same batches against each other.
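To make the failure mode concrete, here is a minimal, self-contained sketch (not vtx's actual dataset code) of an iterable dataset backed by a few shards; every time the DataLoader is rebuilt, iteration restarts at shard 0:

```python
from torch.utils.data import DataLoader, IterableDataset

class ShardedStream(IterableDataset):
    """Toy stand-in for a streaming dataset with a handful of shards."""

    def __init__(self, shards):
        self.shards = shards  # e.g. a list of lists of samples

    def __iter__(self):
        # Every fresh iterator restarts at shard 0, sample 0. If Lightning
        # rebuilds the DataLoader at each epoch AND at each validation
        # interval, training keeps replaying this same prefix of the data.
        for shard in self.shards:
            yield from shard

loader = DataLoader(ShardedStream([[0, 1, 2], [3, 4, 5]]), batch_size=2)
for _ in range(2):  # two "reloads" produce identical batches both times
    print([batch.tolist() for batch in loader])
```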

Vectorrent commented 7 months ago

For now, I can just set `reload_dataloaders_every_n_epochs=0`, so the DataLoaders are only ever loaded once, at the start of training. This is not a perfect solution: exhausted StaticDatasets are not reshuffled after the first epoch, meaning that batches are always compared against the same data. Ideally we would reshuffle every epoch, so that batches are always compared against random batches.
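For reference, a minimal sketch of the workaround (the other Trainer arguments here are placeholders, and the import path varies with the installed Lightning version):

```python
import pytorch_lightning as pl

trainer = pl.Trainer(
    max_epochs=10,  # placeholder value
    # Never rebuild the DataLoaders after the initial load, so streaming
    # datasets are not reset at every epoch/validation interval. The
    # trade-off described above: datasets are then only shuffled once.
    reload_dataloaders_every_n_epochs=0,
)
```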

Nevertheless, fully exploring the large, supplemental StreamingDatasets is the more important goal today.

Vectorrent commented 1 month ago

This has been largely addressed. I moved all of the disparate loaders into two unified DataModules: 1) LocalDataset and 2) StreamingDataset. We can extend those as needed.
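For illustration, the rough shape such a unified DataModule might take (the class name and arguments below are hypothetical, not the actual vtx classes). Returning a fresh DataLoader with `shuffle=True` reshuffles map-style data every epoch without relying on `reload_dataloaders_every_n_epochs`:

```python
import pytorch_lightning as pl
from torch.utils.data import DataLoader, Dataset

class LocalDataModule(pl.LightningDataModule):
    """Hypothetical sketch of a unified module for local, map-style data."""

    def __init__(self, dataset: Dataset, batch_size: int = 8):
        super().__init__()
        self.dataset = dataset
        self.batch_size = batch_size

    def train_dataloader(self) -> DataLoader:
        # shuffle=True draws a fresh permutation each epoch for map-style
        # datasets, so no DataLoader reload is needed to reshuffle.
        return DataLoader(self.dataset, batch_size=self.batch_size, shuffle=True)
```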