Modalities / modalities

Modalities, a PyTorch-native framework for distributed and reproducible foundation model training.
MIT License
61 stars 5 forks source link

Dataloader with fixed size #180

Closed le1nux closed 3 months ago

le1nux commented 3 months ago

What does this PR do?

We can now specify the number of batches per rank per dataloader directly in the config. Whenever we don't want to iterate over the entire dataloader, e.g., for benchmarking or smaller models, we can specify fixed_num_batches in the dataloader. Additionally, in the settings, we can add global_num_train_tokens, allowing the specify the global number of training tokens after which the model stops the training. We have implemented a conversion routine, which calculates fixed_num_batches from global_num_train_tokens.

General Changes

Breaking Changes

Checklist before submitting final PR