AnswerDotAI / bert24


Add dataset without streaming #85

Closed: ohallstrom closed this pull request 3 months ago

ohallstrom commented 4 months ago

Changes

This PR enables using data in MDS format (the mosaic-streaming format) stored locally without going through StreamingTextDataset. The new class NoStreamingDataset is slimmer, enables higher throughput, and avoids shortcomings of StreamingTextDataset such as uneven memory allocation across GPUs (described in this issue) as well as shared memory issues.
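For illustration, a minimal non-streaming, map-style reader over local MDS shards could look like the sketch below. This is not the PR's actual implementation: it assumes the shards are already decompressed, leans on `LocalDataset` from the `mosaicml-streaming` package for shard/index handling, and the tokenization details are placeholders.

```python
# Hedged sketch only -- not the NoStreamingDataset from this PR.
# Assumes decompressed local MDS shards and mosaicml-streaming's LocalDataset.
from typing import Any, Dict

from streaming import LocalDataset
from transformers import AutoTokenizer


class NoStreamingDatasetSketch(LocalDataset):
    """Map-style dataset over local MDS shards: no shard downloads and no
    shared-memory coordination between ranks, so per-GPU memory stays flat."""

    def __init__(self, local: str, split: str, tokenizer_name: str, max_seq_len: int):
        super().__init__(local=local, split=split)
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
        self.max_seq_len = max_seq_len

    def __getitem__(self, idx: int) -> Dict[str, Any]:
        sample = super().__getitem__(idx)  # decoded sample dict from the shard
        if "text" in sample:  # raw text: tokenize on the fly
            return self.tokenizer(
                sample["text"],
                truncation=True,
                padding="max_length",
                max_length=self.max_seq_len,
            )
        return sample  # pre-tokenized data is passed through unchanged
```

Since this is a plain map-style dataset, the data loader can shard indices across ranks with a standard `torch.utils.data.DistributedSampler`, which is also why the sample ordering differs from StreamingTextDataset's shard-based sampling.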

To use NoStreamingDataset, set streaming to false in the YAML config (see example below):

```yaml
train_loader:
  name: text
  dataset:
    streaming: false
```

Tests

I have reproduced very similar loss curves with and without streaming, for both text data and pre-tokenized data, in a multi-GPU setup using the flex-bert-base.yaml config. Because the data loader samples differently with and without streaming, I could not get the exact same data ordering in my comparisons; together with the non-determinism of flash attention, this is why the loss curves were not fully identical (though still very similar). With the flex-bert-base.yaml config, I get ~20% higher throughput without streaming for text data, and ~25% higher throughput without streaming for pre-tokenized data.

I have not had time to adapt any other tests for this PR, and I will be unavailable for the next 9 days. I might be able to respond to comments on this PR, but if new commits or additional tests turn out to be necessary, I won't be able to add them during this time. So feel free to contribute additional commits while I am away :)

NohTow commented 3 months ago

For the summary: the code works if the decompressed data is already there. Instead of decompressing on the fly and dealing with locks and other issues (which could hurt performance anyway), we decided to keep this behavior of only working with decompressed files, and thus require pre-decompressing everything before training. Oskar added a proper warning and the script needed to decompress the datasets.
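For reference, a pre-decompression pass over a dataset directory might look like the sketch below. The shard naming (`*.mds.zstd`) and the use of zstd are assumptions; the actual script in the repo may differ.

```python
# Hypothetical pre-decompression pass; the repo's actual script may differ.
# Assumes MDS shards compressed with zstd, named like shard.00000.mds.zstd.
import pathlib

import zstandard  # pip install zstandard


def decompress_mds_dir(data_dir: str) -> None:
    dctx = zstandard.ZstdDecompressor()
    for src in pathlib.Path(data_dir).rglob("*.mds.zstd"):
        dst = src.with_suffix("")  # shard.00000.mds.zstd -> shard.00000.mds
        if dst.exists():
            continue  # already decompressed, skip
        with src.open("rb") as fin, dst.open("wb") as fout:
            dctx.copy_stream(fin, fout)


decompress_mds_dir("path/to/local/dataset")
```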

Otherwise, LGTM. I am running a larger-scale sanity run overnight, but it should be good to merge (and won't hurt anything in the meantime anyway).