AnswerDotAI / bert24


Add dataset without streaming #85

Closed: ohallstrom closed this pull request 3 months ago

ohallstrom commented 4 months ago

Changes

This PR enables using data in MDS format (the mosaic-streaming format) stored locally without going through StreamingTextDataset. The new class NoStreamingDataset is slimmer, enables higher throughput, and avoids shortcomings of StreamingTextDataset such as uneven memory allocation across GPUs (described in this issue) as well as shared memory issues.
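For illustration, a minimal non-streaming, map-style reader over local MDS shards could look like the sketch below. This is not the PR's actual implementation: it assumes the shards are already decompressed, leans on `LocalDataset` from the `mosaicml-streaming` package for shard/index handling, and the tokenization details are placeholders.

```python
# Hedged sketch only -- not the NoStreamingDataset from this PR.
# Assumes decompressed local MDS shards and mosaicml-streaming's LocalDataset.
from typing import Any, Dict

from streaming import LocalDataset
from transformers import AutoTokenizer


class NoStreamingDatasetSketch(LocalDataset):
    """Map-style dataset over local MDS shards: no shard downloads and no
    shared-memory coordination between ranks, so per-GPU memory stays flat."""

    def __init__(self, local: str, split: str, tokenizer_name: str, max_seq_len: int):
        super().__init__(local=local, split=split)
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
        self.max_seq_len = max_seq_len

    def __getitem__(self, idx: int) -> Dict[str, Any]:
        sample = super().__getitem__(idx)  # decoded sample dict from the shard
        if "text" in sample:  # raw text: tokenize on the fly
            return self.tokenizer(
                sample["text"],
                truncation=True,
                padding="max_length",
                max_length=self.max_seq_len,
            )
        return sample  # pre-tokenized data is passed through unchanged
```

Since this is a plain map-style dataset, the data loader can shard indices across ranks with a standard `torch.utils.data.DistributedSampler`, which is also why the sample ordering differs from StreamingTextDataset's shard-based sampling.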

To use NoStreamingDataset, set streaming to false in the YAML config (see example below):

```yaml
train_loader:
  name: text
  dataset:
    streaming: false
```

Tests

I have reproduced very similar loss curves with and without streaming, for both text data and pre-tokenized data, in a multi-GPU setup using the flex-bert-base.yaml config. Because the data loader samples differently with and without streaming, I could not get the exact same data ordering in my comparisons; together with the non-determinism of flash attention, this is why the loss curves were not fully identical (though still very similar). With the flex-bert-base.yaml config, I get ~20% higher throughput without streaming for text data, and ~25% higher throughput without streaming for pre-tokenized data.

I have not had time to adapt any other tests for this PR, and I will be unavailable for the next 9 days. I might be able to respond to comments on this PR, but if new commits or additional tests turn out to be necessary, I won't be able to add them during this time. So feel free to contribute additional commits while I am away :)

NohTow commented 3 months ago

For the summary: the code works if the decompressed data is already there. Instead of decompressing on the fly and dealing with locks and other issues (which could hurt performance anyway), we decided to keep this behavior of only working with decompressed files, and thus require pre-decompressing everything before training. Oskar added a proper warning and the script needed to decompress the datasets.
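For reference, a pre-decompression pass over a dataset directory might look like the sketch below. The shard naming (`*.mds.zstd`) and the use of zstd are assumptions; the actual script in the repo may differ.

```python
# Hypothetical pre-decompression pass; the repo's actual script may differ.
# Assumes MDS shards compressed with zstd, named like shard.00000.mds.zstd.
import pathlib

import zstandard  # pip install zstandard


def decompress_mds_dir(data_dir: str) -> None:
    dctx = zstandard.ZstdDecompressor()
    for src in pathlib.Path(data_dir).rglob("*.mds.zstd"):
        dst = src.with_suffix("")  # shard.00000.mds.zstd -> shard.00000.mds
        if dst.exists():
            continue  # already decompressed, skip
        with src.open("rb") as fin, dst.open("wb") as fout:
            dctx.copy_stream(fin, fout)


decompress_mds_dir("path/to/local/dataset")
```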

Otherwise, LGTM. I am running a larger-scale sanity run overnight, but it should be good to merge (and won't hurt anything in the meantime anyway).