NVIDIA / DALI

A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.
https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html
Apache License 2.0

Webdataset reader behavior with many sources #5429

Open evgeniishch opened 6 months ago

evgeniishch commented 6 months ago

Describe the question.

nvidia.dali.fn.readers.webdataset supports reading from multiple tar files, specified as a list of paths.

How is reading from multiple sources performed? Are all sources read sequentially, one after another? What happens when the random_shuffle parameter is set to True? Are samples drawn into the buffer from one source, or from all sources with some distribution?

Thank you


JanuszL commented 6 months ago

Hi @evgeniishch,

Thank you for reaching out. Answering your questions:

How is reading from multiple sources performed? Are all sources read sequentially, one after another?

Abstracting away sharding (where each pipeline is assigned a separate, non-overlapping shard of data), reading is done sequentially within each pipeline.

What happens when the random_shuffle parameter is set to True? Are samples drawn into the buffer from one source, or from all sources with some distribution?

DALI uses an internal buffer of fixed size (the initial_fill parameter) into which data is read sequentially; when a batch is created, it is randomly sampled from this buffer. Datasets stored in containers (RecordIO, TFRecord, or webdataset) are expected to be pre-shuffled, to avoid grouping samples that belong to one class; otherwise the first batches may represent only a small fraction of the classes present in the whole dataset.
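A minimal sketch of this behavior with the webdataset reader, assuming two hypothetical tar files and placeholder extensions (paths and parameter values here are illustrative, not from the original discussion):

```python
import nvidia.dali.fn as fn
from nvidia.dali import pipeline_def

@pipeline_def(batch_size=32, num_threads=4, device_id=0)
def wds_pipe():
    # The reader fills an internal buffer of `initial_fill` samples sequentially
    # and, with random_shuffle=True, draws each batch randomly from that buffer.
    jpeg, cls = fn.readers.webdataset(
        paths=["shard_000.tar", "shard_001.tar"],  # hypothetical tar files
        ext=["jpg", "cls"],
        random_shuffle=True,
        initial_fill=1024,
    )
    images = fn.decoders.image(jpeg, device="mixed")
    return images, cls

pipe = wds_pipe()
pipe.build()
images, labels = pipe.run()
```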

CoinCheung commented 1 month ago

@JanuszL Hi, if there are multiple GPUs to train on, the data is expected to be split into multiple parts. Will the split parts come from the same internal buffer (initial_fill) shared across the GPUs, or does each GPU maintain its own internal buffer?

JanuszL commented 1 month ago

Hi @CoinCheung,

Each DALI pipeline keeps its own internal shuffling buffer. The data split happens at the shard level, where each DALI pipeline should be assigned to a separate shard. This is achieved via the shard_id and num_shards arguments.
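A sketch of a typical multi-GPU setup under that scheme, assuming one pipeline per GPU; the tar file names and batch size are placeholders:

```python
import nvidia.dali.fn as fn
from nvidia.dali import pipeline_def

tar_files = [f"file{i}.tar" for i in range(1, 11)]  # hypothetical file names
num_gpus = 2

@pipeline_def(batch_size=256, num_threads=4)
def wds_pipe(shard_id, num_shards):
    jpeg, cls = fn.readers.webdataset(
        paths=tar_files,
        ext=["jpg", "cls"],
        random_shuffle=True,
        initial_fill=4096,
        shard_id=shard_id,      # which shard this pipeline reads
        num_shards=num_shards,  # total number of shards (here, one per GPU)
    )
    return jpeg, cls

# One pipeline per GPU: each gets its own shard and its own shuffling buffer.
pipes = [wds_pipe(shard_id=i, num_shards=num_gpus, device_id=i)
         for i in range(num_gpus)]
```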

CoinCheung commented 1 month ago

@JanuszL Hi, what do you mean by "The data split happens at the shard level"? If I have 10 tar files named file1.tar, file2.tar, ..., and I have two GPUs, will GPU 1 read only from file1.tar, file3.tar, file5.tar, file7.tar, and file9.tar, while GPU 2 reads only from file2.tar, file4.tar, file6.tar, file8.tar, and file10.tar?

Or will both GPUs read from the same tar file, with the number of samples limited by initial_fill?

By the way, if we set initial_fill=4096 and have 8 GPUs, will each GPU maintain an internal buffer of size 4096, or 512 (4096 / 8, divided by the number of GPUs)?

JanuszL commented 1 month ago

Hi, what do you mean by "The data split happens at the shard level"? If I have 10 tar files named file1.tar, file2.tar, ..., and I have two GPUs, will GPU 1 read only from file1.tar, file3.tar, file5.tar, file7.tar, and file9.tar, while GPU 2 reads only from file2.tar, file4.tar, file6.tar, file8.tar, and file10.tar?

The readers first index the files and calculate the number of samples (as the files don't have to contain the same number of samples), and then split the data equally between GPUs. So if the files have equal numbers of samples, the first GPU gets files 1-5 and the second gets files 6-10.
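A rough illustration of that split (not DALI's actual code), assuming shard boundaries are computed over the total number of indexed samples rather than over whole files:

```python
def shard_bounds(total_samples, shard_id, num_shards):
    # Hypothetical helper: contiguous, (almost) equal sample ranges per shard.
    begin = total_samples * shard_id // num_shards
    end = total_samples * (shard_id + 1) // num_shards
    return begin, end

# 10 tar files with 100 samples each -> 1000 samples in total, 2 GPUs:
print(shard_bounds(1000, 0, 2))  # (0, 500)    roughly file1.tar .. file5.tar
print(shard_bounds(1000, 1, 2))  # (500, 1000) roughly file6.tar .. file10.tar
```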

By the way, if we set initial_fill=4096 and have 8 GPUs, will each GPU maintain an internal buffer of size 4096, or 512 (4096 / 8, divided by the number of GPUs)?

The size is per pipeline, so each DALI instance (GPU) will have a 4096-sample buffer for shuffling.