evgeniishch opened 6 months ago
Hi @evgeniishch,
Thank you for reaching out. Answering your questions:
How is reading from multiple sources performed? Are all sources read sequentially one after another?
Setting aside sharding (where each pipeline is assigned a separate, non-overlapping shard of the data), reading within each pipeline is done sequentially, one source after another.
What happens when random_shuffle parameter is set to True? Are samples drawn to buffer from one source or from all sources with some distribution?
DALI uses an internal buffer of fixed size (the `initial_fill` parameter) into which data is read sequentially; when a batch is created, it is randomly sampled from this buffer. Data sets stored in containers (RecordIO, TFRecord, or webdataset) are expected to be pre-shuffled to avoid grouping samples belonging to one class; otherwise the first batch may represent only a small fraction of the classes in the whole dataset.
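The buffer mechanism above can be sketched in plain Python. This is a simplified emulation, not DALI's actual implementation; the `initial_fill` and `batch_size` values are illustrative:

```python
import random

def shuffled_batches(samples, initial_fill=8, batch_size=4, seed=0):
    """Emulate DALI's shuffling buffer: fill a fixed-size buffer
    sequentially from the stream, then build each batch by sampling
    the buffer at random, refilling each drawn slot sequentially."""
    rng = random.Random(seed)
    it = iter(samples)
    buf = []
    for s in it:                      # sequential initial fill
        buf.append(s)
        if len(buf) >= initial_fill:
            break
    while buf:
        batch = []
        for _ in range(batch_size):
            if not buf:
                break
            i = rng.randrange(len(buf))
            batch.append(buf[i])
            try:                      # refill the slot from the stream
                buf[i] = next(it)
            except StopIteration:     # stream exhausted: drain the buffer
                buf.pop(i)
        yield batch

batches = list(shuffled_batches(range(20)))
```

Because the buffer is filled sequentially, the first batch can only contain samples from near the start of the stream, which is why pre-shuffled containers are recommended.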
@JanuszL Hi, if there are multiple GPUs to train on, the data is expected to be split into multiple parts. Will the split parts come from the same internal buffer (`initial_fill`) shared across the GPUs, or does each GPU maintain its own internal buffer?
Hi @CoinCheung,
Each DALI pipeline keeps its own internal shuffling buffer. The data split happens at the shard level, where each DALI pipeline should be assigned a separate shard. This is achieved via the `shard_id` and `num_shards` arguments.
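A minimal sketch of how such sample-level sharding can work. The boundary formula below is an assumption for illustration; DALI's internal details may differ:

```python
def shard_range(num_samples, shard_id, num_shards):
    """Return the [begin, end) sample indices for one shard,
    splitting the dataset as evenly as possible."""
    begin = num_samples * shard_id // num_shards
    end = num_samples * (shard_id + 1) // num_shards
    return begin, end

# Two pipelines (GPUs) over 1000 indexed samples:
print(shard_range(1000, shard_id=0, num_shards=2))  # (0, 500)
print(shard_range(1000, shard_id=1, num_shards=2))  # (500, 1000)
```

The key property is that the shards are disjoint and together cover the whole dataset, so each pipeline shuffles only within its own slice.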
@JanuszL Hi, what do you mean by "The data split happens at the shard level"? If I have 10 tar files named file1.tar, file2.tar, ..., and I have two GPUs, will gpu1 read only from file1.tar, file3.tar, file5.tar, file7.tar, file9.tar, while gpu2 reads only from file2.tar, file4.tar, file6.tar, file8.tar, file10.tar?
Or will both GPUs read from the same tar file, up to the number of samples limited by `initial_fill`?
By the way, if we set `initial_fill=4096` and we have 8 GPUs, will each GPU maintain an internal buffer of size 4096, or 512 (which is 4096 / 8, divided by the number of GPUs)?
Hi, what do you mean by "The data split happens at the shard level"? If I have 10 tar files, whose names are file1.tar, file2.tar, ..., and I have two GPUs, will gpu1 only read from file1.tar, file3.tar, file5.tar, file7.tar, file9.tar, while gpu2 only reads from file2.tar, file4.tar, file6.tar, file8.tar, file10.tar?
The readers first index the files and calculate the number of samples (as the files don't have to contain the same number of samples), and then split the data between GPUs equally. So if the files contain equal numbers of samples, the first GPU gets files 1-5 and the second gets files 6-10.
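The indexing-then-splitting step described above can be sketched like this. `files_for_shard` is a hypothetical helper, assuming per-file sample counts obtained from the index; it is not DALI's actual code:

```python
def files_for_shard(samples_per_file, shard_id, num_shards):
    """Given per-file sample counts, return which files (and which
    sample offsets within them) one shard reads after an equal split.
    Each entry is (file_index, begin_offset, end_offset)."""
    total = sum(samples_per_file)
    begin = total * shard_id // num_shards        # global shard start
    end = total * (shard_id + 1) // num_shards    # global shard end
    out, offset = [], 0
    for file_idx, n in enumerate(samples_per_file):
        lo, hi = max(begin, offset), min(end, offset + n)
        if lo < hi:                               # file overlaps this shard
            out.append((file_idx, lo - offset, hi - offset))
        offset += n
    return out

# 10 files of 100 samples each, 2 GPUs:
# GPU 0 reads files 0-4 in full, GPU 1 reads files 5-9 in full.
print(files_for_shard([100] * 10, shard_id=0, num_shards=2))
```

With unequal file sizes the same logic still yields an equal split by sample count; a shard boundary can then fall inside a file, so one file may be read partially by two shards.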
By the way, if we set initial_fill=4096 and we have 8 GPUs, will each GPU maintain an internal buffer of size 4096, or 512 (which is 4096 / 8, divided by the number of GPUs)?
The size is per pipeline, so each DALI instance (GPU) will have its own 4096-sample buffer for shuffling.
Describe the question.
nvidia.dali.fn.readers.webdataset supports reading from multiple tar files, specified as a list of paths.
How is reading from multiple sources performed? Are all sources read sequentially one after another? What happens when the `random_shuffle` parameter is set to `True`? Are samples drawn to the buffer from one source or from all sources with some distribution? Thank you