sethiay opened this issue 8 months ago
Hey, just want to check if someone got a chance to look into this?
@sethiay I will try to look into this by the end of this week (probably on Saturday :) )
@sethiay The code was fine. The configuration handling was incorrect: we were not converting the shuffle value from the conf file into the enum.
For your question, DLIO currently uses random sampling from PyTorch, which generates a random order over the index range. In general, files should not be repeated as long as the number of files is divisible by `num_ranks * num_workers * batch_size`. It seems the default random sampling approach doesn't shard the data across GPUs; the recommended way is to write your own sampler. #154 shows this implementation. Can you check if this PR solves your issue?
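A minimal pure-Python sketch (not DLIO's actual code) of why an unsharded random sampler repeats data across ranks: if every rank shuffles the same full index list, each GPU ends up reading every sample.

```python
import random

def per_rank_indices_unsharded(num_samples, num_ranks, seed=0):
    # Each rank builds its own full permutation (what a plain,
    # unsharded RandomSampler-style approach amounts to).
    out = {}
    for rank in range(num_ranks):
        rng = random.Random(seed)  # every rank uses the same seed
        indices = list(range(num_samples))
        rng.shuffle(indices)
        out[rank] = indices
    return out

ranks = per_rank_indices_unsharded(num_samples=8, num_ranks=2)
# Both ranks iterate an identical full index list, so every sample
# is read once per rank -> duplicates across GPUs.
print(ranks[0] == ranks[1])  # True
```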
For your second point, that was the configuration bug. I fixed that in #154.
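For context on the configuration bug, here is a hedged sketch of the string-to-enum conversion that was missing; the class and function names below are illustrative stand-ins, not DLIO's actual code.

```python
from enum import Enum

class Shuffle(Enum):  # illustrative stand-in for DLIO's Shuffle enum
    OFF = "off"
    SEED = "seed"
    RANDOM = "random"

def parse_shuffle(conf_value):
    # Convert the raw string from the conf file into the enum before
    # any comparison against Shuffle members is made.
    return Shuffle(str(conf_value).lower())

print(parse_shuffle("off") == Shuffle.OFF)  # True
print("off" == Shuffle.OFF)  # False: the shape of the original bug
```

Without the conversion, the raw string `"off"` never compares equal to `Shuffle.OFF`, so a check like `sample_shuffle != Shuffle.OFF` stays true even when shuffling is disabled.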
Thank you @hariharan-devarajan for the fixes !
> Can you check if this PR solves your issue?
I tried it and I can still see some repeated files being read (though the number of repetitions has reduced: earlier I could see 7-8, now 2). Also, I couldn't understand why we are multiplying by `self.epochs` in `end = ((self.rank + 1) * samples_per_gpu) * self.epochs`. IIUC, the sampler should work without this multiplication?
The sampler needs to keep generating numbers until a certain point. As it's our own sampler, if we don't take epochs into consideration it will just iterate over the dataset once and stop. I tried it without the epoch factor and it stopped after one epoch of data.
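The epoch rationale can be sketched in plain Python (an illustrative generator, not DLIO's actual sampler): without looping over epochs, the generator is exhausted after one pass.

```python
def epoch_looping_sampler(indices, epochs):
    # Sketch: a sampler that keeps yielding indices for a fixed number
    # of epochs instead of stopping after one pass over the dataset.
    for _ in range(epochs):
        for idx in indices:
            yield idx

print(len(list(epoch_looping_sampler(range(4), epochs=3))))  # 12
```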
Also, it's surprising to me that you saw only 7-8 repeats. What I saw was that the PyTorch sampler was not sharding the dataset: within an epoch it was repeating samples across all MPI processes. The PyTorch workers were unique, but the samples read per GPU were exactly the same.
If your total number of samples is a multiple of `# workers * # GPUs`, you should not see any repeats. I tested it with your 5000-file case with 2 GPUs and 10 workers, then with 10 GPUs and 2 workers. In both cases I didn't see any repeats for 1 epoch.
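The divisibility condition for the two tested configurations can be checked directly:

```python
# Divisibility check for the two configurations mentioned above:
# 5000 samples, and num_gpus * num_workers = 20 in both setups.
num_samples = 5000
configs = [(2, 10), (10, 2)]  # (num_gpus, num_workers)
results = [num_samples % (gpus * workers) == 0 for gpus, workers in configs]
print(results)  # [True, True]
```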
IIUC, the DataLoader is recreated in every epoch (and so is the sampler), and `class dlio_sampler(Sampler)` is not a batch sampler, i.e. it returns one value per `next(dlio_sampler)`. I think the below

```python
samples_per_gpu = self.num_samples // self.size
start = self.rank * samples_per_gpu
end = ((self.rank + 1) * samples_per_gpu) * self.epochs
for i in range(start, end):
    yield indices[i % self.num_samples]
```

should be something like

```python
samples_per_gpu = self.num_samples // self.num_gpus
start = self.rank * samples_per_gpu
end = min(self.num_samples, (self.rank + 1) * samples_per_gpu)
for i in range(start, end):
    yield indices[i]
```
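As a standalone sketch (assuming the suggested variant above, wrapped in a plain function for illustration), this sharding yields contiguous, disjoint index slices per rank:

```python
def shard_indices(indices, rank, num_gpus):
    # Contiguous, disjoint slice of the (already shuffled) index list
    # for one rank, with no epoch multiplier.
    num_samples = len(indices)
    samples_per_gpu = num_samples // num_gpus
    start = rank * samples_per_gpu
    end = min(num_samples, (rank + 1) * samples_per_gpu)
    return [indices[i] for i in range(start, end)]

indices = list(range(10))
shards = [shard_indices(indices, rank, num_gpus=2) for rank in range(2)]
print(shards[0])  # [0, 1, 2, 3, 4]
print(shards[1])  # [5, 6, 7, 8, 9]
```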
> If your total number of samples is a multiple of `# workers * # GPUs`, you should not see any repeats. I tested it with your 5000-file case with 2 GPUs and 10 workers, then with 10 GPUs and 2 workers. In both cases I didn't see any repeats for 1 epoch.
If my understanding is correct, the dataset is divided by the number of GPUs, i.e. if there are 5000 samples and 5 GPUs, then each GPU will get 1000 samples. Now, if `batch_size` and `num_workers` are set to, say, 2 and 10, then each GPU will have 10 threads/processes running, where each thread/process will continuously read samples in sets of 2 (internally, each worker will call `next(sampler)` two times to get 2 indices and then call `dataset.__getitem__(index)` for the two indices to get the two samples of a batch).
> set of 2

They don't read a set of 2; they return one item each, and it's then batched by the Torch DataLoader.
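A pure-Python sketch of that batching behavior (this mimics, and is not, the actual DataLoader internals): the sampler yields one index at a time, and the loader groups indices into batches itself.

```python
def batch_from_sampler(sampler, batch_size):
    # Mimics what the Torch DataLoader does with a (non-batch) Sampler:
    # pull one index at a time and group indices into batches itself.
    batch = []
    for idx in sampler:
        batch.append(idx)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # last partial batch, if any
        yield batch

print(list(batch_from_sampler(iter(range(6)), batch_size=2)))
# [[0, 1], [2, 3], [4, 5]]
```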
Okay! Even then, shouldn't `samples_per_gpu = self.num_samples // self.size` ideally be computed as `samples_per_gpu = self.num_samples // self.num_gpus`? I am not an expert in PyTorch, so please correct my understanding if required.
`size` here is the `comm_size`, which is the number of GPUs.
We have a recent fix #160 which might also solve this issue.
Hey,
I am using the below command to do an I/O benchmark:

I am facing two issues:

1. `sample_shuffle` is off but `seed` …
2. I set `sample_shuffle` to off as mentioned in the command above, but then found that the check `self._args.sample_shuffle != Shuffle.OFF` here comes out to be `true`. I believe this check should be `false`.

Request you to look into the above issues and let me know if you need more info. Thanks!