sethiay opened this issue 8 months ago
Hey, just want to check if someone got a chance to look into this?
@sethiay I will try to look into this by the end of this week (probably on Saturday :) )
@sethiay The code was fine. The configuration handling was incorrect: we were not converting the shuffle value from the conf file into the enum.
For your question, DLIO currently uses random sampling from PyTorch, which generates a random order over the index range. In general, files should not be repeated as long as the number of files is divisible by `num_ranks * num_workers * batch_size`. It seems the default random sampling approach doesn't shard the data across GPUs; the recommended way is to write your own sampler. #154 shows this implementation. Can you check if this PR solves your issue?
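A minimal pure-Python sketch (not DLIO's actual code) of why an unsharded random sampler repeats data across ranks: if every rank shuffles the same full index list, each GPU ends up reading every sample.

```python
import random

def per_rank_indices_unsharded(num_samples, num_ranks, seed=0):
    # Each rank builds its own full permutation (what a plain,
    # unsharded RandomSampler-style approach amounts to).
    out = {}
    for rank in range(num_ranks):
        rng = random.Random(seed)  # every rank uses the same seed
        indices = list(range(num_samples))
        rng.shuffle(indices)
        out[rank] = indices
    return out

ranks = per_rank_indices_unsharded(num_samples=8, num_ranks=2)
# Both ranks iterate an identical full index list, so every sample
# is read once per rank -> duplicates across GPUs.
print(ranks[0] == ranks[1])  # True
```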
For your second point, that was the configuration bug. I fixed that in #154.
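For context on the configuration bug, here is a hedged sketch of the string-to-enum conversion that was missing; the class and function names below are illustrative stand-ins, not DLIO's actual code.

```python
from enum import Enum

class Shuffle(Enum):  # illustrative stand-in for DLIO's Shuffle enum
    OFF = "off"
    SEED = "seed"
    RANDOM = "random"

def parse_shuffle(conf_value):
    # Convert the raw string from the conf file into the enum before
    # any comparison against Shuffle members is made.
    return Shuffle(str(conf_value).lower())

print(parse_shuffle("off") == Shuffle.OFF)  # True
print("off" == Shuffle.OFF)  # False: the shape of the original bug
```

Without the conversion, the raw string `"off"` never compares equal to `Shuffle.OFF`, so a check like `sample_shuffle != Shuffle.OFF` stays true even when shuffling is disabled.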
Thank you @hariharan-devarajan for the fixes !
> Can you check if this PR solves your issue?
I tried it and I can still see some repeated files being read (though the number of repetitions has reduced: earlier I could see 7-8, now 2). Also, I couldn't understand why we are multiplying by `self.epochs` in `end = ((self.rank + 1) * samples_per_gpu) * self.epochs`. IIUC, the sampler should work without this multiplication?
The sampler needs to keep generating numbers until a certain point. As it's our own sampler, if we don't take epochs into consideration it will just iterate over the dataset once and stop. I tried it without the epoch factor and it stopped after one epoch of data.
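The epoch rationale can be sketched in plain Python (an illustrative generator, not DLIO's actual sampler): without looping over epochs, the generator is exhausted after one pass.

```python
def epoch_looping_sampler(indices, epochs):
    # Sketch: a sampler that keeps yielding indices for a fixed number
    # of epochs instead of stopping after one pass over the dataset.
    for _ in range(epochs):
        for idx in indices:
            yield idx

print(len(list(epoch_looping_sampler(range(4), epochs=3))))  # 12
```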
Also, it's surprising to me that you saw only 7-8 repeats. What I saw was that the PyTorch sampler was not sharding the dataset: within an epoch it was repeating samples across all MPI processes. The PyTorch workers were unique, but the samples read per GPU were exactly the same.
If your total number of samples is a multiple of `# workers * # GPUs`, you should not see any repeats. I tested it with your 5000-file case with 2 GPUs and 10 workers, then with 10 GPUs and 2 workers. In both cases I didn't see any repeats for 1 epoch.
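The divisibility condition for the two tested configurations can be checked directly:

```python
# Divisibility check for the two configurations mentioned above:
# 5000 samples, and num_gpus * num_workers = 20 in both setups.
num_samples = 5000
configs = [(2, 10), (10, 2)]  # (num_gpus, num_workers)
results = [num_samples % (gpus * workers) == 0 for gpus, workers in configs]
print(results)  # [True, True]
```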
IIUC, the DataLoader is recreated in every epoch (and so is the sampler), and `class dlio_sampler(Sampler)` is not a batch sampler, i.e. it returns one value per `next(dlio_sampler)`. I think the below

```python
samples_per_gpu = self.num_samples // self.size
start = self.rank * samples_per_gpu
end = ((self.rank + 1) * samples_per_gpu) * self.epochs
for i in range(start, end):
    yield indices[i % self.num_samples]
```

should be something like

```python
samples_per_gpu = self.num_samples // self.num_gpus
start = self.rank * samples_per_gpu
end = min(self.num_samples, (self.rank + 1) * samples_per_gpu)
for i in range(start, end):
    yield indices[i]
```
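As a standalone sketch (assuming the suggested variant above, wrapped in a plain function for illustration), this sharding yields contiguous, disjoint index slices per rank:

```python
def shard_indices(indices, rank, num_gpus):
    # Contiguous, disjoint slice of the (already shuffled) index list
    # for one rank, with no epoch multiplier.
    num_samples = len(indices)
    samples_per_gpu = num_samples // num_gpus
    start = rank * samples_per_gpu
    end = min(num_samples, (rank + 1) * samples_per_gpu)
    return [indices[i] for i in range(start, end)]

indices = list(range(10))
shards = [shard_indices(indices, rank, num_gpus=2) for rank in range(2)]
print(shards[0])  # [0, 1, 2, 3, 4]
print(shards[1])  # [5, 6, 7, 8, 9]
```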
> If your total number of samples is a multiple of `# workers * # GPUs`, you should not see any repeats. I tested it with your 5000-file case with 2 GPUs and 10 workers, then with 10 GPUs and 2 workers. In both cases I didn't see any repeats for 1 epoch.
If my understanding is correct, the dataset is divided by the number of GPUs, i.e. if there are 5000 samples and 5 GPUs, then each GPU will get 1000 samples. Now, if `batch_size` and `num_workers` are set to, say, 2 and 10, then each GPU will have 10 threads/processes running, where each thread/process will continuously read samples in sets of 2 (internally, each worker will call `next(sampler)` two times to get 2 indices and then call `dataset.__getitem__(index)` for the two indices to get the two samples of a batch).
> set of 2

They don't read a set of 2; they return one item each, and it's then batched by the Torch DataLoader.
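A pure-Python sketch of that batching behavior (this mimics, and is not, the actual DataLoader internals): the sampler yields one index at a time, and the loader groups indices into batches itself.

```python
def batch_from_sampler(sampler, batch_size):
    # Mimics what the Torch DataLoader does with a (non-batch) Sampler:
    # pull one index at a time and group indices into batches itself.
    batch = []
    for idx in sampler:
        batch.append(idx)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # last partial batch, if any
        yield batch

print(list(batch_from_sampler(iter(range(6)), batch_size=2)))
# [[0, 1], [2, 3], [4, 5]]
```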
Okay! Even then, shouldn't `samples_per_gpu = self.num_samples // self.size` ideally be computed as `samples_per_gpu = self.num_samples // self.num_gpus`? I am not an expert in PyTorch, so please correct my understanding if required.
`size` here is the `comm_size`, which is the number of GPUs.
We have a recent fix #160 which might also solve this issue.
Hey,
I am using the below command to do an I/O benchmark:

I am facing two issues:

1. `sample_shuffle` is off but `seed` …
2. I set `sample_shuffle` to off as mentioned in the command above, but then found that the check `self._args.sample_shuffle != Shuffle.OFF` here comes out to be `true`. I believe this check should be `false`.

Request you to look into the above issues and let me know if you need more info. Thanks!