X-LANCE / SLAM-LLM

Speech, Language, Audio, Music Processing with Large Language Model

Deepspeed training dataset does not have sampler #93

Closed lzl-mt closed 3 months ago

lzl-mt commented 4 months ago

System Info

torch 2.0.1, torchaudio 2.0.2, torchvision 0.15.2

Information

🐛 Describe the bug

When training with DeepSpeed, the total number of steps per epoch is N times larger than with DDP under the same configuration (N being the number of GPUs). Printing the dataloader configuration shows that the DeepSpeed run has no sampler:

DDP: {'sampler': <torch.utils.data.distributed.DistributedSampler object at 0x7fc99032c640>, 'batch_size': 6, 'drop_last': True, 'collate_fn': <bound method SpeechDatasetJsonl.collator of <speech_dataset.py.SpeechDatasetJsonl object at 0x7fc275f34130>>}

Deepspeed: {'batch_size': 6, 'drop_last': True, 'collate_fn': <bound method SpeechDatasetJsonl.collator of <speech_dataset.py.SpeechDatasetJsonl object at 0x7fbee2e324c0>>}

Without a DistributedSampler, every GPU likely reads exactly the same data.
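For context, a minimal sketch of how a distributed dataloader is usually built in PyTorch. The names below (`build_train_dataloader`, `dataset`, `collate_fn`) are hypothetical placeholders for SpeechDatasetJsonl and its collator, not the repo's actual code; constructor arguments are omitted because they depend on the project's config.

```python
import torch.distributed as dist
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def build_train_dataloader(dataset, collate_fn, batch_size=6):
    # Hypothetical helper illustrating the expected setup.
    sampler = None
    if dist.is_available() and dist.is_initialized():
        # Partitions the dataset across ranks so each GPU sees a disjoint shard.
        sampler = DistributedSampler(dataset, shuffle=True, drop_last=True)
    return DataLoader(
        dataset,
        batch_size=batch_size,
        sampler=sampler,            # without this, every rank iterates the full dataset
        shuffle=(sampler is None),  # shuffle and sampler are mutually exclusive
        drop_last=True,
        collate_fn=collate_fn,
    )
```

With the sampler in place, the number of steps per epoch is roughly len(dataset) / (batch_size * world_size), which is why its absence inflates the step count by a factor of N.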

Error logs

Same as above.

Expected behavior

Thank you for your outstanding work. I hope this problem can be fixed. Could you also share the time required for DeepSpeed and DDP to train one epoch with the default Librispeech configuration? Thanks a lot! :D

zzasdf commented 4 months ago

This problem should be fixed in the newest code; the content of vars(train_dataloader) should now look like this:

{'dataset': <speech_dataset.py.SpeechDatasetJsonl object at 0x7f9ed8dc3f70>, 'num_workers': 4, 'prefetch_factor': 2, 'pin_memory': True, 'pin_memory_device': '', 'timeout': 0, 'worker_init_fn': None, '_DataLoader__multiprocessing_context': None, '_dataset_kind': 0, 'batch_size': 4, 'drop_last': True, 'sampler': <torch.utils.data.distributed.DistributedSampler object at 0x7f9f388d73d0>, 'batch_sampler': <torch.utils.data.sampler.BatchSampler object at 0x7f9f388d7460>, 'generator': None, 'collate_fn': <bound method SpeechDatasetJsonl.collator of <speech_dataset.py.SpeechDatasetJsonl object at 0x7f9ed8dc3f70>>, 'persistent_workers': False, '_DataLoader__initialized': True, '_IterableDataset_len_called': None, '_iterator': None}
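As a quick sanity check on an updated setup, one might verify the sampler and per-epoch step count along these lines. This is a hedged sketch, not code from the repo: `train_dataloader`, `num_epochs`, and the training body are placeholders, and it assumes torch.distributed is already initialized.

```python
import torch.distributed as dist
from torch.utils.data.distributed import DistributedSampler

# Confirm the loader is actually sharding data across ranks.
assert isinstance(train_dataloader.sampler, DistributedSampler)

world_size = dist.get_world_size()
# With a DistributedSampler, len(train_dataloader) should roughly equal
# len(dataset) / (batch_size * world_size), matching the DDP step count.
print(len(train_dataloader), "steps per epoch on", world_size, "ranks")

for epoch in range(num_epochs):
    # Re-seed the sampler so each epoch uses a different shuffling order.
    train_dataloader.sampler.set_epoch(epoch)
    for batch in train_dataloader:
        ...  # forward/backward as usual
```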
lzl-mt commented 3 months ago


Thanks!