Closed lzl-mt closed 3 months ago
This problem should be fixed in the newest code; the content of `vars(train_dataloader)` should now look like this:

```
{'dataset': <speech_dataset.py.SpeechDatasetJsonl object at 0x7f9ed8dc3f70>,
 'num_workers': 4, 'prefetch_factor': 2, 'pin_memory': True, 'pin_memory_device': '',
 'timeout': 0, 'worker_init_fn': None, '_DataLoader__multiprocessing_context': None,
 '_dataset_kind': 0, 'batch_size': 4, 'drop_last': True,
 'sampler': <torch.utils.data.distributed.DistributedSampler object at 0x7f9f388d73d0>,
 'batch_sampler': <torch.utils.data.sampler.BatchSampler object at 0x7f9f388d7460>,
 'generator': None,
 'collate_fn': <bound method SpeechDatasetJsonl.collator of <speech_dataset.py.SpeechDatasetJsonl object at 0x7f9ed8dc3f70>>,
 'persistent_workers': False, '_DataLoader__initialized': True,
 '_IterableDataset_len_called': None, '_iterator': None}
```
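As a quick sanity check after the update, one could verify that the dataloader actually carries a `DistributedSampler` before training starts. The helper below is a hypothetical sketch (not part of the repo); it inspects `vars()` just like the output above and matches by class name so it stays independent of torch imports:

```python
def has_distributed_sampler(dataloader):
    """Return True if the dataloader's 'sampler' entry is a DistributedSampler.

    Checks by class name rather than isinstance() so the snippet does not
    itself need a torch import.
    """
    sampler = vars(dataloader).get("sampler")
    return sampler is not None and type(sampler).__name__ == "DistributedSampler"
```

For example, `assert has_distributed_sampler(train_dataloader)` right after building the loader would have caught the original bug early.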
Thanks!
System Info
torch 2.0.1
torchaudio 2.0.2
torchvision 0.15.2
Information
🐛 Describe the bug
When using DeepSpeed training, compared with DDP training under the same configuration, the total number of steps per training epoch increases by a factor of N (where N is the number of GPUs). Printing the relevant dataloader configuration shows that no sampler is set.

DDP:

```
{'sampler': <torch.utils.data.distributed.DistributedSampler object at 0x7fc99032c640>, 'batch_size': 6, 'drop_last': True, 'collate_fn': <bound method SpeechDatasetJsonl.collator of <speech_dataset.py.SpeechDatasetJsonl object at 0x7fc275f34130>>}
```

DeepSpeed:

```
{'batch_size': 6, 'drop_last': True, 'collate_fn': <bound method SpeechDatasetJsonl.collator of <speech_dataset.py.SpeechDatasetJsonl object at 0x7fbee2e324c0>>}
```

This likely means that every GPU reads exactly the same data.
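The N-fold step count follows directly from the missing sampler: a `DistributedSampler` gives each rank roughly `len(dataset) / world_size` samples, while without one each rank iterates the full dataset. A minimal pure-Python sketch of this arithmetic (the dataset size of 960 is an assumption chosen for illustration, not from the report):

```python
def steps_per_epoch(dataset_len, batch_size, world_size, use_distributed_sampler):
    """Approximate optimizer steps each rank performs in one epoch (drop_last=True).

    With a DistributedSampler the dataset is sharded across ranks; without one,
    every rank walks the entire dataset, so steps scale up by world_size.
    """
    per_rank = dataset_len // world_size if use_distributed_sampler else dataset_len
    return per_rank // batch_size

# Example: a hypothetical 960-sample dataset, batch size 6, 4 GPUs.
print(steps_per_epoch(960, 6, 4, use_distributed_sampler=True))   # 40 steps per rank
print(steps_per_epoch(960, 6, 4, use_distributed_sampler=False))  # 160 steps per rank
```

The 4x gap between 40 and 160 matches the reported behavior: the step count grows by exactly the number of GPUs.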
Error logs
the same as above
Expected behavior
Thank you for your outstanding work. I hope this problem can be fixed, and that you can also share the time required for DeepSpeed and DDP to train one epoch with the default Librispeech configuration. Thanks a lot! :D