Yeah, this is a common problem I face when training GPT-NeoX without webdataset as well. I usually just try different num_workers values until one works well (sometimes even num_workers=2) while still giving high throughput. But yeah, we should figure out the source of the segfault at some point :)
Description
If we set the num_workers of the dataloader larger than 0, this error sometimes occurs when fetching the data with `batch = next(self.data_iterator)`: https://github.com/floatingsnake/gpt-neox/blob/magma/megatron/data/webdataset.py#L371
This seems to be a bug on the PyTorch side; it might be related to https://github.com/pytorch/pytorch/issues/91245
Solution
Setting num_workers=0 fixes it.
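For reference, a minimal sketch of where this setting applies, assuming a standard torch.utils.data.DataLoader wrapping a WebDataset pipeline (the dataset construction and shard pattern here are illustrative, not the actual code in megatron/data/webdataset.py):

```python
import webdataset as wds
from torch.utils.data import DataLoader

# Illustrative pipeline; the real one lives in megatron/data/webdataset.py.
dataset = (
    wds.WebDataset("shards/train-{000000..000099}.tar")
    .decode("pil")
    .to_tuple("jpg", "cls")
)

# num_workers=0 keeps data loading in the main process, which avoids the
# worker segfault at the cost of dataloading throughput.
loader = DataLoader(dataset, batch_size=32, num_workers=0)

data_iterator = iter(loader)
batch = next(data_iterator)  # the call that intermittently segfaults with num_workers > 0
```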
We may need to increase num_workers if this slows down large-model training too much; since the segfault only happens occasionally, we can try different num_workers values when needed.
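If throughput matters, here is a hedged sketch of how one might time a few candidate worker counts and keep the smallest value that is both stable and fast enough. The `measure_throughput` helper and the `dataset` variable are hypothetical, purely for illustration:

```python
import time
from torch.utils.data import DataLoader

def measure_throughput(dataset, num_workers, num_batches=50, batch_size=32):
    """Time how long it takes to pull a fixed number of batches."""
    loader = DataLoader(dataset, batch_size=batch_size, num_workers=num_workers)
    it = iter(loader)
    start = time.perf_counter()
    for _ in range(num_batches):
        next(it)  # assumes the dataset yields at least num_batches batches
    return num_batches / (time.perf_counter() - start)  # batches per second

# Try a few candidate worker counts; keep the smallest one that stays stable.
# for workers in (0, 2, 4):
#     print(workers, measure_throughput(dataset, workers))
```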
Environment
torch 1.13.0a0+gitbb7fd1f
torchtyping 0.1.4
torchvision 0.15.0a0+035d99f
webdataset 0.2.48