CERC-AAI / multimodal

An implementation of model parallel autoregressive transformers on GPUs, based on the DeepSpeed library.
Apache License 2.0

DataLoader worker is killed by signal: Segmentation fault. #8

Closed floatingbigcat closed 1 year ago

floatingbigcat commented 1 year ago

Description
If we set the num_workers of the dataloader larger than 0, this error sometimes occurs when fetching data with `batch = next(self.data_iterator)`: https://github.com/floatingsnake/gpt-neox/blob/magma/megatron/data/webdataset.py#L371

This seems to be a bug on the PyTorch side; it might be related to https://github.com/pytorch/pytorch/issues/91245.
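For reference, a minimal sketch of the failing pattern (`DummyStream` is a stand-in for the actual webdataset pipeline, not code from this repo):

```python
import torch
from torch.utils.data import DataLoader, IterableDataset

class DummyStream(IterableDataset):
    # Stand-in for the webdataset pipeline; yields constant tensors forever.
    def __iter__(self):
        while True:
            yield torch.zeros(4)

# num_workers > 0 spawns worker processes; the segfault occurs inside a
# worker and only surfaces in the main process when next() is called.
loader = DataLoader(DummyStream(), batch_size=8, num_workers=2)
data_iterator = iter(loader)
batch = next(data_iterator)  # may die with "DataLoader worker ... killed by signal: Segmentation fault"
```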

Solution
Setting num_workers=0 fixes it.

We may need to increase num_workers again if this slows down large-model training too much. Since the crash only happens occasionally, we can try different values of num_workers when needed, as in the sketch below.
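Something like this could make the worker count easy to tune per run (`build_dataloader` is a hypothetical helper, not the repo's actual API):

```python
from torch.utils.data import DataLoader

def build_dataloader(dataset, batch_size, num_workers=0):
    # Hypothetical helper. num_workers=0 loads batches in the main
    # process, which avoids the worker segfault but may reduce
    # dataloading throughput on large runs.
    return DataLoader(
        dataset,
        batch_size=batch_size,
        num_workers=num_workers,
        # persistent_workers is only valid when num_workers > 0
        persistent_workers=num_workers > 0,
    )
```

That way num_workers can be raised again for runs where main-process loading becomes the bottleneck.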

Environment
torch 1.13.0a0+gitbb7fd1f
torchtyping 0.1.4
torchvision 0.15.0a0+035d99f
webdataset 0.2.48

kshitijkg commented 1 year ago

Yeah, this is a common problem I face when training GPT NeoX without webdataset as well. I usually just try different num_workers values and one of them works well (sometimes even num_workers=2) while still giving high throughput. But yeah, we should figure out the source of the segfault at some point :)