Yeah, this is a common problem I face when training GPT-NeoX without webdataset as well. I usually just try different num_workers values until one works well (sometimes even num_workers=2) while still giving high throughput. But yeah, we should figure out the source of the segfault at some point :)
Description
If we set the num_workers of the dataloader larger than 0, this error sometimes occurs when fetching the data with `batch = next(self.data_iterator)`: https://github.com/floatingsnake/gpt-neox/blob/magma/megatron/data/webdataset.py#L371
This seems to be a bug on the PyTorch side; it might be related to https://github.com/pytorch/pytorch/issues/91245
Solution
Setting num_workers=0 fixes it.
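For reference, a minimal sketch of where this setting applies, assuming a standard torch.utils.data.DataLoader wrapping a WebDataset pipeline (the dataset construction and shard pattern here are illustrative, not the actual code in megatron/data/webdataset.py):

```python
import webdataset as wds
from torch.utils.data import DataLoader

# Illustrative pipeline; the real one lives in megatron/data/webdataset.py.
dataset = (
    wds.WebDataset("shards/train-{000000..000099}.tar")
    .decode("pil")
    .to_tuple("jpg", "cls")
)

# num_workers=0 keeps data loading in the main process, which avoids the
# worker segfault at the cost of dataloading throughput.
loader = DataLoader(dataset, batch_size=32, num_workers=0)

data_iterator = iter(loader)
batch = next(data_iterator)  # the call that intermittently segfaults with num_workers > 0
```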
We may need to increase num_workers if this slows down large-model training too much; since the segfault only happens occasionally, we can try different num_workers values when needed.
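If throughput matters, here is a hedged sketch of how one might time a few candidate worker counts and keep the smallest value that is both stable and fast enough. The `measure_throughput` helper and the `dataset` variable are hypothetical, purely for illustration:

```python
import time
from torch.utils.data import DataLoader

def measure_throughput(dataset, num_workers, num_batches=50, batch_size=32):
    """Time how long it takes to pull a fixed number of batches."""
    loader = DataLoader(dataset, batch_size=batch_size, num_workers=num_workers)
    it = iter(loader)
    start = time.perf_counter()
    for _ in range(num_batches):
        next(it)  # assumes the dataset yields at least num_batches batches
    return num_batches / (time.perf_counter() - start)  # batches per second

# Try a few candidate worker counts; keep the smallest one that stays stable.
# for workers in (0, 2, 4):
#     print(workers, measure_throughput(dataset, workers))
```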
Environment
torch 1.13.0a0+gitbb7fd1f
torchtyping 0.1.4
torchvision 0.15.0a0+035d99f
webdataset 0.2.48