bigscience-workshop / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

Fix/dataloader error #384

Closed. EastInsure closed this pull request 1 year ago.

EastInsure commented 1 year ago

This PR fixes a dataloader error that surfaces as a CUDA out-of-memory crash while building the attention mask:

File "/mnt/Megatron-DeepSpeed/pretrain_gpt.py", line 130, in get_batch_pipe
attention_mask, loss_mask, position_ids = get_ltor_masks_and_position_ids(
File "/mnt/Megatron-DeepSpeed/megatron/utils.py", line 180, in get_ltor_masks_and_position_ids
attention_mask = torch.tril(torch.ones(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 37631.79 GiB (GPU 3; 79.21 GiB total capacity; 50.52 GiB already allocated; 24.67 GiB free; 52.32 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
pt-0de9jsbu-master-0:39:39 [5] NCCL INFO comm 0x559e7ce3f680 rank 0 nranks 2 cudaDev 5 busId ad000 - Abort COMPLETE
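For context, the failing call materializes a dense `(batch, 1, seq, seq)` float32 causal mask, so its footprint grows quadratically with the sequence dimension it is handed. The sketch below is a simplified stand-in for `get_ltor_masks_and_position_ids` (the function name and shape layout follow the traceback; `build_ltor_attention_mask` itself is a hypothetical helper), showing the construction and why the 37631.79 GiB figure points at a corrupted batch shape rather than a genuinely too-small 80 GiB GPU:

```python
import torch

def build_ltor_attention_mask(att_mask_batch: int, seq_length: int,
                              device: str = "cuda") -> torch.Tensor:
    # Dense lower-triangular (left-to-right) causal mask, shaped
    # (batch, 1, seq, seq) as in get_ltor_masks_and_position_ids.
    # float32 footprint: att_mask_batch * seq_length**2 * 4 bytes,
    # so a corrupted sequence dimension blows up quadratically.
    mask = torch.tril(torch.ones(
        (att_mask_batch, seq_length, seq_length), device=device))
    mask = mask.view(att_mask_batch, 1, seq_length, seq_length)
    return mask < 0.5  # True marks positions that must not attend

# Rough failure arithmetic (assuming a batch dimension of 1 and a
# float32 mask): 37631.79 GiB / 4 bytes is about 1.0e13 elements,
# i.e. a sequence dimension of roughly sqrt(1.0e13) ~ 3.2 million,
# far beyond any configured --seq-length. That is consistent with the
# dataloader handing get_batch_pipe a tensor of the wrong shape.
```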