Open ddoron9 opened 2 years ago
I don't know why it goes OOM during multi-GPU training of a large model, but I've reduced GPU memory usage by setting dataset.data_buffer_size=3 (the default is 10); it then uses about 20 GB while training on devices 1 and 2 in the pictures. I'm assuming it could be a shared-memory (shm) issue, because I'm running the command with NCCL_P2P_DISABLE=1.
I just found a post about a PyTorch 1.10 OOM issue (links). Could it be one of the causes of the OOM?
Although I reduced data_buffer_size, the log still shows the OOM issue.
So I'm trying to set the environment variable mentioned in the warning I get:
2022-03-01 02:43:23 | WARNING | fairseq.trainer | OOM: Ran out of memory with exception: CUDA out of memory. Tried to allocate 508.00 MiB (GPU 1; 47.54 GiB total capacity; 42.24 GiB already allocated; 117.56 MiB free; 45.23 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
I would like to know a valid max_split_size_mb value to limit the GPU memory usage on my device. Should I set it in the code, e.g. os.environ["PYTORCH_CUDA_ALLOC_CONF"]="max_split_size_mb:50000"?
Does max_split_size_mb mean MB per GPU (in my case 50000), or (number of GPUs used for training) * 50000?
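As far as I understand the PyTorch memory-management docs, max_split_size_mb is a fragmentation knob for the per-process caching allocator (a threshold in MiB above which cached blocks are not split), not a per-GPU memory cap, so it is not multiplied by the number of GPUs. A minimal sketch of setting it in code (the value 512 is arbitrary, and the variable must be set before the first CUDA allocation):

```python
import os

# Sketch only: PYTORCH_CUDA_ALLOC_CONF is read when the CUDA caching allocator
# initializes, so set it before importing torch / before any CUDA allocation.
# max_split_size_mb is a per-process setting in MiB, not a total memory limit.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:512")

import torch  # imported after the env var on purpose

if torch.cuda.is_available():
    x = torch.randn(1024, 1024, device="cuda")
    print(torch.cuda.memory_summary(device=0, abbreviated=True))
```

Exporting it in the shell before launching fairseq-hydra-train should have the same effect.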
@ddoron9 Were you able to solve it?
Have you tried reducing max_tokens?
@olafthiele Hey! I don't understand why I have to use max_tokens rather than batch_size. Could you please explain it?
@YooSungHyun, it has been some time since I looked at the code, but I remember that batch_size does not have the typical ML meaning for this algorithm, and that max_tokens is the one to reduce if you run into OOM errors.
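Roughly speaking (this is an illustration of the idea, not fairseq's actual implementation), batches for speech are packed by a token budget: samples are grouped until dataset.max_tokens audio frames are reached, so the number of utterances per batch varies and max_tokens is what bounds peak memory per step:

```python
# Illustration only: token-budget batching, where "tokens" are audio frames.
from typing import Iterable, List


def batch_by_tokens(lengths: Iterable[int], max_tokens: int) -> List[List[int]]:
    """Group sample indices so the frame count per batch stays under max_tokens."""
    batches, current, current_tokens = [], [], 0
    for idx, n_frames in enumerate(lengths):
        if current and current_tokens + n_frames > max_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(idx)
        current_tokens += n_frames
    if current:
        batches.append(current)
    return batches


# The per-batch "batch size" changes, but the frame budget is never exceeded.
utterance_lengths = [160_000, 240_000, 90_000, 400_000, 120_000]
print(batch_by_tokens(utterance_lengths, max_tokens=500_000))
# -> [[0, 1, 2], [3], [4]]
```

So lowering dataset.max_tokens shrinks every batch, which is why it is the knob to turn for OOM here.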
❓ Questions and Help
What is your question?
I'm getting OOM while training wav2vec in a multi-GPU environment, and I think it freezes; it recovers when I run on a single GPU.
NCCL_P2P_DISABLE=1 fairseq-hydra-train \
  task.data=/my/path/to/transcriptions \
  checkpoint.save_dir=/save/dir/ \
  distributed_training.distributed_world_size=4 \
  dataset.num_workers=20 \
  dataset.max_tokens=1280000 \
  common.tensorboard_logdir=/log/dir/ \
  common.empty_cache_freq=100 \
  common.memory_efficient_fp16=True \
  model.w2v_path=/checkpoint/path/checkpoint_best.pt \
  --config-dir config/finetuning \
  --config-name vox_960h.yaml
What's your environment?
How you installed fairseq (pip, source): pip