Open ddoron9 opened 2 years ago
I don't know why it goes OOM during multi-GPU training of a large model, but I've reduced GPU memory usage by setting dataset.data_buffer_size=3 (the default is 10); it then uses about 20 GB while training on devices 1 and 2 in the pictures. I'm assuming it could be a shared-memory (shm) issue, because I'm running the command with NCCL_P2P_DISABLE=1.
I just found a post about a PyTorch 1.10 OOM issue (links). Could it be one of the causes of the OOM?
Although I reduced data_buffer_size, the log still shows the OOM issue.
So I'm trying to set the environment variable mentioned in the warning I get:
2022-03-01 02:43:23 | WARNING | fairseq.trainer | OOM: Ran out of memory with exception: CUDA out of memory. Tried to allocate 508.00 MiB (GPU 1; 47.54 GiB total capacity; 42.24 GiB already allocated; 117.56 MiB free; 45.23 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
I would like to know a valid max_split_size_mb value to limit the GPU memory usage on my device. Should I set it in the code, e.g. os.environ["PYTORCH_CUDA_ALLOC_CONF"]="max_split_size_mb:50000"?
Does max_split_size_mb mean MB per GPU (in my case 50000), or (number of GPUs used for training) * 50000?
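As far as I understand the PyTorch memory-management docs, max_split_size_mb is a fragmentation knob for the per-process caching allocator (a threshold in MiB above which cached blocks are not split), not a per-GPU memory cap, so it is not multiplied by the number of GPUs. A minimal sketch of setting it in code (the value 512 is arbitrary, and the variable must be set before the first CUDA allocation):

```python
import os

# Sketch only: PYTORCH_CUDA_ALLOC_CONF is read when the CUDA caching allocator
# initializes, so set it before importing torch / before any CUDA allocation.
# max_split_size_mb is a per-process setting in MiB, not a total memory limit.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:512")

import torch  # imported after the env var on purpose

if torch.cuda.is_available():
    x = torch.randn(1024, 1024, device="cuda")
    print(torch.cuda.memory_summary(device=0, abbreviated=True))
```

Exporting it in the shell before launching fairseq-hydra-train should have the same effect.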
@ddoron9 Were you able to solve it?
Have you tried reducing max_tokens?
@olafthiele Hey! I don't understand why I have to use max_tokens rather than batch_size. Could you please explain it?
@YooSungHyun, it has been some time since I looked at the code, but I remember that batch_size does not have the typical ML meaning for this algorithm, and that max_tokens is the one to reduce if you run into OOM errors.
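Roughly speaking (this is an illustration of the idea, not fairseq's actual implementation), batches for speech are packed by a token budget: samples are grouped until dataset.max_tokens audio frames are reached, so the number of utterances per batch varies and max_tokens is what bounds peak memory per step:

```python
# Illustration only: token-budget batching, where "tokens" are audio frames.
from typing import Iterable, List


def batch_by_tokens(lengths: Iterable[int], max_tokens: int) -> List[List[int]]:
    """Group sample indices so the frame count per batch stays under max_tokens."""
    batches, current, current_tokens = [], [], 0
    for idx, n_frames in enumerate(lengths):
        if current and current_tokens + n_frames > max_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(idx)
        current_tokens += n_frames
    if current:
        batches.append(current)
    return batches


# The per-batch "batch size" changes, but the frame budget is never exceeded.
utterance_lengths = [160_000, 240_000, 90_000, 400_000, 120_000]
print(batch_by_tokens(utterance_lengths, max_tokens=500_000))
# -> [[0, 1, 2], [3], [4]]
```

So lowering dataset.max_tokens shrinks every batch, which is why it is the knob to turn for OOM here.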
❓ Questions and Help
What is your question?
I'm getting OOM while training wav2vec in a multi-GPU environment, and I think it freezes; it recovers when I run on a single GPU.
NCCL_P2P_DISABLE=1 fairseq-hydra-train \
  task.data=/my/path/to/transcriptions \
  checkpoint.save_dir=/save/dir/ \
  distributed_training.distributed_world_size=4 \
  dataset.num_workers=20 \
  dataset.max_tokens=1280000 \
  common.tensorboard_logdir=/log/dir/ \
  common.empty_cache_freq=100 \
  common.memory_efficient_fp16=True \
  model.w2v_path=/checkpoint/path/checkpoint_best.pt \
  --config-dir config/finetuning \
  --config-name vox_960h.yaml
What's your environment?
How you installed fairseq (pip, source): pip