facebookresearch / av_hubert

A self-supervised learning framework for audio-visual speech

OOM when finetuning using multi-GPUs #35

Open xuan97916 opened 2 years ago

xuan97916 commented 2 years ago

What is your question?

Dear authors, thanks a lot for this great work! I'm getting OOM while finetuning AV-HuBERT on my own dataset using multiple GPUs, and the error usually happens on a non-initial epoch. The launch command is:

fairseq-hydra-train --config-dir /my/config --config-name myconfig.yaml \
  hydra.run.dir=../saved_model/20220311_1 \
  common.user_dir=`pwd` \
  distributed_training.ddp_backend=c10d \
  distributed_training.distributed_world_size=4 \
  distributed_training.nprocs_per_node=4
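To see which rank's memory climbs before the crash, one can watch per-GPU usage in a second terminal (plain nvidia-smi, nothing AV-HuBERT-specific):

# Refresh per-GPU memory usage once per second.
watch -n 1 nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv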

The OOM happens randomly on one GPU:

2022-03-18 21:04:26 | WARNING | fairseq.trainer | OOM: Ran out of memory with exception: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 2; 22.38 GiB total capacity; 21.16 GiB already allocated; 19.94 MiB free; 21.54 GiB reserved in total by PyTorch)
2022-03-18 21:04:26 | WARNING | fairseq.trainer | |===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| Active memory         |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| GPU reserved memory   |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| Non-releasable memory |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| Allocations           |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Active allocs         |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| GPU reserved segments |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|===========================================================================|

2022-03-18 21:04:26 | WARNING | fairseq.trainer | (memory summary for device ID 1 omitted: identical to device ID 0 above, all counters zero, no OOMs)

2022-03-18 21:04:26 | WARNING | fairseq.trainer | |===========================================================================|
|                  PyTorch CUDA memory summary, device ID 2                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 1            |        cudaMalloc retries: 8         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |   21660 MB |   21678 MB |   93887 GB |   93865 GB |
|       from large pool |   21640 MB |   21663 MB |   93001 GB |   92980 GB |
|       from small pool |      19 MB |      19 MB |     885 GB |     885 GB |
|---------------------------------------------------------------------------|
| Active memory         |   21660 MB |   21678 MB |   93887 GB |   93865 GB |
|       from large pool |   21640 MB |   21663 MB |   93001 GB |   92980 GB |
|       from small pool |      19 MB |      19 MB |     885 GB |     885 GB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |   22062 MB |   22078 MB |   61016 MB |   38954 MB |
|       from large pool |   22040 MB |   22060 MB |   60488 MB |   38448 MB |
|       from small pool |      22 MB |     176 MB |     528 MB |     506 MB |
|---------------------------------------------------------------------------|
| Non-releasable memory |  411642 KB |    7842 MB |  189965 GB |  189965 GB |
|       from large pool |  409546 KB |    7828 MB |  188976 GB |  188976 GB |
|       from small pool |    2096 KB |      14 MB |     989 GB |     989 GB |
|---------------------------------------------------------------------------|
| Allocations           |    1810    |    1879    |   28459 K  |   28457 K  |
|       from large pool |     660    |     662    |    6158 K  |    6157 K  |
|       from small pool |    1150    |    1299    |   22300 K  |   22299 K  |
|---------------------------------------------------------------------------|
| Active allocs         |    1810    |    1879    |   28459 K  |   28457 K  |
|       from large pool |     660    |     662    |    6158 K  |    6157 K  |
|       from small pool |    1150    |    1299    |   22300 K  |   22299 K  |
|---------------------------------------------------------------------------|
| GPU reserved segments |     173    |     244    |     572    |     399    |
|       from large pool |     162    |     163    |     308    |     146    |
|       from small pool |      11    |      88    |     264    |     253    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |     144    |     214    |    9561 K  |    9560 K  |
|       from large pool |     135    |     161    |    2333 K  |    2333 K  |
|       from small pool |       9    |      58    |    7227 K  |    7227 K  |
|===========================================================================|

2022-03-18 21:04:26 | WARNING | fairseq.trainer | (memory summary for device ID 3 omitted: identical to device ID 0 above, all counters zero, no OOMs)

I have tried no_c10d and pytorch_ddp as the ddp_backend, tried downgrading PyTorch to 1.9.1 and 1.8.0 as suggested in this issue, and also checked my dataset (using max_tokens instead of batch_size to guard against overly long utterances), but none of these worked for me.
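The next things I plan to try, in case this is allocator fragmentation rather than a real capacity shortfall (the failing GPU shows only 19.94 MiB free against 21.54 GiB reserved, with a non-releasable pool that peaked near 7.8 GB) -- a sketch only, with placeholder values:

# Fragmentation workaround, available on PyTorch >= 1.10 only;
# 128 MB is a guess -- tune it for your workload.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128

# Halve the per-GPU batch and compensate with gradient accumulation so the
# effective batch size is unchanged; use half of whatever max_tokens your
# config currently sets (500 here is purely illustrative).
fairseq-hydra-train --config-dir /my/config --config-name myconfig.yaml \
  hydra.run.dir=../saved_model/20220311_1 \
  common.user_dir=`pwd` \
  distributed_training.ddp_backend=c10d \
  distributed_training.distributed_world_size=4 \
  distributed_training.nprocs_per_node=4 \
  dataset.max_tokens=500 \
  optimization.update_freq='[2]'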

What's your environment?

Thanks in advance for your comment!

All the best, An Hsu

chevalierNoir commented 2 years ago

Hi,

What dataset.max_tokens did you set? Also, what is the maximum utterance length in your dataset? If the very long utterances aren't too numerous, you can try removing them from training by setting task.max_sample_size.
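For instance (placeholder values; pick them from your data statistics), appending dataset.max_tokens=1000 task.max_sample_size=500 to your launch command caps both the batch size and the longest admitted utterance. To find the longest utterance, you can scan the training manifest; the sketch below assumes the AV-HuBERT tsv layout (a root-dir header line, then id / video path / audio path / video frames / audio frames per row) -- adjust the column index if yours differs:

# Print the longest utterance length (in frames) in the manifest;
# tail skips the header line, $4 is the video frame count column.
tail -n +2 train.tsv | awk -F'\t' '{ if ($4+0 > max) max = $4+0 } END { print max }'

If only a handful of utterances sit far above the rest, a max_sample_size just above the bulk of the distribution caps worst-case batch memory while discarding almost no data.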

s2485523800 commented 6 months ago

I hit an OOM error halfway through the fifth epoch, and resuming from the checkpoint saved before it immediately runs out of memory as well. Have you resolved this?