NVIDIA / apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in PyTorch
BSD 3-Clause "New" or "Revised" License

GPU memory leak with Flair and APEX #1744

astropic commented 1 year ago

Describe the Bug

When I train an NER Flair model with APEX, GPU memory usage keeps increasing until it runs out. The same behaviour does not occur when I train without APEX.
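
For reference, the growth shows up directly in PyTorch's allocator counters. A minimal sketch of how it can be confirmed per iteration (the loop below is only a placeholder for Flair's internal training loop; the `torch.cuda` calls are real APIs):

```python
import torch

for step in range(1000):
    # ... one Flair training step would run here ...
    if step % 100 == 0:
        allocated = torch.cuda.memory_allocated() / 1024**2  # MiB currently allocated by tensors
        reserved = torch.cuda.memory_reserved() / 1024**2    # MiB held by the caching allocator
        print(f"step {step}: allocated={allocated:.0f} MiB, reserved={reserved:.0f} MiB")
```

Script to reproduce: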

import flair
import torch
from typing import List

from apex import amp  # missing from the original snippet; required for amp.initialize below
from flair.data import Corpus
from flair.datasets import ColumnCorpus
from flair.embeddings import WordEmbeddings, StackedEmbeddings, FlairEmbeddings, TokenEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

#### Embeddings
embedding_types: List[TokenEmbeddings] = [

        #### other embeddings
        FlairEmbeddings('pt-forward'),
        FlairEmbeddings('pt-backward'),
]

embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_types)

#### Corpus (dataset itself not provided; the folder path is a placeholder)
columns = {0: 'text', 1: 'ner'}
corpus: Corpus = ColumnCorpus('resources/data', columns)

#### Tagger
tag_dictionary = corpus.make_label_dictionary('ner')
tag_type = 'ner'

tagger: SequenceTagger = SequenceTagger(hidden_size=256,
                                        embeddings=embeddings,
                                        tag_dictionary=tag_dictionary,
                                        tag_type=tag_type,
                                        use_crf=True)

#### Trainer
trainer: ModelTrainer = ModelTrainer(tagger, corpus)
trainer.model = amp.initialize(trainer.model, opt_level="O3")  # O1 shows the same problem

#### Dataset not provided (1000 rows)
trainer.train('resources/taggers/example-ner',
              learning_rate=0.1,
              mini_batch_size=1,
              max_epochs=150,
              embeddings_storage_mode='CPU')
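
Note that apex's documented pattern registers the optimizer with amp and routes the backward pass through `amp.scale_loss`; because Flair's `ModelTrainer` creates the optimizer and calls `backward()` internally, the script above can only wrap the model. For comparison, a minimal sketch of the canonical apex flow on a plain loop (model, optimizer, and data are synthetic placeholders):

```python
import torch
from apex import amp

model = torch.nn.Linear(128, 2).cuda()                     # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)    # placeholder optimizer
data = [(torch.randn(32, 128).cuda(), torch.randint(0, 2, (32,)).cuda())
        for _ in range(10)]                                # synthetic batches

# Register model AND optimizer so apex can manage casts and loss scaling together.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

for inputs, targets in data:
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    # The backward pass must go through scale_loss for apex's loss scaling to work.
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()
```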

During training:

2023-10-21 13:13:39,450 epoch 1 - iter 100/1000 - loss 0.34589515 - time (sec): 41.12 - samples/sec: 571.66 - lr: 0.100000
2023-10-21 13:14:17,702 epoch 1 - iter 200/1000 - loss 0.28670060 - time (sec): 79.37 - samples/sec: 574.87 - lr: 0.100000
2023-10-21 13:14:51,537 epoch 1 - iter 300/1000 - loss 0.23959376 - time (sec): 113.20 - samples/sec: 574.21 - lr: 0.100000
2023-10-21 13:15:27,849 epoch 1 - iter 400/1000 - loss 0.21258851 - time (sec): 149.51 - samples/sec: 571.58 - lr: 0.100000
2023-10-21 13:16:05,560 epoch 1 - iter 500/1000 - loss 0.19750254 - time (sec): 187.23 - samples/sec: 571.11 - lr: 0.100000
2023-10-21 13:16:37,191 epoch 1 - iter 600/1000 - loss 0.17963039 - time (sec): 218.86 - samples/sec: 569.15 - lr: 0.100000
2023-10-21 13:17:15,862 epoch 1 - iter 700/1000 - loss 0.16133230 - time (sec): 257.53 - samples/sec: 566.69 - lr: 0.100000
2023-10-21 13:17:52,794 epoch 1 - iter 800/1000 - loss 0.14376180 - time (sec): 294.46 - samples/sec: 565.10 - lr: 0.100000

And after that:

OutOfMemoryError: CUDA out of memory. Tried to allocate 46.00 MiB (GPU 0; 8.00 GiB total capacity; 7.18 GiB already
allocated; 0 bytes free; 7.27 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try 
setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and 
PYTORCH_CUDA_ALLOC_CONF
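
For completeness, the mitigation the error message points to is an environment variable that caps the allocator's split size; it has to be set before the first CUDA allocation, and it addresses fragmentation rather than a genuine leak (the 128 MiB value below is just an example):

```python
import os

# Must be set before the first CUDA allocation; on Windows the shell equivalent is
#   set PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # import after setting the variable so the allocator picks it up
```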

Expected Behavior

The training should run to completion without out-of-memory errors, just as it does without APEX.

Environment

PyTorch version: 2.0.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 11 Home Single Language
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: N/A

Python version: 3.10.4 | packaged by conda-forge | (main, Mar 30 2022, 08:38:02) [MSC v.1916 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.22621-SP0
Is CUDA available: True
CUDA runtime version: 11.7.64
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3070 Laptop GPU
Nvidia driver version: 528.79
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture=9
CurrentClockSpeed=2304
DeviceID=CPU0
Family=198
L2CacheSize=10240
L2CacheSpeed=
Manufacturer=GenuineIntel
MaxClockSpeed=2304
Name=11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz
ProcessorType=3
Revision=

Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.23.5
[pip3] pytorch_revgrad==0.2.0
[pip3] torch==2.0.1+cu117
[pip3] torchaudio==2.0.2+cu117
[pip3] torchvision==0.15.2+cu117
[conda] blas                      1.0                         mkl
[conda] cudatoolkit               11.3.1              h280eb24_10    conda-forge
[conda] libblas                   3.9.0              15_win64_mkl    conda-forge
[conda] libcblas                  3.9.0              15_win64_mkl    conda-forge
[conda] liblapack                 3.9.0              15_win64_mkl    conda-forge
[conda] mkl                       2022.1.0           h6a75c08_874    conda-forge
[conda] numexpr                   2.8.4           mkl_py310h98e78b8_0    conda-forge
[conda] numpy                     1.23.5                   pypi_0    pypi
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] pytorch-revgrad           0.2.0                    pypi_0    pypi
[conda] torch                     2.0.1+cu117              pypi_0    pypi
[conda] torchaudio                2.0.2+cu117              pypi_0    pypi
[conda] torchvision               0.15.2+cu117             pypi_0    pypi

I have seen a similar discussion in https://github.com/NVIDIA/apex/issues/439, but I did not find a practical solution to this problem there.
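
Since apex's amp module is deprecated in favor of PyTorch's native AMP, one workaround worth trying is to drop apex entirely. A minimal sketch of the native `torch.cuda.amp` pattern on a plain loop (model, optimizer, and data are synthetic placeholders; Flair would need to run this inside its own trainer):

```python
import torch

model = torch.nn.Linear(128, 2).cuda()                     # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)    # placeholder optimizer
scaler = torch.cuda.amp.GradScaler()                       # dynamic loss scaling
data = [(torch.randn(32, 128).cuda(), torch.randint(0, 2, (32,)).cuda())
        for _ in range(10)]                                # synthetic batches

for inputs, targets in data:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():                        # mixed-precision forward
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    scaler.scale(loss).backward()                          # scaled backward pass
    scaler.step(optimizer)                                 # unscale grads, then step
    scaler.update()                                        # adjust the scale factor
```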

Please help ))