Describe the Bug
When I train a NER model with Flair and APEX, GPU memory keeps increasing until it runs out. This does not happen when I train without APEX.
import flair
from flair.datasets import ColumnCorpus
from flair.data import Corpus
import torch
from flair.embeddings import WordEmbeddings, StackedEmbeddings, FlairEmbeddings, TokenEmbeddings
from typing import List
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer
from apex import amp  # NVIDIA apex mixed-precision API

#### Corpus: dataset not provided (1000 rows), loaded as a ColumnCorpus
# corpus: Corpus = ColumnCorpus(...)

#### Embeddings
embedding_types: List[TokenEmbeddings] = [
    #### other embeddings
    FlairEmbeddings('pt-forward'),
    FlairEmbeddings('pt-backward'),
]
embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_types)

#### Tagger
tag_type = 'ner'
tag_dictionary = corpus.make_label_dictionary(tag_type)
tagger: SequenceTagger = SequenceTagger(hidden_size=256,
                                        embeddings=embeddings,
                                        tag_dictionary=tag_dictionary,
                                        tag_type=tag_type,
                                        use_crf=True)

#### Trainer
trainer: ModelTrainer = ModelTrainer(tagger, corpus)
trainer.model = amp.initialize(trainer.model, opt_level="O3")  # opt_level O1 shows the same problem

trainer.train('resources/taggers/example-ner',
              learning_rate=0.1,
              mini_batch_size=1,
              max_epochs=150,
              embeddings_storage_mode='CPU')
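To make the growth visible, a small helper can be polled every few batches. This is a diagnostic snippet of my own (not part of Flair or of the script above); it only reads the CUDA allocator counters:

import torch

def log_gpu_memory(tag: str = "") -> None:
    # memory_allocated = memory held by live tensors; memory_reserved = memory cached by the allocator
    allocated_mb = torch.cuda.memory_allocated() / 1024 ** 2
    reserved_mb = torch.cuda.memory_reserved() / 1024 ** 2
    print(f"{tag} allocated={allocated_mb:.1f} MiB | reserved={reserved_mb:.1f} MiB")

Calling it every N batches would show whether it is the allocated or the reserved figure that keeps growing.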
During training:
2023-10-21 13:13:39,450 epoch 1 - iter 100/1000 - loss 0.34589515 - time (sec): 41.12 - samples/sec: 571.66 - lr: 0.100000
2023-10-21 13:14:17,702 epoch 1 - iter 200/1000 - loss 0.28670060 - time (sec): 79.37 - samples/sec: 574.87 - lr: 0.100000
2023-10-21 13:14:51,537 epoch 1 - iter 300/1000 - loss 0.23959376 - time (sec): 113.20 - samples/sec: 574.21 - lr: 0.100000
2023-10-21 13:15:27,849 epoch 1 - iter 400/1000 - loss 0.21258851 - time (sec): 149.51 - samples/sec: 571.58 - lr: 0.100000
2023-10-21 13:16:05,560 epoch 1 - iter 500/1000 - loss 0.19750254 - time (sec): 187.23 - samples/sec: 571.11 - lr: 0.100000
2023-10-21 13:16:37,191 epoch 1 - iter 600/1000 - loss 0.17963039 - time (sec): 218.86 - samples/sec: 569.15 - lr: 0.100000
2023-10-21 13:17:15,862 epoch 1 - iter 700/1000 - loss 0.16133230 - time (sec): 257.53 - samples/sec: 566.69 - lr: 0.100000
2023-10-21 13:17:52,794 epoch 1 - iter 800/1000 - loss 0.14376180 - time (sec): 294.46 - samples/sec: 565.10 - lr: 0.100000
And after that:
OutOfMemoryError: CUDA out of memory. Tried to allocate 46.00 MiB (GPU 0; 8.00 GiB total capacity; 7.18 GiB already
allocated; 0 bytes free; 7.27 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try
setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and
PYTORCH_CUDA_ALLOC_CONF
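For reference, the allocator setting the error message points to can be applied like this (the 128 MiB value is only an example, and it addresses fragmentation rather than the steady growth itself):

import os
# Must be set before the first CUDA call, e.g. at the very top of the script.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
import torch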
Expected Behavior
Training should run to completion without out-of-memory errors, just as it does without APEX.
Environment
PyTorch version: 2.0.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A
OS: Microsoft Windows 11 Home Single Language
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: N/A
Python version: 3.10.4 | packaged by conda-forge | (main, Mar 30 2022, 08:38:02) [MSC v.1916 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.22621-SP0
Is CUDA available: True
CUDA runtime version: 11.7.64
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3070 Laptop GPU
Nvidia driver version: 528.79
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture=9
CurrentClockSpeed=2304
DeviceID=CPU0
Family=198
L2CacheSize=10240
L2CacheSpeed=
Manufacturer=GenuineIntel
MaxClockSpeed=2304
Name=11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz
ProcessorType=3
Revision=
Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.23.5
[pip3] pytorch_revgrad==0.2.0
[pip3] torch==2.0.1+cu117
[pip3] torchaudio==2.0.2+cu117
[pip3] torchvision==0.15.2+cu117
[conda] blas 1.0 mkl
[conda] cudatoolkit 11.3.1 h280eb24_10 conda-forge
[conda] libblas 3.9.0 15_win64_mkl conda-forge
[conda] libcblas 3.9.0 15_win64_mkl conda-forge
[conda] liblapack 3.9.0 15_win64_mkl conda-forge
[conda] mkl 2022.1.0 h6a75c08_874 conda-forge
[conda] numexpr 2.8.4 mkl_py310h98e78b8_0 conda-forge
[conda] numpy 1.23.5 pypi_0 pypi
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] pytorch-revgrad 0.2.0 pypi_0 pypi
[conda] torch 2.0.1+cu117 pypi_0 pypi
[conda] torchaudio 2.0.2+cu117 pypi_0 pypi
[conda] torchvision 0.15.2+cu117 pypi_0 pypi
I have seen a similar discussion in https://github.com/NVIDIA/apex/issues/439, but I did not find a practical solution to this problem there.
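If it is useful for comparison, this is roughly what the native torch.cuda.amp path looks like. It is a plain PyTorch loop, not Flair's ModelTrainer, so model, optimizer and dataloader are placeholders here, and it is only a sketch of the non-APEX mixed-precision approach:

import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for batch in dataloader:
    optimizer.zero_grad(set_to_none=True)
    with autocast():                          # run the forward pass in mixed precision
        loss = model.forward_loss(batch)      # assuming a scalar loss is returned; Flair versions differ here
    scaler.scale(loss).backward()             # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)                    # unscales gradients, then runs the optimizer step
    scaler.update()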
Please help ))