flairNLP / flair

A very simple framework for state-of-the-art Natural Language Processing (NLP)
https://flairnlp.github.io/flair/

CUDA out of memory (again) #736

Closed emrul closed 4 years ago

emrul commented 5 years ago

I previously got CUDA out of memory issues but resolved them by

  1. Ensuring no sentence is greater than 200 tokens (a rough sketch of one way to do this follows this list)
  2. Updating PyTorch
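
As mentioned above, a minimal sketch of one way to enforce the 200-token limit, assuming the in-memory TaggedCorpus of flair 0.4.x where train/dev/test are plain lists of Sentence objects (details may differ in other versions):

# Illustrative only: drop sentences longer than 200 tokens and rebuild the corpus.
MAX_TOKENS = 200
train = [s for s in corpus.train if len(s) <= MAX_TOKENS]
dev = [s for s in corpus.dev if len(s) <= MAX_TOKENS]
test = [s for s in corpus.test if len(s) <= MAX_TOKENS]
corpus = TaggedCorpus(train, dev, test)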

This worked well with my small dataset (12.5k training sentences) but now I am trying with a bigger dataset (160k sentences) and my CUDA issue is back.

This is my code currently:

columns = {0: 'text', 1: 'ner'}
data_folder = 'training/sequences'
corpus: TaggedCorpus = NLPTaskDataFetcher.load_column_corpus(data_folder, columns)
print(corpus)

tag_type = 'ner'
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)
print(tag_dictionary.idx2item)

embeddings = StackedEmbeddings(embeddings=[
    WordEmbeddings('glove'),
    PooledFlairEmbeddings('news-forward', pooling='mean', only_capitalized=True, use_cache=True, chars_per_chunk=32),
    PooledFlairEmbeddings('news-backward', pooling='mean', only_capitalized=True, use_cache=True, chars_per_chunk=32),
    WordEmbeddings(embeddings="embeddings/markup2vec.wv")
])

tagger: SequenceTagger = SequenceTagger(hidden_size=128,
                                        embeddings=embeddings,
                                        tag_dictionary=tag_dictionary,
                                        tag_type=tag_type)

trainer: ModelTrainer = ModelTrainer(tagger, corpus)

trainer.train('resources/taggers/v003',
              max_epochs=150,
              mini_batch_size=192,
              eval_mini_batch_size=16,
              checkpoint=True)

and this is the error:

RuntimeError: CUDA out of memory. Tried to allocate 156.00 MiB (GPU 0; 24.00 GiB total capacity; 15.79 GiB already allocated; 142.52 MiB free; 3.81 GiB cached)

I've tried using the latest Flair (GitHub master) and running on Windows 10 with Python 3.7 and PyTorch 1.1.0. Any ideas on what else I can try to get training to work with a large number of samples?

alanakbik commented 5 years ago

Hello @emrul, we just now pushed new changes to the master branch that enable streaming data loading using the Dataset and DataLoader APIs.

If you replace the lines:

columns = {0: 'text', 1: 'ner'}
data_folder = 'training/sequences'
corpus: TaggedCorpus = NLPTaskDataFetcher.load_column_corpus(data_folder, columns) 

with:

from flair.datasets import ColumnCorpus
columns = {0: 'text', 1: 'ner'}
data_folder = 'training/sequences'
corpus = ColumnCorpus(data_folder, columns, in_memory=False)

By setting in_memory to False, you make sure that the dataset is not stored in memory but is instead always read from disk.

On another note, I would currently not recommend using the PooledFlairEmbeddings on large corpora. They are currently unoptimized, so they will completely blow up your memory and lead to very large models. An optimized version is coming that is much more memory efficient (and more accurate to boot), which I am hoping to push to master in the coming weeks.
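
For illustration, switching the stack above to the plain (non-pooled) embeddings would look roughly like this; treat it as a sketch only, and double-check that your installed version's FlairEmbeddings accepts the chars_per_chunk parameter:

from flair.embeddings import StackedEmbeddings, WordEmbeddings, FlairEmbeddings

# Same stack as before, but with plain FlairEmbeddings instead of the pooled
# variant (the pooled variant keeps a growing cache of word states in memory).
embeddings = StackedEmbeddings(embeddings=[
    WordEmbeddings('glove'),
    FlairEmbeddings('news-forward', chars_per_chunk=32),
    FlairEmbeddings('news-backward', chars_per_chunk=32),
    WordEmbeddings(embeddings="embeddings/markup2vec.wv"),
])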

aayushsanghavi commented 5 years ago

@alanakbik from flair.datasets import ColumnCorpus gives an import error.

alanakbik commented 5 years ago

Are you using the latest master branch?

aayushsanghavi commented 5 years ago

I believe so. I'm using version 0.4.1. I noticed that the last commit on the datasets.py file was 2 hours ago, so I tried pip install --upgrade flair (just in case the version had been bumped) before I changed those lines, but it was the latest one already.

alanakbik commented 5 years ago

Ah yes, that is different from the master branch. If you install from pip you get the latest release, 0.4.1. The master branch already contains features that will be part of 0.4.2 when it releases, so the only way to use them right now is to work from the master branch instead. You can install master like this:

pip install git+https://github.com/zalandoresearch/flair.git

amitbcp commented 5 years ago

Hi @alanakbik

Can you please share a code snippet for loading a TextClassification corpus via data streaming?

alanakbik commented 5 years ago

Hello @amitbcp sure, here it is:

from flair.datasets import ClassificationCorpus
corpus = ClassificationCorpus("path/to/your/corpus/", in_memory=False)

If you set in_memory to True, it will read everything into memory like before. By setting it to False you get the streaming data loading. Everything else is the same, so you can train models just like before.
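
For completeness, a rough end-to-end sketch around that corpus (class names as used in the 0.4.x text classification tutorials; adjust if they differ in your installed version, and treat the hyperparameters as placeholders):

from flair.datasets import ClassificationCorpus
from flair.embeddings import WordEmbeddings, DocumentRNNEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer

# Streaming corpus, exactly as above.
corpus = ClassificationCorpus("path/to/your/corpus/", in_memory=False)
label_dict = corpus.make_label_dictionary()

# Document embeddings and classifier; values here are illustrative only.
document_embeddings = DocumentRNNEmbeddings([WordEmbeddings('glove')], hidden_size=256)
classifier = TextClassifier(document_embeddings, label_dictionary=label_dict, multi_label=False)

ModelTrainer(classifier, corpus).train('resources/classifiers/example', max_epochs=10)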

emrul commented 5 years ago

@alanakbik The streaming data loader works really well in my initial tests. I have switched to using regular FlairEmbeddings, but I'm looking forward to the enhanced pooled version that supports larger datasets. Thanks for your help!

emrul commented 5 years ago

I spoke too soon - I was still using my small dataset. When using the larger one I still get OOM even though I've tried to make everything as lean as possible - my code is here: https://gist.github.com/emrul/74486783e9d750f2cb08695bf26719da

Again, any help would be really valuable.

alanakbik commented 5 years ago

Hi @emrul, could you try running this code with embeddings_in_memory set to False? That is, uncomment the commented-out line.
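
In the train call, that amounts to passing the flag explicitly, roughly like this (parameter name as in the 0.4.x ModelTrainer.train; later releases may rename it, and the other arguments are just illustrative):

trainer.train('resources/taggers/v003',
              max_epochs=150,
              mini_batch_size=32,
              eval_mini_batch_size=16,
              checkpoint=True,
              embeddings_in_memory=False)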

emrul commented 5 years ago

Hi, yes - same issue however:

2019-05-27 19:23:17,479 epoch 1 - iter 0/2516 - loss 132.78053284
Traceback (most recent call last):
  File "train_seq_opt.py", line 59, in <module>
    train()
  File "train_seq_opt.py", line 54, in train
    checkpoint=True)
  File "C:\Users\emrul\AppData\Local\conda\conda\envs\flair\lib\site-packages\flair\trainers\trainer.py", line 197, in train
    loss = self.model.forward_loss(batch)
  File "C:\Users\emrul\AppData\Local\conda\conda\envs\flair\lib\site-packages\flair\models\sequence_tagger_model.py", line 316, in forward_loss
    features = self.forward(sentences)
  File "C:\Users\emrul\AppData\Local\conda\conda\envs\flair\lib\site-packages\flair\models\sequence_tagger_model.py", line 368, in forward
    self.embeddings.embed(sentences)
  File "C:\Users\emrul\AppData\Local\conda\conda\envs\flair\lib\site-packages\flair\embeddings.py", line 145, in embed
    embedding.embed(sentences)
  File "C:\Users\emrul\AppData\Local\conda\conda\envs\flair\lib\site-packages\flair\embeddings.py", line 76, in embed
    self._add_embeddings_internal(sentences)
  File "C:\Users\emrul\AppData\Local\conda\conda\envs\flair\lib\site-packages\flair\embeddings.py", line 1248, in _add_embeddings_internal
    sentences_padded, self.chars_per_chunk
  File "C:\Users\emrul\AppData\Local\conda\conda\envs\flair\lib\site-packages\flair\models\language_model.py", line 130, in get_representation
    prediction, rnn_output, hidden = self.forward(batch, hidden)
  File "C:\Users\emrul\AppData\Local\conda\conda\envs\flair\lib\site-packages\flair\models\language_model.py", line 78, in forward
    output, hidden = self.rnn(emb, hidden)
  File "C:\Users\emrul\AppData\Local\conda\conda\envs\flair\lib\site-packages\torch\nn\modules\module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\emrul\AppData\Local\conda\conda\envs\flair\lib\site-packages\torch\nn\modules\rnn.py", line 559, in forward
    return self.forward_tensor(input, hx)
  File "C:\Users\emrul\AppData\Local\conda\conda\envs\flair\lib\site-packages\torch\nn\modules\rnn.py", line 539, in forward_tensor
    output, hidden = self.forward_impl(input, hx, batch_sizes, max_batch_size, sorted_indices)
  File "C:\Users\emrul\AppData\Local\conda\conda\envs\flair\lib\site-packages\torch\nn\modules\rnn.py", line 522, in forward_impl
    self.dropout, self.training, self.bidirectional, self.batch_first)
RuntimeError: CUDA out of memory. Tried to allocate 98.00 MiB (GPU 0; 24.00 GiB total capacity; 19.03 GiB already allocated; 80.52 MiB free; 642.41 MiB cached)

aayushsanghavi commented 5 years ago

@alanakbik sorry to bother you again, but even after looking things up I couldn't fix my problem. My sequence tagging dataset looks like this:

token1   tag1
token2   tag2
... so on
tokenN   tagN

token1   tag1
token2   tag2
... so on
tokenM   tagM

Spaces separate the columns and newlines separate sentences. This was working fine earlier when I was using NLPTaskDataFetcher.load_column_corpus(). But after switching to ColumnCorpus (for handling bigger datasets) I am getting tons of this warning - ACHTUNG: An empty Sentence was created! Are there empty strings in your dataset?

and this is what I'm doing

columns = {0: 'text', 1: 'tag'}
corpus = ColumnCorpus(args.data_folder, columns, in_memory=False)

I noticed that this log message was only recently added to the Sentence class, so the same thing may well have been happening earlier, but since there was no logging I assumed my dataset format was fine. I read a bit of the code that parses the file and creates Sentence objects, but I couldn't figure out why Sentence objects with empty text were being created.

Can you please tell me how to fix this?

alanakbik commented 5 years ago

@aayushsanghavi yes this is an unfortunate bug that was added and then fixed yesterday in another PR. Could you test with the current master to confirm that the bug is now gone? Thanks!
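
If the warning still shows up after that, a quick way to rule out the file itself (the usual culprits are whitespace-only lines or doubled blank lines between sentences) would be something along these lines; the file path is just a placeholder:

from pathlib import Path

# Placeholder path; point this at your own column-format training file.
lines = Path('data/train.txt').read_text(encoding='utf-8').splitlines()

prev_blank = True
for lineno, line in enumerate(lines, start=1):
    blank = line.strip() == ''
    # Whitespace-only lines and consecutive blank lines are what typically
    # produce empty Sentence objects.
    if blank and (line != '' or prev_blank):
        print(f'check line {lineno}: {line!r}')
    prev_blank = blank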

alanakbik commented 5 years ago

@emrul sorry, I just realized that your memory issues are CUDA-related - this is different from the regular memory issues and cannot be fixed with the embeddings_in_memory setting. Your script looks good, in particular since you've set the chars_per_chunk parameter to 64, which should be small enough given how much CUDA memory you have.

The only things I can think of trying would be to reduce the mini_batch_size or to further reduce chars_per_chunk, since it seems a mini-batch is produced that is too large for the available memory. But it really is strange that the memory is not enough. Are you running only one task on this GPU, or is it also being used by other processes?
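
Purely as an illustration (the exact values are arbitrary), that would mean something like the following, with the corpus and tagger built exactly as in your script:

from flair.embeddings import StackedEmbeddings, WordEmbeddings, FlairEmbeddings
from flair.trainers import ModelTrainer

# Halve the character chunk size of the Flair LM embeddings...
embeddings = StackedEmbeddings(embeddings=[
    WordEmbeddings('glove'),
    FlairEmbeddings('news-forward', chars_per_chunk=32),
    FlairEmbeddings('news-backward', chars_per_chunk=32),
])

# ...and train with a much smaller mini-batch size.
trainer = ModelTrainer(tagger, corpus)
trainer.train('resources/taggers/v003',
              max_epochs=150,
              mini_batch_size=8,
              eval_mini_batch_size=8,
              checkpoint=True)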

aayushsanghavi commented 5 years ago

Ooh right, that would explain it - when I tried it again this morning it worked. I set num_workers=6 and mini_batch_size=8 to reduce memory usage, so I didn't have to use in_memory=False.

aayushsanghavi commented 5 years ago

Hey @alanakbik, whatever changes you guys pushed recently, they're pretty good. The memory management has improved so much. Earlier my model used to take up more than 12.5 GB of RAM, but now it trains in about 7 GB. Nice work!

emrul commented 5 years ago

@alanakbik Nothing else is using the GPU - the entire machine is dedicated to this task. It's an Nvidia Quadro P6000, so it has a lot of memory. I've reduced the mini-batch size to 32 and the issue still persists.

I'll go through my training data to see if there's anything odd in there, but other than that I am stumped.

alanakbik commented 5 years ago

@emrul sorry for the late reply (was busy attending NAACL) - have you made progress on this? What if you set the mini-batch size to something very low, like 2?

alanakbik commented 5 years ago

@aayushsanghavi thanks :) we hope to optimize the process more with future releases :)

emrul commented 5 years ago

Hi @alanakbik - I think it may be an issue with my larger dataset. I am investigating, but I think the problem is likely in the dataset and not in Flair. Will update when I find something.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.