Closed emrul closed 4 years ago
Hello @emrul we just now pushed new changes to the master branch that enable streaming data loading using the Dataset
and DataLoader
apis.
If you replace the lines:
columns = {0: 'text', 1: 'ner'}
data_folder = 'training/sequences'
corpus: TaggedCorpus = NLPTaskDataFetcher.load_column_corpus(data_folder, columns)
with:
from flair.datasets import ColumnCorpus
columns = {0: 'text', 1: 'ner'}
data_folder = 'training/sequences'
corpus = ColumnCorpus(data_folder, columns, in_memory=False)
By setting in_memory
to False
you make sure that the dataset is not stored in memory, but instead always read from disk.
On another note, I would currently not recommend using the PooledFlairEmbeddings
on large corpora. They are currently unoptimized and this will completely blow up your memory and lead to very large models. There is an optimized version coming that is much more memory effective (and more accurate to boot), which I am hoping to push to master in the next weeks.
@alanakbik from flair.datasets import ColumnCorpus
gives an import error.
Are you using the latest master branch?
I believe so. I'm using version 0.4.1. I noticed that the last commit on the datasets.py file was 2 hours ago so I tried pip install --upgrade flair
(just in case the version was upgraded) before I changed those lines but it was the latest one already.
Ah yes that is different from the master branch. If you install from pip you are using the latest release 0.4.1. The master branch already contains features that will be part of 0.4.2 when it releases, so the only way to use them right now is for you to work on the master branch instead. You can install master like this:
pip install git+https://github.com/zalandoresearch/flair.git
Hi @alanakbik
Can you please share a code-snippet for loading TextClassification corpus via data-streaming ?
Hello @amitbcp sure, here it is:
from flair.datasets import ClassificationCorpus
corpus = ClassificationCorpus("path/to/your/corpus/", in_memory=False)
If you set in_memory
to True
, it will read everything into memory like before. By setting it to False
you get the streaming data loading. Everything else is the same so you can train models just like before.
@alanakbik The streaming data loader works really well on my initial tests. I have switched to using regular FlairEmbeddings but I'm looking forward to the enhanced Pooled version that supports larger datasets. Thanks for your help!
I spoke too soon - I was still using my small dataset. When using the larger one I still get OOM even though I've tried to make everything as lean as possible - my code is here: https://gist.github.com/emrul/74486783e9d750f2cb08695bf26719da
Again, any help would be really valuable.
Hi @emrul could you try running this code with embeddings_in_memory
set to False
? I.e. comment in the commented-out line.
Hi, yes - same issue however:
2019-05-27 19:23:17,479 epoch 1 - iter 0/2516 - loss 132.78053284
Traceback (most recent call last):
File "train_seq_opt.py", line 59, in <module>
train()
File "train_seq_opt.py", line 54, in train
checkpoint=True)
File "C:\Users\emrul\AppData\Local\conda\conda\envs\flair\lib\site-packages\flair\trainers\trainer.py", line 197, in train
loss = self.model.forward_loss(batch)
File "C:\Users\emrul\AppData\Local\conda\conda\envs\flair\lib\site-packages\flair\models\sequence_tagger_model.py", line 316, in forward_loss
features = self.forward(sentences)
File "C:\Users\emrul\AppData\Local\conda\conda\envs\flair\lib\site-packages\flair\models\sequence_tagger_model.py", line 368, in forward
self.embeddings.embed(sentences)
File "C:\Users\emrul\AppData\Local\conda\conda\envs\flair\lib\site-packages\flair\embeddings.py", line 145, in embed
embedding.embed(sentences)
File "C:\Users\emrul\AppData\Local\conda\conda\envs\flair\lib\site-packages\flair\embeddings.py", line 76, in embed
self._add_embeddings_internal(sentences)
File "C:\Users\emrul\AppData\Local\conda\conda\envs\flair\lib\site-packages\flair\embeddings.py", line 1248, in _add_embeddings_internal
sentences_padded, self.chars_per_chunk
File "C:\Users\emrul\AppData\Local\conda\conda\envs\flair\lib\site-packages\flair\models\language_model.py", line 130, in get_representation
prediction, rnn_output, hidden = self.forward(batch, hidden)
File "C:\Users\emrul\AppData\Local\conda\conda\envs\flair\lib\site-packages\flair\models\language_model.py", line 78, in forward
output, hidden = self.rnn(emb, hidden)
File "C:\Users\emrul\AppData\Local\conda\conda\envs\flair\lib\site-packages\torch\nn\modules\module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "C:\Users\emrul\AppData\Local\conda\conda\envs\flair\lib\site-packages\torch\nn\modules\rnn.py", line 559, in forward
return self.forward_tensor(input, hx)
File "C:\Users\emrul\AppData\Local\conda\conda\envs\flair\lib\site-packages\torch\nn\modules\rnn.py", line 539, in forward_tensor
output, hidden = self.forward_impl(input, hx, batch_sizes, max_batch_size, sorted_indices)
File "C:\Users\emrul\AppData\Local\conda\conda\envs\flair\lib\site-packages\torch\nn\modules\rnn.py", line 522, in forward_impl
self.dropout, self.training, self.bidirectional, self.batch_first)
RuntimeError: CUDA out of memory. Tried to allocate 98.00 MiB (GPU 0; 24.00 GiB total capacity; 19.03 GiB already allocated; 80.52 MiB free; 642.41 MiB cached)
@alanakbik sorry to bother you again but even after looking things up I couldn't fix my problem. My sequence tagging dataset looks like -
token1 tag1
token2 tag2
... so on
tokenN tagN
token1 tag1
token2 tag2
... so on
tokenM tagM
Spaces separate the columns and newlines separate sentences. This was working fine earlier when I was using NLPTaskDataFetcher.load_column_corpus(). But after switching to ColumnCorpus (for handling bigger datasets) I am getting tons of this warning -
ACHTUNG: An empty Sentence was created! Are there empty strings in your dataset?
and this is what I'm doing
columns = {0: 'text', 1: 'tag'}
corpus = ColumnCorpus(args.data_folder, columns, in_memory=False)
I noticed that this log has been recently added to the Sentence class so it might as well have been the case earlier but since there was no logging I assumed that my dataset format was fine. I read a bit of the code that parses the file and creates Sentence objects but I couldn't figure out why sentence objects with empty text were being created.
Can you please tell me how to fix this?
@aayushsanghavi yes this is an unfortunate bug that was added and then fixed yesterday in another PR. Could you test with the current master to confirm that the bug is now gone? Thanks!
@emrul sorry I just realized that your memory issues are CUDA related - this is different from the regular memory issues and cannot be fixed with the embeddings_in_memory
setting. Your script looks good, in particular since you've set the chars_per_chunk
parameter to 64 which should be small enough given you have enough CUDA memory.
The only things I could think of trying would be to reduce the mini_batch_size
or further reduce the chars_per_chunk
since it seems that it gets a mini-batch that is too large for the available memory. But it really is strange that the memory is not enough. Are you running only one task on this GPU or is it also being used by other processes?
Ooh right. Because when I tried it again this morning it worked. I put num_workers=6 and mini_batch_size=8
to reduce memory usage so I didn't have to use in_memory=False
Hey @alanakbik whatever changes you guys pushed recently, they're pretty good. The memory management has improved so much. Earlier my model used to take up more than 12.5 GB of RAM but now it trains in about 7 GB of RAM. Nice work!
@alanakbik Nothing else is using the GPU - the entire machine is dedicated for this task. Its an Nvidia Quadro P6000 so has a lot of memory. I've reduced mini-batch size to 32 and still the issue persists.
I'll go through my training data to see if there's anything odd in there but other than that I am stumped.
@emrul sorry for the late reply (was busy attending NAACL) - have you made progress on this? What if you set the mini-batch size to something very low, like 2?
@aayushsanghavi thanks :) we hope to optimize the process more with future releases :)
Hi @alanakbik - I think it may be an issue with my larger data set. I am investigating but I think the problem is likely to be in the dataset and not in Flair. Will update when I find something.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I previously got CUDA out of memory issues but resolved them by
This worked well with my small dataset (12.5k training sentences) but now I am trying with a bigger dataset (160k sentences) and my CUDA issue is back.
This is my code currently: ` columns = {0: 'text', 1: 'ner'} data_folder = 'training/sequences' corpus: TaggedCorpus = NLPTaskDataFetcher.load_column_corpus(data_folder, columns) print(corpus) tag_type = 'ner' tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type) print(tag_dictionary.idx2item) embeddings = StackedEmbeddings(embeddings=[ WordEmbeddings('glove'), PooledFlairEmbeddings('news-forward', pooling='mean', only_capitalized=True, use_cache=True, chars_per_chunk=32), PooledFlairEmbeddings('news-backward', pooling='mean', only_capitalized=True, use_cache=True, chars_per_chunk=32), WordEmbeddings(embeddings="embeddings/markup2vec.wv") ])
tagger: SequenceTagger = SequenceTagger(hidden_size=128, embeddings=embeddings, tag_dictionary=tag_dictionary, tag_type=tag_type)
trainer: ModelTrainer = ModelTrainer(tagger, corpus)
trainer.train('resources/taggers/v003', max_epochs=150, mini_batch_size=192, eval_mini_batch_size=16, checkpoint=True) `
and this is the error
I've tried using the latest Flair (GitHub Master) and running on Windows 10 with Python 3.7 any PyTorch 1.1.0. Any ideas on what else I can try to get training to work with a large number of samples?