Closed peldszus closed 4 years ago
Hello @peldszus - generally, I would expect the performance of this configuration to be fairly poor. For most tasks, `FlairEmbeddings` should be used in a stack with word embeddings: from our observations, they tend to be good at modeling syntax, morphology and shallow semantics, but full text classification typically requires more explicit word-level semantics. That said, there are a few things you could try here:
By default, `DocumentPoolEmbeddings` only does linear fine-tuning. You could try non-linear fine-tuning, which is more appropriate if the embedding space is far from the actual task, as it might be in this case. Pass the parameter `fine_tune_mode='nonlinear'` in the constructor of `DocumentPoolEmbeddings` to enable this.
70 min for one epoch seems extremely long, though this may have something to do with the small mini-batch size. That said, there should be almost no limitation on your `mini_batch_size` given that you have 8 GB of GPU memory, since long sequences are split up in the forward pass of `FlairEmbeddings` to fit into GPU memory. You could try reducing the `chars_per_chunk` parameter to shrink each chunk, like so:
```python
document_embeddings = DocumentPoolEmbeddings([
    FlairEmbeddings('news-forward-fast', chars_per_chunk=128),
    FlairEmbeddings('news-backward-fast', chars_per_chunk=128),
], fine_tune_mode='nonlinear')
```
Then increase the mini-batch size to 32 or 64. In addition, after the first epoch, embeddings are kept in memory, so from the second epoch onward training should typically run faster. Is this not happening on your end?
You could also try `DocumentRNNEmbeddings`, which in our use cases always outperforms the pooling alternatives.
You could reduce patience to 1 since more will likely not be needed.
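Putting these suggestions together, a minimal sketch might look like the following. This is illustrative only: it assumes flair is installed, `corpus` is an already-loaded classification corpus, and the hyperparameter values (`hidden_size=256`, `max_epochs=10`, the output directory) are assumptions, not tuned settings.

```python
# Illustrative hyperparameters drawn from the suggestions above (not tuned).
TRAIN_PARAMS = {"mini_batch_size": 32, "patience": 1, "max_epochs": 10}

def build_and_train(corpus, out_dir="resources/20news"):
    # flair imports are kept inside the function so the sketch can be
    # read (and its parameters inspected) even where flair is absent
    from flair.embeddings import FlairEmbeddings, DocumentRNNEmbeddings
    from flair.models import TextClassifier
    from flair.trainers import ModelTrainer

    # DocumentRNNEmbeddings runs a trainable RNN over the token
    # embeddings, the alternative suggested above to pooling
    document_embeddings = DocumentRNNEmbeddings([
        FlairEmbeddings('news-forward-fast', chars_per_chunk=64),
        FlairEmbeddings('news-backward-fast', chars_per_chunk=64),
    ], hidden_size=256)

    classifier = TextClassifier(
        document_embeddings,
        label_dictionary=corpus.make_label_dictionary(),
    )
    trainer = ModelTrainer(classifier, corpus)
    trainer.train(out_dir, **TRAIN_PARAMS)
```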
Hi Alan, thanks for the quick response!
Regarding more complex models (non-linear fine-tuning in pooling, DocumentRNNEmbeddings), I was aware of these options, but I wanted to start with the simpler ones, also given the training speed I was experiencing. But I'll try this for sure. :)
> In addition, after the first epoch, embeddings are kept in memory so typically from the second epoch all epochs should run faster. Is this not happening on your end?
Unfortunately not. From the first to the last epoch, training took 40 min and evaluation 30 min; see the training log attached above.
I now set `chars_per_chunk=128` and got

```
RuntimeError: CUDA out of memory
```

almost immediately. I'll give an update on this tomorrow, maybe with some trials at `chars_per_chunk=64`. It's already late. :)
PS: The corpus is very likely to contain longer sequences. I guess this problem is fully solved by `chars_per_chunk`? So cropping to a max sequence length as in https://github.com/zalandoresearch/flair/issues/685#issuecomment-490466969 is not necessary?
It should be solved in the sense that the RNN should no longer give OOM errors since it is fed only one chunk at a time. Theoretically, this allows for sequences of unlimited length (during prediction). However, after this there is a step in which the output chunks get concatenated into a tensor. If the resulting tensor is so large that it doesn't fit into memory there may still be a CUDA OOM error. Should this happen, please let me know - we are in the process of refactoring for efficiency so this type of feedback would be helpful.
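The chunking idea can be sketched in plain Python: the sequence is processed one chunk at a time, so peak memory during processing is bounded by roughly one chunk, but the per-chunk outputs are then concatenated, and that final result still grows with the full sequence length — which is exactly where a very long document can still cause an OOM. The `step` function here is a stand-in for the per-character forward pass; the whole function is a toy model, not flair's actual implementation.

```python
def process_in_chunks(chars, chars_per_chunk, step):
    """Toy model of chunked processing: `step` stands in for the RNN
    forward pass over one chunk; outputs are concatenated at the end."""
    outputs = []
    for start in range(0, len(chars), chars_per_chunk):
        chunk = chars[start:start + chars_per_chunk]  # at most chars_per_chunk items
        outputs.append([step(c) for c in chunk])      # peak memory ~ one chunk
    # the concatenation step: this result scales with the FULL length,
    # which is where an OOM can still occur for very long documents
    return [o for chunk_out in outputs for o in chunk_out]

result = process_in_chunks("abcdefgh", chars_per_chunk=3, step=str.upper)
# result == ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
```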
Another thing to try wrt speed would be to use the current master branch, since 0.4.2 is by now over a month old and we've recently pushed some changes in memory management. You can install the current master branch via:
```
pip install --upgrade git+https://github.com/zalandoresearch/flair.git
```
There, you can set an `embedding_storage_mode` in the `ModelTrainer` that can be one of `'cpu'`, `'gpu'` and `'none'`. It defaults to `'cpu'`, which should be best for most users. We'd be very happy to get feedback on this wrt speed performance.
Ok, I'm now on git master (8b50e2e). Ubuntu 18.04, this time at home with a GTX 1070 Ti 8 GB.
Here is what I observe:
With this working configuration, the first and second epochs each took 1 h of training and 7 min of evaluation (on dev only, not on the 7× larger test set).
Since I set epochs to 2, the best model was then loaded for testing in the end ... which ended up causing OOM.
All these runs were with the default `embedding_storage_mode='cpu'`. I'll try the `'gpu'` mode tomorrow.
PS: The model topology overview printed out in the beginning is a nice thing. :+1:
Ok thanks for posting this. I'll see if I can reproduce!
Hello @peldszus I've looked into the dataset and I did not realize how huge each data point is (essentially each data point is a whole document). I was also getting OOM errors, but realized that this happened in a mini-batch with a data point that has over 60000 characters (causing every other data point in the mini-batch to be padded to this length). So the above-mentioned problem of the result tensor not even fitting into memory did in fact occur.
This is in theory fixable by keeping the cat operation on CPU, but this will not be fast, and maybe for data points of this size we need to find an entirely different solution. So in this use case, given that there are so many words per document, normal word embeddings should probably do the trick and will just be much faster.
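A rough back-of-envelope shows why one huge document sinks the whole mini-batch. Every item in the batch is padded to the longest document's length, so the activation tensor scales with that maximum. The numbers below are assumptions for illustration (float32 activations, a hypothetical 2048-dim LM hidden state), not measurements:

```python
def padded_activation_bytes(batch_size, padded_len, hidden_dim, bytes_per_float=4):
    """Size of one padded activation tensor: every item in the batch
    is padded to the longest document's character length."""
    return batch_size * padded_len * hidden_dim * bytes_per_float

# one 60000-char document forces the whole mini-batch to that length;
# hidden_dim=2048 is an assumed LM hidden size, not checked against
# the actual 'fast' character LMs
per_direction = padded_activation_bytes(8, 60_000, 2048)
print(per_direction / 2**30)        # ~3.66 GiB for ONE direction
print(2 * per_direction / 2**30)    # ~7.32 GiB for forward + backward
```

Under these assumptions, forward plus backward activations alone approach the full 8 GB of the cards discussed here, before model weights or gradients are counted.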
Hi!
I'm experimenting with training a `TextClassifier` on the 20 newsgroups dataset. I'm especially interested in the performance of pure flair embeddings.
With my example code, though, neither speed nor performance is convincing yet (see below) - but I most likely overlooked something obvious.
I'd be happy if you could share references/experiences with using the flair library on this dataset regarding both speed (necessary training time) and performance (F1).
This is the code I ran on a GTX 1080 8 GB, with flair==0.4.2:
Some observations:
Here are just the first- and last-epoch results; for more, see the attached log: training.log
I'd be happy if you could shed some light on this. Thanks!