References for speed and performance of text classification on 20 newsgroups

peldszus commented 5 years ago

Hi!

I'm experimenting with training a TextClassifier on the 20 newsgroups dataset. My interest is the performance of the pure flair embeddings, especially.

With my example code, though, neither speed nor performance are convincing yet, see below. But I most likely overlooked something obvious.

I'd be happy if you could share references/experiences with using the flair library on this dataset regarding both speed (necessary training time) and performance (F1).

This is the code I ran on a GTX 1080 8gb, with flair==0.4.2:

from flair.embeddings import FlairEmbeddings, DocumentPoolEmbeddings
from flair.datasets import NEWSGROUPS
from flair.models import TextClassifier
from flair.trainers import ModelTrainer

corpus = NEWSGROUPS()
label_dict = corpus.make_label_dictionary()

document_embeddings = DocumentPoolEmbeddings([
    FlairEmbeddings('news-forward-fast'),
    FlairEmbeddings('news-backward-fast'),
])
classifier = TextClassifier(document_embeddings, label_dictionary=label_dict)

trainer = ModelTrainer(classifier, corpus)
trainer.train(
    'resources/clfs/20ngs',
    learning_rate=0.1,
    mini_batch_size=4,
    anneal_factor=0.5,
    patience=5,
    max_epochs=150,
)

Some observations:

The mini_batch_size needs to be quite small, otherwise it would not be possible to train and evaluate one epoch with only 8gb GPU RAM.
One epoch takes about 40min to train and 30min (!) to evaluate.
Even after training for nearly 4 days, the results are still pretty bad. (This is somewhat astonishing to me, as simple models can get into the 80% easily in a few minutes on CPU.) I understand that contextual string embeddings are somewhat expensive for longer text input, but still minutes vs days...
The learning rate was reduced quite a few times already, which is why I wouldn't await any drastic improvements when training longer.

Here just the first and last epochs results, for more see the attached log. training.log

2019-07-12 15:31:04,284 EPOCH 1 done: loss 2.2857 - lr 0.1000 - bad epochs 0
2019-07-12 15:35:09,671 DEV : loss 2.0400612354278564 - score 0.3165
2019-07-12 16:01:18,281 TEST : loss 2.083500862121582 - score 0.2864
...
2019-07-16 06:57:08,258 EPOCH 75 done: loss 0.7980 - lr 0.0031 - bad epochs 2
2019-07-16 07:01:13,513 DEV : loss 1.1750359535217285 - score 0.6463
2019-07-16 07:27:19,122 TEST : loss 1.416900396347046 - score 0.5794

I'd be happy if you could shed some light on this. Thanks!

alanakbik commented 5 years ago

Hello @peldszus - generally, I would expect the performance of this configuration to be fairly poor. For most tasks, FlairEmbeddings should generally be used in a stack with word embeddings since from our observations they tend to be good at modeling syntax, morphology and shallow semantics, but for full text classification you typically require more explicit word-level semantics. That said, there are a few things you could try here:

By default, DocumentPoolEmbeddings only do linear fine-tuning. You could try non-linear fine-tuning which is more appropriate if the embedding space is far from the actual task which it might be in this case. Pass the parameter fine_tune_mode='nonlinear' in the constructor of DocumentPoolEmbeddings to enable this.
70 min for one epoch seems extremely long, though this may have something to do with the small mini batch size. However, there should also almost be no limitation on your mini_batch_size given that you have 8GB of GPU memory since long batches are actually split up in the forward pass of the FlairEmbeddings to fit into GPU memory. You could try reducing the chars_per_chunk parameter to reduce the size of one chunk, like so:

document_embeddings = DocumentPoolEmbeddings([
    FlairEmbeddings('news-forward-fast', chars_per_chunk=128),
    FlairEmbeddings('news-backward-fast', chars_per_chunk=128),
], fine_tune_mode='nonlinear')

And then increase mini-batch size to 32 or 64. In addition, after the first epoch, embeddings are kept in memory so typically from the second epoch all epochs should run faster. Is this not happening on your end?

You could also try DocumentRNNEmbeddings which in our use cases always outperform the pooling alternatives.
You could reduce patience to 1 since more will likely not be needed.

peldszus commented 5 years ago

Hi Alan, thanks for the quick response!

Regarding more complex models (non-linear fine-tuning in pooling, DocumentRNNEmbeddings), I was aware of these options, but I wanted to start with the simpler ones, also given the training speed I was experiencing. But I'll try this for sure. :)

In addition, after the first epoch, embeddings are kept in memory so typically from the second epoch all epochs should run faster. Is this not happening on your end?

Unfortunately not. From the first to the last epoch, training took 40min and evaluation 30min, see the training log that I attached above.

I now set chars_per_chunk=128:

When then raising the mini-batch size to 32, I get RuntimeError: CUDA out of memory quite immediately.
With a mini-batch size of 16, I get the memory error after a minute.
With a mini-batch size of 8, it's currently running through the first half of epoch 1, consuming roughly 6 of 8gb GPU memory. The last time I tried this (without chars_per_chunk) it finished the first epoch's training, but then raised the memory error when doing the evaluation. Edit: Turned out as assumed: Evaluation on dev was successful, but before seeing the result on test, the memory errors is raised.

I'll give an update on this tomorrow. Maybe also with some trials with chars_per_chunk=64, then. It's already late. :)

peldszus commented 5 years ago

PS: The corpus is very likely to contain longer sequences. I guess this problem is fully solved by chars_per_chunk? So cropping to a max sequence length as in https://github.com/zalandoresearch/flair/issues/685#issuecomment-490466969 is not necessary?

alanakbik commented 5 years ago

It should be solved in the sense that the RNN should no longer give OOM errors since it is fed only one chunk at a time. Theoretically, this allows for sequences of unlimited length (during prediction). However, after this there is a step in which the output chunks get concatenated into a tensor. If the resulting tensor is so large that it doesn't fit into memory there may still be a CUDA OOM error. Should this happen, please let me know - we are in the process of refactoring for efficiency so this type of feedback would be helpful.

Another thing to try wrt speed would be to use the current master branch since 0.4.2 at this point is already over a month old and we've recently pushed some changes in memory management. You can install current master branch via:

pip install --upgrade git+https://github.com/zalandoresearch/flair.git

There, you can set an embedding_storage_mode in the ModelTrainer that can be one of 'cpu', 'gpu' and 'none'. It defaults to 'cpu' which should be best for most users. We'd be very happy to get feedback on this wrt speed performance.

peldszus commented 5 years ago

Ok, I'm now on git master (8b50e2e). Ubuntu 18.04, this time at home with a GTX 1070 ti 8gb.

Here is what I observe:

chars_per_chunk=default, mini_batch_size=32 -> immediate OOM
chars_per_chunk=64, mini_batch_size=32 -> very early OOM
chars_per_chunk=64, mini_batch_size=16 -> very early OOM
chars_per_chunk=64, mini_batch_size=8 -> works now

With this working configuration the first and second epoch both took each 1h training and 7min evaluation (which was on dev only, not on the 7x larger test set).

Since I set epochs to 2, the best model was then loaded for testing in the end ... which ended up causing OOM.

All these runs were with the default embedding_storage_mode=cpu. I'll try the gpu mode tomorrow.

PS: The model topology overview printed out in the beginning is a nice thing. :+1:

alanakbik commented 5 years ago

Ok thanks for posting this. I'll see if I can reproduce!

alanakbik commented 5 years ago

Hello @peldszus I've looked into the dataset and I did not realize how huge each data point is (essentially each data point is a whole document). I was also getting OOM errors, but realized that this happened in a mini-batch with a data point that has over 60000 characters (causing every other data point in the mini-batch to be padded to this length). So the above-mentioned problem of the result tensor not even fitting into memory did in fact occur.

This is in theory fixable by keeping the cat operation on GPU, but this will not be fast and maybe for data points of this size we need to find an entirely different solution. So in this use case, given that there are so many words per document, normal word embeddings should probably do the trick and will just be much faster.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

flairNLP / flair

References for speed and performance of text classification on 20 newsgroups #903