flairNLP / flair

A very simple framework for state-of-the-art Natural Language Processing (NLP)
https://flairnlp.github.io/flair/

References for speed and performance of text classification on 20 newsgroups #903

Closed: peldszus closed this issue 4 years ago

peldszus commented 5 years ago

Hi!

I'm experimenting with training a TextClassifier on the 20 newsgroups dataset. I'm especially interested in the performance of the pure Flair embeddings.

With my example code, though, neither speed nor performance is convincing yet (see below), but I have most likely overlooked something obvious.

I'd be happy if you could share references/experiences with using the flair library on this dataset regarding both speed (necessary training time) and performance (F1).

This is the code I ran on a GTX 1080 (8 GB) with flair==0.4.2:

from flair.embeddings import FlairEmbeddings, DocumentPoolEmbeddings
from flair.datasets import NEWSGROUPS
from flair.models import TextClassifier
from flair.trainers import ModelTrainer

# load the 20 newsgroups corpus and build the label dictionary
corpus = NEWSGROUPS()
label_dict = corpus.make_label_dictionary()

# pool the forward and backward Flair character-LM embeddings over each document
document_embeddings = DocumentPoolEmbeddings([
    FlairEmbeddings('news-forward-fast'),
    FlairEmbeddings('news-backward-fast'),
])
classifier = TextClassifier(document_embeddings, label_dictionary=label_dict)

trainer = ModelTrainer(classifier, corpus)
trainer.train(
    'resources/clfs/20ngs',
    learning_rate=0.1,
    mini_batch_size=4,
    anneal_factor=0.5,
    patience=5,
    max_epochs=150,
)

Some observations:

Here are just the first- and last-epoch results; for more, see the attached log: training.log

2019-07-12 15:31:04,284 EPOCH 1 done: loss 2.2857 - lr 0.1000 - bad epochs 0
2019-07-12 15:35:09,671 DEV : loss 2.0400612354278564 - score 0.3165
2019-07-12 16:01:18,281 TEST : loss 2.083500862121582 - score 0.2864
...
2019-07-16 06:57:08,258 EPOCH 75 done: loss 0.7980 - lr 0.0031 - bad epochs 2
2019-07-16 07:01:13,513 DEV : loss 1.1750359535217285 - score 0.6463
2019-07-16 07:27:19,122 TEST : loss 1.416900396347046 - score 0.5794

I'd be happy if you could shed some light on this. Thanks!

alanakbik commented 5 years ago

Hello @peldszus - generally, I would expect the performance of this configuration to be fairly poor. For most tasks, FlairEmbeddings should be used in a stack together with word embeddings: from our observations, they tend to be good at modeling syntax, morphology and shallow semantics, whereas full text classification typically requires more explicit word-level semantics. That said, there are a few things you could try here:

document_embeddings = DocumentPoolEmbeddings([
    FlairEmbeddings('news-forward-fast', chars_per_chunk=128),
    FlairEmbeddings('news-backward-fast', chars_per_chunk=128),
], fine_tune_mode='nonlinear')

And then increase the mini-batch size to 32 or 64. In addition, embeddings are kept in memory after the first epoch, so from the second epoch onwards all epochs should typically run faster. Is this not happening on your end?
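For reference, a stacked setup along these lines might look as follows (a sketch only: the choice of GloVe word embeddings and the output path are illustrative assumptions, not part of the suggestion above):

from flair.datasets import NEWSGROUPS
from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentPoolEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer

corpus = NEWSGROUPS()
label_dict = corpus.make_label_dictionary()

# stack classic word embeddings with the Flair character-LM embeddings;
# GloVe is just one possible choice of word embedding
document_embeddings = DocumentPoolEmbeddings([
    WordEmbeddings('glove'),
    FlairEmbeddings('news-forward-fast', chars_per_chunk=128),
    FlairEmbeddings('news-backward-fast', chars_per_chunk=128),
], fine_tune_mode='nonlinear')

classifier = TextClassifier(document_embeddings, label_dictionary=label_dict)
trainer = ModelTrainer(classifier, corpus)
trainer.train(
    'resources/clfs/20ngs-stacked',
    learning_rate=0.1,
    mini_batch_size=32,  # larger mini-batch, as recommended above
    max_epochs=150,
)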

peldszus commented 5 years ago

Hi Alan, thanks for the quick response!

Regarding the more complex models (non-linear fine-tuning in pooling, DocumentRNNEmbeddings): I was aware of these options, but I wanted to start with the simpler ones, also given the training speed I was experiencing. I'll try this for sure, though. :)

In addition, after the first epoch, embeddings are kept in memory so typically from the second epoch all epochs should run faster. Is this not happening on your end?

Unfortunately not. From the first to the last epoch, training took 40 min and evaluation 30 min per epoch; see the training log attached above.

I now set chars_per_chunk=128:

I'll give an update on this tomorrow, maybe also with some trials with chars_per_chunk=64. It's already late. :)

peldszus commented 5 years ago

PS: The corpus very likely contains long sequences. I guess this problem is fully solved by chars_per_chunk? So cropping to a maximum sequence length, as in https://github.com/zalandoresearch/flair/issues/685#issuecomment-490466969, is not necessary?

alanakbik commented 5 years ago

It should be solved in the sense that the RNN should no longer give OOM errors since it is fed only one chunk at a time. Theoretically, this allows for sequences of unlimited length (during prediction). However, after this there is a step in which the output chunks get concatenated into a tensor. If the resulting tensor is so large that it doesn't fit into memory there may still be a CUDA OOM error. Should this happen, please let me know - we are in the process of refactoring for efficiency so this type of feedback would be helpful.
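Conceptually, the chunking works roughly like this (a simplified PyTorch sketch for illustration, not flair's actual implementation):

import torch
import torch.nn as nn

def embed_in_chunks(rnn: nn.RNN, char_tensor: torch.Tensor, chars_per_chunk: int = 128) -> torch.Tensor:
    # run the RNN over a long character sequence chunk by chunk,
    # carrying the hidden state across chunk boundaries
    outputs, hidden = [], None
    for start in range(0, char_tensor.size(0), chars_per_chunk):
        chunk = char_tensor[start:start + chars_per_chunk]
        out, hidden = rnn(chunk, hidden)  # each chunk is small, so this step fits in memory
        outputs.append(out)
    # the final concatenation produces one tensor over the full sequence;
    # for extremely long documents this tensor itself can exhaust GPU memory
    return torch.cat(outputs, dim=0)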

Another thing to try with respect to speed would be the current master branch, since 0.4.2 is already over a month old at this point and we've recently pushed some changes to memory management. You can install the current master branch via:

pip install --upgrade git+https://github.com/zalandoresearch/flair.git 

There, you can set an embedding_storage_mode in the ModelTrainer, which can be one of 'cpu', 'gpu' and 'none'. It defaults to 'cpu', which should be best for most users. We'd be very happy to get feedback on how this affects speed.
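A sketch of how that setting could be passed, reusing the trainer from above (the keyword name is taken from the comment here and may differ slightly between releases, so check the train() signature of the version you install):

trainer.train(
    'resources/clfs/20ngs',
    learning_rate=0.1,
    mini_batch_size=32,
    max_epochs=150,
    # keep computed embeddings in GPU memory between epochs;
    # 'cpu' (the default) keeps them in CPU RAM, and 'none' presumably recomputes them each epoch
    embedding_storage_mode='gpu',
)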

peldszus commented 5 years ago

OK, I'm now on git master (8b50e2e), Ubuntu 18.04, this time at home with a GTX 1070 Ti (8 GB).

Here is what I observe:

With this working configuration, the first and second epoch each took 1 h of training and 7 min of evaluation (on dev only, not on the 7x larger test set).

Since I had set the number of epochs to 2, the best model was then loaded for testing at the end ... which ended up causing an OOM error.

All these runs were with the default embedding_storage_mode=cpu. I'll try the gpu mode tomorrow.

PS: The model topology overview printed at the beginning is a nice touch. :+1:

alanakbik commented 5 years ago

Ok thanks for posting this. I'll see if I can reproduce!

alanakbik commented 5 years ago

Hello @peldszus, I've looked into the dataset and had not realized how huge each data point is (essentially, each data point is a whole document). I was also getting OOM errors, but realized that this happened in a mini-batch containing a data point with over 60,000 characters (causing every other data point in the mini-batch to be padded to that length). So the above-mentioned problem of the result tensor not even fitting into memory did in fact occur.

This is in theory fixable by moving the cat operation to the CPU, but that will not be fast, and for data points of this size we may need to find an entirely different solution. So in this use case, given that there are so many words per document, normal word embeddings should probably do the trick and will simply be much faster.
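A word-embeddings-only baseline along those lines could look like the following (a sketch: GloVe, pooled over the document, is just one example choice, and the output path is made up):

from flair.datasets import NEWSGROUPS
from flair.embeddings import WordEmbeddings, DocumentPoolEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer

corpus = NEWSGROUPS()
label_dict = corpus.make_label_dictionary()

# per-word GloVe vectors pooled over the whole document; much cheaper than
# running a character LM over documents with tens of thousands of characters
document_embeddings = DocumentPoolEmbeddings([WordEmbeddings('glove')])
classifier = TextClassifier(document_embeddings, label_dictionary=label_dict)

trainer = ModelTrainer(classifier, corpus)
trainer.train('resources/clfs/20ngs-words-only', learning_rate=0.1,
              mini_batch_size=32, max_epochs=150)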

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.