en-sentiment model contains labels that don't match IMDB dataset

pommedeterresautee commented 5 years ago

Describe the bug

According to TUTORIAL 2, classification model en-sentiment has been trained on IMDB dataset.

evaluate() function on en-sentiment model trained on IMDB produces 0 score when tested on test set of IMDB.

Reasons seem to be a change in label names.

Model: ['POSITIVE'], ['NEGATIVE'] dataset: ['pos'], ???

I have not a single prediction with negative label. (which is another issue)

IMDB is not marked as deprecated in source code.

To Reproduce

from flair.datasets import IMDB, DataLoader

from flair.models import TextClassifier

classifier = TextClassifier.load('en-sentiment')
corpus = IMDB()

sentences = list(corpus.test)

test_results, a = classifier.evaluate(data_loader=DataLoader(sentences[:100], batch_size=16))
print(test_results.detailed_results)

Print:

# print(test_results.detailed_results)
MICRO_AVG: acc 0.0 - f1-score 0.0
MACRO_AVG: acc 0.0 - f1-score 0.0
NEGATIVE   tp: 0 - fp: 13 - fn: 0 - tn: 87 - precision: 0.0000 - recall: 0.0000 - accuracy: 0.0000 - f1-score: 0.0000
POSITIVE   tp: 0 - fp: 87 - fn: 0 - tn: 13 - precision: 0.0000 - recall: 0.0000 - accuracy: 0.0000 - f1-score: 0.0000

Expected behavior A score which is not zero

Solution

Retrain / reshare a new en-sentiment model? // redo IMDB dataset

Context PiPy version of Flair / same bug on master branch

Note

Btw, the model seems to not work very well.

"I never saw something that bad." -> positive "I do not like this film" -> positive "I hate this film" -> negative (finally)

tombburnell commented 5 years ago

FYI I ran the Imdb model against some opinion based news and I didn't find the results to be all that meaningful.

alanakbik commented 5 years ago

Yes this model was trained using IMDB data, i.e. film related, so its only a sentiment model for movie reviews. I have to check why the label names changed, very strange!

pommedeterresautee commented 5 years ago

@alanakbik This is very strange. I have cleaned $HOME/.flair folder. Then I executed that code

from flair.datasets import IMDB, DataLoader

from flair.models import TextClassifier

classifier = TextClassifier.load('en-sentiment')
corpus = IMDB()

sentences = list(corpus.test)

It downloads what it has to.

2019-10-05 23:01:17,042 https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/models-v0.4/classy-imdb-en-rnn-cuda%3A0/imdb-v0.4.pt not found in cache, downloading to /tmp/tmpgxn7mqor
100%|██████████| 1501979561/1501979561 [00:32<00:00, 45600891.88B/s]
2019-10-05 23:01:50,139 copying /tmp/tmpgxn7mqor to cache at /home/.../.flair/models/imdb-v0.4.pt
2019-10-05 23:01:51,112 removing temp file /tmp/tmpgxn7mqor
2019-10-05 23:01:51,227 loading file /home/.../.flair/models/imdb-v0.4.pt
2019-10-05 23:02:00,766 http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz not found in cache, downloading to /tmp/tmpm1n0lup3
100%|██████████| 84125825/84125825 [00:10<00:00, 8308111.70B/s] 
2019-10-05 23:02:11,202 copying /tmp/tmpm1n0lup3 to cache at /home/.../.flair/datasets/imdb/aclImdb_v1.tar.gz
2019-10-05 23:02:11,267 removing temp file /tmp/tmpm1n0lup3
2019-10-05 23:02:22,099 Reading data from /home/.../.flair/datasets/imdb
2019-10-05 23:02:22,099 Train: /home/.../.flair/datasets/imdb/train.txt
2019-10-05 23:02:22,099 Dev: None
2019-10-05 23:02:22,099 Test: /home/.../.flair/datasets/imdb/test.txt

Then I do

sentences[0].labels
# Out[12]: [pos (1.0)]

And

a = classifier.predict(sentences=sentences[:100],
                   mini_batch_size=16,
                   embedding_storage_mode="none")
a[0].labels
# Out[13]: [POSITIVE (0.9999998807907104)]

So labels ARE different.

Of course, now:

sentences[0].labels
# Out[14]: [POSITIVE (0.9999998807907104)]

as sentence labels are overriden by predict() call.

first line of /home/.../.flair/datasets/imdb/test.txt

__label__pos Back in 1982 a little film called MAKING LOVE shocked audiences with its frank and open depiction of a romantic love story that just happened to be about two men.<br /><br />I have been waiting for years for a good, old-fashioned romance between two men; LATTER DAYS is all that and more.<br /><br />Yes, it is soapy, melodramatic, cliché-ridden, and quite corny. That is what makes it so wonderful. There is nothing like a good romantic movie, and this movie is romantic in the best sense of the word.<br /><br />As to the issue of religion, sorry folks, but these things do happen and are happening to gay people even now. It is not just the Mormon church that rejects its gay members. Gay people in every religion have faced harsh judgment and rejection.<br /><br />I loved this movie. It has a perfect blend of a fantasy-romance grounded in the reality of the day-to-day lives of the characters. If I could give it more than ten stars I would. Good love stories never go out of style; great love stories like LATTER DAYS are unforgettable.<br /><br />It's about time!

To make it short, label from model is POSITIVE and label from dataset is pos.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

alanakbik commented 4 years ago

PR #1545 adds new sentiment datasets and homogenizes labels across datasets. Also, there is now an option to define "name maps" to map label names to other names.

flairNLP / flair

en-sentiment model contains labels that don't match IMDB dataset #1165