Generating predictions - Githubissues

chuktuk commented 4 years ago

I'm using a pretrained model for sentiment analysis, and I'm trying to assess its accuracy on my data. I want to generate predictions for my data and compare to the labels (supervised style accuracy). I'd like to create the confusion matrix and evaluate typical metrics. Maybe I'm just missing this functionality somewhere, but I'm trying to loop predict calls, and it is taking a very long time (hours) to run, and my machine has 16 GB RAM. I would appreciate it if someone could help with this issue.

For my code below, test_text is pandas series containing 124355 reviews for sentiment analysis. Each entry could be a few words or a paragraph. I also tried creating functions and using the .apply() method of a series, and this also took hours to run.

This is my code (the formatting is off for the for loop, maybe I just formatted it wrong here, it is indented in my code):

from flair.models import TextClassifier
classifier = TextClassifier.load('en-sentiment')
from flair.data import Sentence

scores = defaultdict(float)
values = defaultdict(str)
i = 0

for text in test_text:
        sentence = Sentence(text)
        classifier.predict(sentence)
        scores[i] = sentence.labels[0].score
        values[i] = sentence.labels[0].value
        i+=1

the output of sentence.labels[0].score is a float that appears to represent the confidence in the value the output of sentence.labels[0].value is either 'POSITIVE' or 'NEGATIVE'

alanakbik commented 4 years ago

Hello @chuktuk you should use batching to increase speed, i.e. always passing a list of 16, 32 or even 64 sentences at once. For this, you need to split your data into a list of lists and always pass a list of Sentence to the predict function. It works like this:

# a list of your sentences
sentences = [Sentence('I love this movie'), Sentence('This movie is terrible')]

# predict for all sentences
classifier.predict(sentences, mini_batch_size=32)

# check predictions
for sentence in sentences:
        scores[i] = sentence.labels[0].score
        values[i] = sentence.labels[0].value
        i+=1

That should very much increase speed. The mini-batch size parameter controls how many sentences are analyzed at the same time. So if your list is longer than 32 in this example, it will be split automatically.

One thing to be aware of is that the sentiment model was trained on movie reviews, so it may not work very well for other types of sentiment data.

chuktuk commented 4 years ago

Thanks for the tip. I'm trying to run in Jupyter Lab and on Google Colab to see if I can get it to work this way. Currently, I'm running the following code, and I got it to work running on only 1000 reviews. I'm trying 6250 currently on Jupyter, and Google Colab is running on 50000 reviews using the TPU. I'll report back if I can get it to work with a significant number of reviews.

import pandas as pd
from collections import deque
from flair.models import TextClassifier
from flair.data import Sentence

classifier = TextClassifier.load('en-sentiment')

# use deque to save memory, only use 6250 reviews
sentences1 = deque()
test_text1 = test_text[:6250] # test_text1 is a pandas series of reviews

# get deque of sentences
for text in test_text1:
    sentences1.append(Sentence(text))

# initialize values for loop
scores = defaultdict(float)
values = defaultdict(str)
i = 0

# predict for all sentences (mini_batch_size=32 worked with 1000 reviews on Colab)
classifier.predict(sentences1, mini_batch_size=16)

# deque version to save memory
while len(sentences1) > 0:
    sentence = sentences1.popleft()
    scores[i] = sentence.labels[0].score
    values[i] = sentence.labels[0].value
    i+=1

# convert to dataframe
df1 = pd.DataFrame({'scores': scores, 'values': values})

chuktuk commented 4 years ago

I was able to get this to run on Google Colab using mini_batch_size=32.

Another quick question, I had inadequate F1 scores with just 0.41 on the negative class, accuracy of 0.76, and auc of 0.5. I'm training my own model, since the vocabulary is likely very different from the pretrained model, and I was wondering if there is an easy way to get a report of the evaluation metrics from the training of the model, as validation and test datasets are included in the training process. I couldn't find anything in the documentation that looks like what I'm looking for.

Thanks

alanakbik commented 4 years ago

If you use our training routines, the validation score should be printed and added to the log files at the end of each epoch. If you also want to monitor the score on the test data, you can set monitor_test=True in the ModelTrainer.

chuktuk commented 4 years ago

Thanks, I've included this in the trainer. I ran last night, but unfortunately Google Colab timed out and I lost all of the files used for/generated by training. I'm running again now when I can keep it from timing out.

If anyone else reads this, you can extend the timeout period for Google Colab by opening the inspector (ctrl + shift + i), then clicking Console at the top, and entering the following JS command:

function ClickConnect(){
console.log("Working"); 
document.querySelector("colab-toolbar-button#connect").click() 
}
setInterval(ClickConnect,60000)

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

flairNLP / flair

Generating predictions #1443