TheophileBlard / french-sentiment-analysis-with-bert

How good is BERT ? Comparing BERT to other state-of-the-art approaches on a French sentiment analysis dataset
MIT License
146 stars 35 forks

create an api / online demo #1

Closed ghost closed 4 years ago

ghost commented 4 years ago

Hi,

Hope you are all well !

Yeah, for sure, an online service demo would be awesome. When do you think you can make it live ?

Cheers, Luc Michalski

TheophileBlard commented 4 years ago

Hi, I created an online Colab notebook demo that lets you run the model on your own sentences. Developing an actual API, however, is beyond the scope of this project.

Keep me posted if it doesn't work on your side!

ghost commented 4 years ago

Hi,

Thanks for the reply. When I tried to classify a sentence, it triggered the following error:

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-d42d94d0934f> in on_button_clicked(b)
     29 def on_button_clicked(b):
     30   text = text_area.value
---> 31   X = preprocessor.transform([text])
     32   scores = model.predict(X)
     33   y_pred = np.argmax(scores[0], axis=1)

NameError: name 'preprocessor' is not defined

Btw, if you give me a Python script that analyses one example sentence, I can build a REST server with Flask and test it on a large number of tweets collected with a scraping tool like https://github.com/twintproject/twint. That would be awesome.

Thanks in advance for your input and insights.

TheophileBlard commented 4 years ago

This is because you didn't run the other cells of the notebook. I tried to put together a user-friendly interface, but ultimately notebooks are nothing but code. You can run all cells with the Runtime/Run all menu entry, or with Ctrl+F9. This should not take more than 1-2 minutes.

Everything you need to do inference on one sentence is in the notebook ! Thanks to the transformers library, this is very short :

import numpy as np
import tensorflow as tf
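# Compare the major version as an integer: comparing strings would misorder e.g. "10.0" and "2.0"
assert int(tf.__version__.split(".")[0]) >= 2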

from transformers import CamembertTokenizer, TFCamembertForSequenceClassification

# Preprocessing 
def encode_reviews(tokenizer, reviews, max_length):
    token_ids = np.zeros(shape=(len(reviews), max_length),
                         dtype=np.int32)
    for i, review in enumerate(reviews):
        encoded = tokenizer.encode(review, max_length=max_length)
        token_ids[i, 0:len(encoded)] = encoded
    attention_mask = (token_ids != 0).astype(np.int32)
    return {"input_ids": token_ids, "attention_mask": attention_mask}

# Load model 
MODEL_FOLDER = "camembert_sentiment" # Local model
tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
model = TFCamembertForSequenceClassification.from_pretrained(MODEL_FOLDER)

# Inference
MAX_SEQ_LEN = 400
text = "Ce film était génial !"
X = encode_reviews(tokenizer, [text], MAX_SEQ_LEN)
scores = model.predict(X)
y_pred = np.argmax(scores[0], axis=1)
# y_pred[0] is 0 if negative, 1 if positive
# here, y_pred[0] should be 1

You need to download the pre-trained model, and change the MODEL_FOLDER variable if needed. You can also run inference on multiple sentences at the same time by changing the last lines accordingly.

Let me know if you manage to make your REST server live !

ghost commented 4 years ago

awesome thanks a lot, I am starting now

ghost commented 4 years ago

Last stupid question ^^, What do you think of https://github.com/NVIDIA/sentiment-discovery ? Is it compatible with camemBERT ?

TheophileBlard commented 4 years ago

CamemBERT is basically Facebook's RoBERTa model trained on a French corpus. RoBERTa/CamemBERT are Language Models. On its own, a Language Model can do little more than predict words (next, previous or masked ones, depending on how it was trained). To use it for anything else, you need to finetune it on a downstream task, for example sentiment analysis (which is what I do in my repo).

The repo you sent me seems to be another approach for training and finetuning Language Models. However, they are using their own models, as I see no references to BERT. Moreover, they do share their pre-trained weights, but only for English. Training a Language Model from scratch is difficult: for CamemBERT they used 138GB of data and 256 Nvidia V100 GPUs.

If you want to add CamemBERT to their repo, you will have to modify the Python code. Because it is PyTorch-based, the transformers library should help.

ghost commented 4 years ago

Thanks for the clarification :-)

I'm still working on the REST API. I'll keep you posted.

Merci

ghost commented 4 years ago

Here is what I wrote for the rest api: https://gist.github.com/x0rzkov/111f8081c30c5ed82268bbca30729072

but I have the following error, can you spot why it happens ?

src % python3 allocine.py
2020-03-31 10:34:21.619626: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-03-31 10:34:21.772516: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fb4525cdb30 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-03-31 10:34:21.772564: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
 * Serving Flask app "allocine" (lazy loading)
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: off
 * Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)
text: Ce film était génial 
Traceback (most recent call last):
  File "allocine.py", line 45, in process
    y_pred = np.argmax(scores[0], axis=1)
  File "<__array_function__ internals>", line 6, in argmax
  File "/usr/local/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 1153, in argmax
    return _wrapfunc(a, 'argmax', axis=axis, out=out)
  File "/usr/local/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 61, in _wrapfunc
    return bound(*args, **kwds)
numpy.AxisError: axis 1 is out of bounds for array of dimension 1
127.0.0.1 - - [31/Mar/2020 10:35:29] "POST /process HTTP/1.1" 400 -

TheophileBlard commented 4 years ago

This is strange. The error is triggered by the numpy argmax function. Can you add these lines, and give me the output on a sample sentence ?

print(type(scores))
print(scores)
print(scores[0])
print(scores[0].shape)

On my end, I have this:

[image not preserved]

If your shape is (2,), try replacing that line with y_pred = np.argmax(scores[0]) or y_pred = np.argmax(scores, axis=1). The model output may differ on your side because you're using the CPU version, or because of some other dependency difference.

ghost commented 4 years ago

I get the following. I'm using Python 3.7.7 on Mac OS X.

% python3 allocine.py
2020-04-01 06:11:44.236649: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-04-01 06:11:44.381179: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fe8df4b0db0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-04-01 06:11:44.381221: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
 * Serving Flask app "allocine" (lazy loading)
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: off
 * Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)
text: Ce film était génial 
<class 'numpy.ndarray'>
[[-2.947382  2.92543 ]]
[-2.947382  2.92543 ]
(2,)
Traceback (most recent call last):
  File "allocine.py", line 50, in process
    y_pred = np.argmax(scores[0], axis=1)
  File "<__array_function__ internals>", line 6, in argmax
  File "/usr/local/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 1153, in argmax
    return _wrapfunc(a, 'argmax', axis=axis, out=out)
  File "/usr/local/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 61, in _wrapfunc
    return bound(*args, **kwds)
numpy.AxisError: axis 1 is out of bounds for array of dimension 1

ghost commented 4 years ago

With y_pred = np.argmax(scores[0]) it works. Is there a way to handle both cases ?

TheophileBlard commented 4 years ago

In fact, I think your code is working better than mine. I don't know why predict returns a tuple on my side.

On the other hand, your predict function returns a numpy array of shape (N, 2), with N the number of inputs and 2 the number of output logits (one per class). This is great. I think this should work :

text = "super film !"
X = encode_reviews(tokenizer, [text], MAX_SEQ_LEN)
scores = model.predict(X)
y_pred = np.argmax(scores, axis=1)
pred = y_pred[0] # should be 1

And you can also make multiple predictions with the same code:

text_1 = "super film !"
text_2 = "très nul :/"
X = encode_reviews(tokenizer, [text_1, text_2], MAX_SEQ_LEN)
scores = model.predict(X)
y_pred = np.argmax(scores, axis=1)
pred_1 = y_pred[0] # should be 1
pred_2 = y_pred[1] # should be 0

You might have to take a look at the numpy documentation.
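To handle both return types seen in this thread (a tuple whose first element holds the logits, or the array itself), something like the sketch below might work. The helper name predict_labels is hypothetical, and the code assumes the logits always come with two columns, as printed above; it has not been checked against every tensorflow/transformers version.

```python
import numpy as np


def predict_labels(scores):
    """Return one class index per input (0 = negative, 1 = positive).

    Hypothetical helper: accepts either the array of logits itself or a
    tuple whose first element is that array, as reported in this thread.
    """
    # Case 1: predict returned a tuple, the logits are its first element.
    if isinstance(scores, tuple):
        scores = scores[0]
    # Case 2: a single row of shape (2,) becomes (1, 2); (N, 2) is unchanged.
    logits = np.atleast_2d(np.asarray(scores))
    return np.argmax(logits, axis=1)


# Usage with the values printed earlier in the thread
print(predict_labels(np.array([[-2.947382, 2.92543]])))     # bare array
print(predict_labels((np.array([[-2.947382, 2.92543]]),)))  # tuple-wrapped
```

In the script, this would replace the np.argmax line: y_pred = predict_labels(model.predict(X)).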

ghost commented 4 years ago

Thanks for your reply ! I'll test it now.

Is it possible to get a neutral prediction, for example when the score falls between 0.4 and 0.6 ?

TheophileBlard commented 4 years ago

I didn't train the network to do that. It only has 2 classes : negative and positive reviews.

Neutral reviews might be those where the output logits (columns of the scores array) are very close, because the network "hesitates" between negative and positive. You can also turn the logits into the probability of a review being negative or positive, for example with a softmax.
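A minimal sketch of that idea, assuming the two logits are ordered [negative, positive] as in the scores array above. The function names and the 0.4/0.6 band are illustrative (the band comes from the question, not from the model): the network was never trained on a neutral class, so "neutral" here only means "low confidence".

```python
import math


def sentiment_probabilities(logits):
    """Softmax over the two logits, returned as (p_negative, p_positive)."""
    neg, pos = float(logits[0]), float(logits[1])
    m = max(neg, pos)  # subtract the max for numerical stability
    e_neg, e_pos = math.exp(neg - m), math.exp(pos - m)
    total = e_neg + e_pos
    return e_neg / total, e_pos / total


def label_with_neutral(logits, low=0.4, high=0.6):
    """Label a review, calling it 'neutral' when p_positive is in [low, high].

    `low` and `high` are arbitrary thresholds chosen for illustration.
    """
    _, p_pos = sentiment_probabilities(logits)
    if p_pos < low:
        return "negative"
    if p_pos > high:
        return "positive"
    return "neutral"


# Logits printed earlier in the thread for "Ce film était génial"
p_neg, p_pos = sentiment_probabilities([-2.947382, 2.92543])
print(p_neg, p_pos)
print(label_with_neutral([-2.947382, 2.92543]))  # confident, far from the band
print(label_with_neutral([0.1, 0.0]))            # nearly equal logits
```

Each row of the scores array can be passed directly, e.g. label_with_neutral(scores[i]).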

TheophileBlard commented 4 years ago

Hi @lucmichalski, how is your Twitter project going ?

Just wanted to let you know that the CamemBERT model is now integrated in the transformers project, and you can now perform inference really easily :

from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("tblard/tf-allocine")
model = TFAutoModelForSequenceClassification.from_pretrained("tblard/tf-allocine")
nlp = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)

print(nlp("Alad'2 est clairement le meilleur film de l'année 2018.")) # POSITIVE
print(nlp("Je m'attendais à mieux de la part de Franck Dubosc !")) # NEGATIVE

The colab demo has also been updated accordingly.