Hi Emilie!
Your code sample actually runs without errors on the Colab demo with Python 3.7 and transformers 4.12.3. Can you provide more information on your environment (Python & package versions)?
Fundamentally, the issue seems to come from the number of tokens generated from the input text (the BERT model can only handle up to 512 tokens). However, the pipeline should automatically truncate the input. This similar issue might help: https://github.com/huggingface/transformers/issues/11065
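If you want to check whether your text really overflows that limit, here is a minimal sketch (assuming the tblard/tf-allocine checkpoint from your snippet; the review string is invented):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tblard/tf-allocine")
review = "Un film magnifique, mais beaucoup trop long. " * 100  # made-up long review
# encode() adds the special tokens, just like the pipeline does internally
print(len(tokenizer.encode(review)))  # anything above 512 overflows the model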
Hi Theophile! I am working in a Jupyter Notebook on Azure Machine Learning Studio. Here is some information about the environment:
Python version: 3.8.1 (default, Jan 8 2020, 22:29:32)
Tensorflow version: 2.6.0
Transformers version: 4.12.2
You say that the algorithm truncates the input automatically. How? From the beginning of the string? From the end? And when you say tokens, you mean NLP tokens (like words) and not characters, right?
Hi again! I managed to reproduce your issue and to find a possible fix. Can you please try again with the following lines:
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained("tblard/tf-allocine", use_fast=True)
model = TFAutoModelForSequenceClassification.from_pretrained("tblard/tf-allocine")
# truncation and max_length make the pipeline cut inputs down to 512 tokens
nlp = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer, truncation=True, max_length=512)
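With those arguments the pipeline should truncate long inputs instead of crashing; a quick sanity check (the review text here is made up):

long_review = "Je déteste ce film. " * 200  # far more than 512 tokens
print(nlp(long_review))  # e.g. [{'label': 'NEGATIVE', 'score': 0.99...}]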
The culprit might be the tokenizer, which by default has tokenizer.model_max_length = 1000000000000000019884624838656. I also noticed that the following code works:
# same imports as above; here the 512-token limit is set on the tokenizer itself
tokenizer = AutoTokenizer.from_pretrained("tblard/tf-allocine", use_fast=True, model_max_length=512)
model = TFAutoModelForSequenceClassification.from_pretrained("tblard/tf-allocine")
nlp = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer, truncation=True)
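Both variants should have the same effect; the difference is where the limit lives. The first passes the truncation settings to the pipeline, while the second bakes model_max_length=512 into the tokenizer itself, so anything else that reuses that tokenizer inherits the limit too.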
Now, regarding truncation, there's in-depth documentation here. For text classification it's simple: the tokenizer keeps the first 512 (max_length) generated tokens (so, yes: it keeps the beginning and ignores the end of the text). You can observe this yourself with the following code:
text = "J'aime le camembert"
# keep at most 5 tokens in total, including the special tokens added automatically
tokens = tokenizer.encode_plus(text, max_length=5, truncation=True)
truncated_text = tokenizer.decode(tokens['input_ids'], skip_special_tokens=True)
print(len(tokens['input_ids'])) # 5
print(truncated_text) # J'aime
Because we use subword tokenizers, the number of generated tokens is greater than the number of words, as the tokenizer may break a word down into multiple tokens. It also automatically adds special tokens. Truncation is often the best way to deal with long sentences; in your example the polarity can typically be inferred from the first word. Also keep in mind that the model was trained on the Allociné dataset, which does not contain many long reviews.
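You can inspect the subword pieces directly; a small sketch (the exact pieces depend on the vocabulary, so the output shown is only indicative):

print(tokenizer.tokenize("J'aime le camembert"))
# e.g. ['▁J', "'", 'aime', '▁le', '▁camembert'] -- more pieces than words;
# encode_plus() then adds special tokens on top, which also count toward max_length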
Thank you so much for your in-depth response. I will try the proposed code and get back to you when I can.
Hi @emiliepicardcantin, did you manage to make it work?
I have quite long texts that I want to label using your module. I run into the same problem over and over again using the nlp pipeline. Here is my code:
I get the following error:
See attachment for more details on the error. Can someone help me? Thank you!
error_nlp_pipeline.txt