Hi Emilie!
Your code sample actually runs without errors on the Colab demo with Python 3.7 and transformers 4.12.3. Can you provide more information on your environment (Python & package versions)?
Fundamentally, the issue seems to come from the number of tokens generated from the input text (the BERT model can only handle up to 512 tokens). However, the pipeline should automatically truncate the input. This similar issue might help: https://github.com/huggingface/transformers/issues/11065
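If you want to check whether your text really overflows that limit, here is a minimal sketch (assuming the tblard/tf-allocine checkpoint from your snippet; the review string is invented):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tblard/tf-allocine")
review = "Un film magnifique, mais beaucoup trop long. " * 100  # made-up long review
# encode() adds the special tokens, just like the pipeline does internally
print(len(tokenizer.encode(review)))  # anything above 512 overflows the model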
Hi Theophile! I am working in a Jupyter Notebook on Azure Machine Learning Studio. Here is some information about the environment:
Python version: 3.8.1 (default, Jan 8 2020, 22:29:32)
Tensorflow version: 2.6.0
Transformers version: 4.12.2
You say that the algorithm truncates the input automatically. How? From the beginning of the string? From the end? And when you say tokens, you mean NLP tokens (like words) and not characters, right?
Hi again! I managed to reproduce your issue and to find a possible fix. Can you please try again with the following lines:
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained("tblard/tf-allocine", use_fast=True)
model = TFAutoModelForSequenceClassification.from_pretrained("tblard/tf-allocine")
# truncation and max_length make the pipeline cut inputs down to 512 tokens
nlp = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer, truncation=True, max_length=512)
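With those arguments the pipeline should truncate long inputs instead of crashing; a quick sanity check (the review text here is made up):

long_review = "Je déteste ce film. " * 200  # far more than 512 tokens
print(nlp(long_review))  # e.g. [{'label': 'NEGATIVE', 'score': 0.99...}]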
The culprit might be the tokenizer, which by default has tokenizer.model_max_length = 1000000000000000019884624838656. I also noticed that the following code works:
# same imports as above; here the 512-token limit is set on the tokenizer itself
tokenizer = AutoTokenizer.from_pretrained("tblard/tf-allocine", use_fast=True, model_max_length=512)
model = TFAutoModelForSequenceClassification.from_pretrained("tblard/tf-allocine")
nlp = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer, truncation=True)
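Both variants should have the same effect; the difference is where the limit lives. The first passes the truncation settings to the pipeline, while the second bakes model_max_length=512 into the tokenizer itself, so anything else that reuses that tokenizer inherits the limit too.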
Now, regarding truncation, there's in-depth documentation here. For text classification it's simple: the tokenizer keeps the first 512 (max_length) generated tokens (so, yes: it keeps the beginning and ignores the end of the text). You can observe this yourself with the following code:
text = "J'aime le camembert"
# keep at most 5 tokens in total, including the special tokens added automatically
tokens = tokenizer.encode_plus(text, max_length=5, truncation=True)
truncated_text = tokenizer.decode(tokens['input_ids'], skip_special_tokens=True)
print(len(tokens['input_ids'])) # 5
print(truncated_text) # J'aime
Because we use subword tokenizers, the number of generated tokens is greater than the number of words, as the tokenizer may break a word down into multiple tokens. It also automatically adds special tokens. Truncation is often the best way to deal with long sentences; in your example the polarity can typically be inferred from the first word. Also keep in mind that the model was trained on the Allociné dataset, which does not contain many long reviews.
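You can inspect the subword pieces directly; a small sketch (the exact pieces depend on the vocabulary, so the output shown is only indicative):

print(tokenizer.tokenize("J'aime le camembert"))
# e.g. ['▁J', "'", 'aime', '▁le', '▁camembert'] -- more pieces than words;
# encode_plus() then adds special tokens on top, which also count toward max_length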
Thank you so much for your in-depth response. I will try the proposed code and get back to you when I can.
Hi @emiliepicardcantin, did you manage to make it work?
I have quite long texts that I want to label using your module. I run into the same problem over and over again using the nlp pipeline. Here is my code:
I get the following error:
See attachment for more details on the error. Can someone help me? Thank you!
error_nlp_pipeline.txt