CAMeL-Lab / CAMeLBERT

Code and models for "The Interplay of Variant, Size, and Task Type in Arabic Pre-trained Language Models". EACL 2021, WANLP.
https://aclanthology.org/2021.wanlp-1.10
MIT License

Max length problem with bert-base-arabic-camelbert-mix-pos-msa #6


dearden commented 8 months ago

What I'm doing

I'm using CAMeLBERT POS tagging to process Modern Standard Arabic text, and I'm doing so as follows.

from transformers import pipeline

# make the model using pipeline
model = pipeline("token-classification", model="CAMeL-Lab/bert-base-arabic-camelbert-ca")

# run the model on some text
model("SOME ARABIC TEXT")

The problem

When running the model on texts that tokenise to more than 512 tokens, I get the following error.

RuntimeError: The size of tensor a (563) must match the size of tensor b (512) at non-singleton dimension 1

As mentioned in this issue over in CAMeL Tools, it's a known CAMeLBERT problem, and the suggested solution is to load the tokeniser as follows:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-ca', max_length=512)

However, this does not fix the whole pipeline: calling pipeline() with max_length=512 raises an error, because pipeline() does not accept that parameter.

What I've tried

I've tried doing the following...

from transformers import AutoTokenizer, pipeline

tokeniser = AutoTokenizer.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-ca', truncation=True, max_length=512)
model = pipeline("token-classification", model='CAMeL-Lab/bert-base-arabic-camelbert-ca', tokenizer=tokeniser)

but that doesn't work either. There's this warning...

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.

which suggests that the parameter is being ignored even when we specify max_length.
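One thing I've noticed while poking at this (this is my assumption about which attribute matters, based on experimentation rather than anything documented for CAMeLBERT): the tokeniser's model_max_length attribute seems to be what the truncation logic consults, and it's unset for this model. Setting it directly silences the warning for the direct-tokenisation route, though I'm not sure it changes what the pipeline does internally:

from transformers import AutoTokenizer

tokeniser = AutoTokenizer.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-ca')

# With no maximum length in the model's config, transformers falls back to a
# huge sentinel value (int(1e30)), which matches the "no predefined maximum
# length" wording in the warning.
print(tokeniser.model_max_length)

# Setting the attribute directly gives truncation a real cap to work with.
tokeniser.model_max_length = 512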

I've almost got it working by doing the tokenisation and model separately.

from transformers import AutoTokenizer, AutoModelForTokenClassification

tokeniser = AutoTokenizer.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-ca', truncation=True, max_length=512)
model = AutoModelForTokenClassification.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-ca')

text = "SOME ARABIC TEXT"
tokenised = tokeniser(text, return_tensors="pt", max_length=512, truncation=True)
output = model(**tokenised)

But then I get the output as tensors, and I'm not sure how to decode the output into human readable form.
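To be concrete, here's roughly what I've been trying for that last step (argmax over the logits, then mapping each id through the model's label mapping); I'm not confident this is the intended way to decode the output:

# Pick the highest-scoring label id for each subword position.
predictions = output.logits.argmax(dim=-1)[0]

# Map input ids back to tokens and label ids back to tag names.
tokens = tokeniser.convert_ids_to_tokens(tokenised["input_ids"][0])
for token, label_id in zip(tokens, predictions):
    print(token, model.config.id2label[label_id.item()])

This prints a tag for every subword piece, including [CLS] and [SEP], so it still needs the pieces aligned back to whole words.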

Question

Is there a known fix or workaround for this problem? The output from CAMeLBERT is super useful, but there are quite a lot of texts with more than 512 tokens.

Thanks! And apologies if I'm just missing something obvious.

balhafni commented 8 months ago

Hi @dearden! A few points based on what you provided:

1) If you'd like to do POS tagging, we provide 12 fine-tuned models on Hugging Face's model hub here. The fine-tuned models differ across two dimensions: the base CAMeLBERT model (MSA, CA, Mix, DA) and the Arabic POS variant (Gulf, MSA, and Egyptian). You can infer both from the name of the model (e.g., CAMeL-Lab/bert-base-arabic-camelbert-ca-pos-msa). The model you're trying to use in your example above is the CAMeLBERT CA model, which is not fine-tuned to do POS tagging.

2) The warning you're getting appears because Hugging Face's pipeline doesn't have a way to specify the model's max length if it's not set in the model's original config. It only appears once, when the pipeline is created, and shouldn't affect the results:

from transformers import pipeline, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-msa-pos-msa', max_length=512, truncation=True)
pos = pipeline('token-classification', model='CAMeL-Lab/bert-base-arabic-camelbert-msa-pos-msa', tokenizer=tokenizer)

text = 'SOME ARABIC TEXT'
pos(text)

3) If you'd like to use our models for long texts with more than 512 tokens, I highly recommend looking at the code we have for NER in CAMeL Tools and adapting it to your needs; a rough sketch of the idea follows below.
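As a starting point, here is a minimal sketch of the chunking idea. This is my simplification, not the actual CAMeL Tools implementation, and max_words=300 is a rough assumption rather than a guarantee that every chunk stays under 512 subword tokens:

from transformers import pipeline, AutoTokenizer

model_name = 'CAMeL-Lab/bert-base-arabic-camelbert-msa-pos-msa'
tokenizer = AutoTokenizer.from_pretrained(model_name)
pos = pipeline('token-classification', model=model_name, tokenizer=tokenizer)

def tag_long_text(text, max_words=300):
    # Crude whitespace chunking: each chunk is tagged independently, so
    # predictions near chunk boundaries lose some context.
    words = text.split()
    results = []
    for i in range(0, len(words), max_words):
        chunk = ' '.join(words[i:i + max_words])
        results.extend(pos(chunk))
    return results

The NER code in CAMeL Tools handles this more carefully (sentence-aware splitting and mapping predictions back to the original words), which is why I'd still recommend adapting that.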

Hope this helps!

dearden commented 8 months ago

Hi @balhafni! Thanks for the quick response.

  1. Sorry, that's my mistake. I am actually using CAMeL-Lab/bert-base-arabic-camelbert-mix-pos-msa; in my original code the model name is stored in a variable, and I copied the value over from the wrong place.

  2. This is pretty much the code I was running, and it led to this error:

    The size of tensor a (514) must match the size of tensor b (512) at non-singleton dimension 1

    for the input

    " ".join(["هذه ثماني كلمات لطيفة يجب معالجتها."] * 64

    Is that expected? My understanding is that setting max_length=512 and truncation=True should truncate the text and stop that error from happening (I've put a quick check of this at the end of this comment).

  3. I'll give that a look when I get a chance. Hopefully that can solve my problem - I'll get back to you if it doesn't work. Thanks!
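For reference, here's the quick check behind point 2: tokenising the same input directly does respect truncation when the arguments are passed at call time, which is why I expected the pipeline to do the same.

from transformers import AutoTokenizer

tokeniser = AutoTokenizer.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-mix-pos-msa')
text = " ".join(["هذه ثماني كلمات لطيفة يجب معالجتها."] * 64)

# Passing truncation/max_length at call time caps the sequence at 512 tokens.
encoded = tokeniser(text, truncation=True, max_length=512)
print(len(encoded["input_ids"]))  # expected: 512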