CAMeL-Lab / camel_tools

A suite of Arabic natural language processing tools developed by the CAMeL Lab at New York University Abu Dhabi.
MIT License

[BUG] Maximum Sequence Limit not set on Camel-bert Model #123

Closed: FDSRashid closed this issue 1 year ago

FDSRashid commented 1 year ago

Describe the bug
The CAMeLBERT models on Hugging Face do not specify a maximum sequence length, which causes longer tokenized sentences to fail. I am also unsure whether to use the CAMeLBERT models at all, since I saw they were last updated over two years ago. I want to encode sentences from Classical Arabic texts, so if there are any models within camel_tools trained on Classical Arabic, that would be wonderful.

To Reproduce
I loaded the tokenizer and model with the from_pretrained commands:

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-ca')
model = AutoModel.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-ca')

Checking tokenizer.model_max_length returned an extremely large number, so I couldn't chunk sequences based on the model max length. The error message mentioned that the tensor needed to be 512 in length; that's when I noticed the model maximum wasn't set.
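For reference, a minimal sketch of that check (the exact sentinel value printed depends on the transformers version):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-ca')

# With no maximum length in the config, transformers falls back to a very
# large sentinel (on the order of 1e30) instead of the model's real
# 512-position limit.
print(tokenizer.model_max_length)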

Expected behavior
Since the error message expects token lengths of 512, I would expect model_max_length to be set to 512. However, if there is a newer model trained on Classical Arabic that is used in camel_tools, I apologize; I didn't find it. I just want to encode sentences from Classical Arabic texts.

Screenshots
No screenshots, unfortunately; I worked around the error by setting a max length manually in a separate variable.

However, this was the text of the error message:

RuntimeError: The size of tensor a (5338) must match the size of tensor b (512) at non-singleton dimension 1
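A minimal sketch of the workaround described above, assuming a hard-coded limit in a separate variable; long_text is a hypothetical placeholder for a passage longer than 512 tokens:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-ca')

MAX_LEN = 512  # the limit implied by the error message, set manually
long_text = '...'  # placeholder for a long Classical Arabic passage

# Tokenize without special tokens, split into windows that leave room for
# [CLS] and [SEP], then rebuild each window as a model-ready input.
ids = tokenizer(long_text, add_special_tokens=False)['input_ids']
window = MAX_LEN - 2
chunks = [ids[i:i + window] for i in range(0, len(ids), window)]
encoded_chunks = [tokenizer.prepare_for_model(c, return_tensors='pt') for c in chunks]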

Desktop (please complete the following information): Working from Google Colab

Additional context
None, but if there is an updated pretrained model in camel_tools that is trained on Classical Arabic, that would be amazing.

owo commented 1 year ago

Hi @FDSRashid ,

This is an issue for CAMeLBERT. Can you please post the issue on the CAMeLBERT repository? @balhafni will take a look there.

balhafni commented 1 year ago

Hi @FDSRashid,

This is an issue in the way the configs were created for the CAMeLBERT models. We recommend always specifying the max_length, which has a maximum value of 512, whenever you use a CAMeLBERT tokenizer:

tokenizer = AutoTokenizer.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-ca', max_length=512)
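Putting that recommendation together with truncation at encode time, a sketch of how the reporter's use case might look (long_text is a hypothetical placeholder):

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-ca', max_length=512)
model = AutoModel.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-ca')

long_text = '...'  # placeholder for a Classical Arabic passage

# truncation=True with max_length=512 caps each input at 512 tokens, so
# the input tensor can never exceed the model's 512 positions.
inputs = tokenizer(long_text, truncation=True, max_length=512, return_tensors='pt')
outputs = model(**inputs)
sentence_embedding = outputs.last_hidden_state.mean(dim=1)  # one vector per input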