Hi @FDSRashid,
This is an issue for CAMeLBERT. Can you please post the issue there? @balhafni will take a look.
Hi @FDSRashid,
This is an issue in the way the configs were created for the CAMeLBERT models. We recommend always specifying the max_length, which has a maximum value of 512, whenever you use a CAMeLBERT tokenizer:
```python
tokenizer = AutoTokenizer.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-ca', max_length=512)
```
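For completeness, here is a minimal end-to-end sketch of that workaround. Note it passes model_max_length (the attribute inspected in the report below) rather than max_length, and the sample text is purely illustrative; the key point is that truncation=True at call time clips over-length inputs instead of letting the 512-position embedding layer raise a size mismatch:

```python
from transformers import AutoTokenizer, AutoModel

# Explicitly cap the tokenizer at 512 tokens, since the released
# config leaves the limit unset.
tokenizer = AutoTokenizer.from_pretrained(
    'CAMeL-Lab/bert-base-arabic-camelbert-ca', model_max_length=512
)
model = AutoModel.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-ca')

text = 'نص عربي طويل ' * 1000  # stand-in for a long Classical Arabic passage

# truncation=True clips the input to model_max_length instead of
# crashing at the embedding layer.
inputs = tokenizer(text, truncation=True, return_tensors='pt')
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # torch.Size([1, 512, 768])
```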
Describe the bug
The Hugging Face configs for the CAMeLBERT models do not specify a max length, so longer tokenized sequences fail to process. I am also unsure whether to use the CAMeLBERT models at all, since I saw they were last updated over two years ago. I want to get encoded sentences from Classical Arabic texts, so if there are any models within camel_tools that are trained on Classical Arabic, that would be wonderful.
To Reproduce
I used the AutoTokenizer.from_pretrained command:

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-ca')
model = AutoModel.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-ca')
```
Checking tokenizer.model_max_length returned an extremely large number, so I couldn't chunk sequences based on the model's max length. Looking at the error message, it mentioned the tensor needing to be 512 in length; that's when I noticed the model maximum wasn't set.
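As a quick check, the missing limit is visible directly on the tokenizer; with no model_max_length in the config, transformers falls back to a huge sentinel value (int(1e30)) rather than the model's real 512-position cap:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-ca')
print(tokenizer.model_max_length)  # e.g. 1000000000000000019884624838656
```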
Expected behavior
Since the error message expects token lengths of 512, I would expect the model max length to be set to 512. However, if there's a newer model trained on Classical Arabic that is used in camel_tools, I apologize; I didn't find it. I just want to encode sentences from Classical Arabic texts.
Screenshots
No screenshots, unfortunately; I fixed the error by manually setting a max length in another variable (sketched below). However, this was the text of the error message:

```
RuntimeError: The size of tensor a (5338) must match the size of tensor b (512) at non-singleton dimension 1
```
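For anyone hitting the same RuntimeError, here is a minimal sketch of that manual workaround; MAX_LEN and encode_in_chunks are illustrative names, not part of camel_tools or transformers. The idea is to window the token IDs yourself so that each chunk, plus the [CLS]/[SEP] specials, stays within 512 positions:

```python
import torch
from transformers import AutoTokenizer, AutoModel

MAX_LEN = 512  # set manually, since the released config does not provide it

tokenizer = AutoTokenizer.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-ca')
model = AutoModel.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-ca')

def encode_in_chunks(text):
    # Tokenize without special tokens, then window the IDs so each
    # chunk plus [CLS]/[SEP] fits within MAX_LEN positions.
    ids = tokenizer(text, add_special_tokens=False)['input_ids']
    window = MAX_LEN - 2
    encoded = []
    for start in range(0, len(ids), window):
        chunk = [tokenizer.cls_token_id] + ids[start:start + window] + [tokenizer.sep_token_id]
        with torch.no_grad():
            out = model(torch.tensor([chunk]))
        encoded.append(out.last_hidden_state)
    return encoded
```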
Desktop (please complete the following information):
Working from Google Colab
Additional context
None, but if there's an updated pretrained model in camel_tools that is trained on Classical Arabic, that would be amazing.