CAMeL-Lab / CAMeLBERT

Code and models for "The Interplay of Variant, Size, and Task Type in Arabic Pre-trained Language Models". EACL 2021, WANLP.
https://aclanthology.org/2021.wanlp-1.10
MIT License

Same result for different tokenizers #2

Closed. GRIGORR closed this issue 2 years ago.

GRIGORR commented 2 years ago

I have 680K Arabic sentences and tokenized them with the following tokenizers, but they all gave the same result.

I used them like this:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(tokenizer_name, force_download=True)

transformers version - 4.6.1
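
For reference, this is roughly the comparison I ran (a minimal sketch; the checkpoint names below are just examples of CAMeLBERT variants from the Hugging Face Hub, not my full list):

from transformers import AutoTokenizer

# Example CAMeLBERT checkpoints (names as they appear on the CAMeL-Lab Hub page)
tokenizer_names = [
    "CAMeL-Lab/bert-base-arabic-camelbert-mix",
    "CAMeL-Lab/bert-base-arabic-camelbert-msa",
    "CAMeL-Lab/bert-base-arabic-camelbert-da",
]

sentence = "..."  # placeholder for one of my Arabic sentences

# Tokenize the same sentence with each tokenizer and collect the outputs
outputs = {}
for name in tokenizer_names:
    tok = AutoTokenizer.from_pretrained(name, force_download=True)
    outputs[name] = tok.tokenize(sentence)

# Every entry ends up identical
print(all(o == outputs[tokenizer_names[0]] for o in outputs.values()))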

balhafni commented 2 years ago

That's because all of our CAMeLBERT models use the same tokenizer, which was trained on 167GB of mixed Arabic texts (i.e., MSA, DA, and CA) as described in our paper.
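
You can confirm this along these lines (a minimal sketch; the two checkpoint names are assumed from the Hugging Face Hub):

from transformers import AutoTokenizer

# Two different CAMeLBERT variants (MSA and dialectal)
tok_msa = AutoTokenizer.from_pretrained("CAMeL-Lab/bert-base-arabic-camelbert-msa")
tok_da = AutoTokenizer.from_pretrained("CAMeL-Lab/bert-base-arabic-camelbert-da")

# Expected to print True, since all variants ship the same shared vocabulary,
# so any input string tokenizes identically with either tokenizer.
print(tok_msa.get_vocab() == tok_da.get_vocab())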

GRIGORR commented 2 years ago

I thought different models would have different tokenizers. Thanks!