That's because all of our CAMeLBERT models use the same tokenizer, which was trained on 167GB of mixed Arabic texts (i.e., MSA, DA, and CA) as described in our paper.
I thought different models would have different tokenizers. Thanks!
I have 680K Arabic sentences, and I used the following tokenizers to tokenize them, but they all gave the same result. I used them as shown below.
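(The original snippet is not preserved here; what follows is a minimal sketch of that kind of comparison, assuming the Hugging Face `AutoTokenizer` API. The CAMeLBERT model IDs are assumptions based on the CAMeL-Lab hub.)

```python
# Minimal sketch: load the tokenizer of several CAMeLBERT variants
# and compare their output on the same sentence.
# The model IDs below are assumptions based on the CAMeL-Lab Hugging Face hub.
from transformers import AutoTokenizer

model_ids = [
    "CAMeL-Lab/bert-base-arabic-camelbert-mix",
    "CAMeL-Lab/bert-base-arabic-camelbert-msa",
    "CAMeL-Lab/bert-base-arabic-camelbert-da",
    "CAMeL-Lab/bert-base-arabic-camelbert-ca",
]

sentence = "مرحبا بالعالم"  # any Arabic example sentence

for model_id in model_ids:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # Every variant prints the same token list, since the models share one tokenizer.
    print(model_id, tokenizer.tokenize(sentence))
```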
transformers version: 4.6.1