Donut-How to use a tokenizer with multiple language support

NielsRogge / Transformers-Tutorials

This repository contains demos I made with the Transformers library by HuggingFace.

MIT License


Donut-How to use a tokenizer with multiple language support #333

Open Theerath opened 11 months ago

Theerath commented 11 months ago

Is it possible to use a different tokenizer with multiple language support in the Donut processor? For example, could the mBART tokenizer be used in the Donut processor instead of XLMRobertaTokenizerFast? @NielsRogge

NielsRogge commented 11 months ago

Hi,

Donut has been pre-trained on 4 languages: English, Chinese, Korean and Japanese. The tokenizer, however, supports 100 different languages, since it is the XLM-RoBERTa tokenizer. If you want to use a different tokenizer, you will have to train a new model from scratch, because the decoder's token embeddings were learned for the current vocabulary.
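
To check how well the existing tokenizer covers text in another language, one option is to measure how many of its tokens fall back to the unknown token. The snippet below is only a rough sketch: `unknown_token_ratio` is a hypothetical helper (not part of Transformers), and it assumes the tokenizer is callable and exposes `unk_token_id`, as Hugging Face tokenizers do. A low ratio only says the vocabulary covers the script; it says nothing about whether the pre-trained model can actually read that language from images.

```python
def unknown_token_ratio(tokenizer, text):
    """Return the fraction of tokens in `text` that map to the unknown token.

    `tokenizer` is assumed to behave like a Hugging Face tokenizer:
    calling it returns a mapping with "input_ids", and it has `unk_token_id`.
    """
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    if not ids:
        return 0.0
    unknown = sum(1 for token_id in ids if token_id == tokenizer.unk_token_id)
    return unknown / len(ids)


# Example usage (requires `transformers` and downloads the checkpoint):
#
#   from transformers import DonutProcessor
#   processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
#   print(type(processor.tokenizer).__name__)
#   print(unknown_token_ratio(processor.tokenizer, "Invoice total: 120"))
```

If the ratio is high for your target language, the existing vocabulary is a poor fit and fine-tuning alone is unlikely to help much.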

Shubham4471 commented 3 months ago

Hi, will Donut be able to extract English and Arabic text at the same time?