UDOP - How to change UdopTokenizer to another tokenizer that supports CJK languages?

NielsRogge / Transformers-Tutorials

This repository contains demos I made with the Transformers library by HuggingFace.

MIT License


UDOP - How to change UdopTokenizer to another tokenizer that supports CJK languages? #404

Open pascona opened 6 months ago

pascona commented 6 months ago

Hi @NielsRogge

We are following the Fine_tune_UDOP_on_a_customdataset(toy_RVL_CDIP_dataset).ipynb notebook example. Our OCR text and bounding-box coordinates come from CJK (Chinese, Japanese, Korean) documents. However, it seems that UdopTokenizer does not support CJK. Could you provide a guide or notebook code for switching from UdopTokenizer to LayoutXLMTokenizer?

NguyenHongSon1103 commented 5 months ago

+1, I have the same problem.