Request: Dataset and pretrained model for language detection

clovaai / donut

Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022

https://arxiv.org/abs/2111.15664

MIT License


Request: Dataset and pretrained model for language detection #286

Open turian opened 5 months ago

turian commented 5 months ago

MOTIVATION

Detecting a document's language from an image is relatively difficult. Adobe and ABBYY OCR both require that you already know the language of the document before you start OCR.

REQUEST

Please use your document generator (SynthDoG) to generate documents in different languages.
Ideally, you would even mix different languages.
Release a pretrained model that estimates the percentage of each language in a particular document image.
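
To make the requested output concrete, here is a minimal sketch of how a per-language percentage label could be derived for a synthetic page. It assumes the generator knows the language of every text segment it places on the page. The function name `language_proportions` and the `(language, text)` segment format are hypothetical and are not part of SynthDoG or Donut. Counting by non-whitespace characters is only one possible definition; counting by words or by page area would be equally reasonable.

```python
from collections import Counter


def language_proportions(segments):
    """Return the share of each language on one synthetic page.

    `segments` is a hypothetical list of (language_code, text) pairs,
    one pair per text block the generator placed on the page.
    """
    counts = Counter()
    for lang, text in segments:
        # Count non-whitespace characters so that spacing and line
        # breaks do not inflate the share of any language.
        counts[lang] += sum(1 for ch in text if not ch.isspace())

    total = sum(counts.values())
    if total == 0:
        # A page with no text has no language label.
        return {}
    return {lang: n / total for lang, n in counts.items()}


if __name__ == "__main__":
    page = [("en", "hello world"), ("ko", "안녕하세요")]
    for lang, share in language_proportions(page).items():
        print(f"{lang}: {share:.1%}")
```

A dictionary like this could serve as the regression target for the pretrained model described in the request, with one output per supported language.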