Layout-Parser / layout-parser

A Unified Toolkit for Deep Learning Based Document Image Analysis
https://layout-parser.github.io/
Apache License 2.0
4.78k stars, 459 forks

Does it work on other languages, e.g. Chinese or Japanese documents? #37

Closed: Kuldeep-Attri closed this issue 3 years ago

Kuldeep-Attri commented 3 years ago

I am working with some documents written in Japanese or Chinese. Will it work on them? If not, how can we make it work on documents written in other languages?

lolipopshock commented 3 years ago

I would say yes to both of your questions. The answer to the first one depends mostly on what kind of data you're trying to use.

A nice property of image-based layout analysis is that it depends less on the language the model was trained on than on the type of document you apply it to. For example, for scientific documents, you can expect the PubLayNet model to generalize reasonably well to papers in other languages such as Japanese or Chinese, even though it was trained on English papers.
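To make this concrete, here is a minimal sketch of running the pretrained PubLayNet detector from the layoutparser model zoo on a page image, regardless of the language on the page. It assumes `layoutparser`, Detectron2, and OpenCV are installed; the 0.8 score threshold and the helper names `detect_layout` and `count_block_types` are illustrative choices, not part of the library.

```python
from collections import Counter

# Label map for the PubLayNet models in the layoutparser model zoo.
PUBLAYNET_LABELS = {0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"}


def detect_layout(image_path, score_threshold=0.8):
    """Run the pretrained PubLayNet detector on a single page image.

    Hypothetical helper: the threshold is an assumed default, tune it for your data.
    """
    # Imported lazily so count_block_types works without the heavy dependencies.
    import cv2
    import layoutparser as lp

    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(image_path)
    image = image[..., ::-1]  # OpenCV loads BGR; convert to RGB

    model = lp.Detectron2LayoutModel(
        "lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config",
        extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", score_threshold],
        label_map=PUBLAYNET_LABELS,
    )
    return model.detect(image)


def count_block_types(blocks):
    """Count detected blocks per type (Text, Title, List, Table, Figure)."""
    return Counter(getattr(block, "type", None) for block in blocks)
```

Calling `count_block_types(detect_layout("page.png"))` on a few Japanese or Chinese pages (the file name is a placeholder) gives a quick sanity check: if titles, tables, and figures are found where you expect them, the pretrained model may already be good enough for your document type.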

As for training on new documents, yes, it should be straightforward to do so. Please check the layout-model-training repo for more details.
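If you do train on your own documents, you will need annotated pages first. The sketch below builds a COCO-style annotation file, on the assumption that the training scripts consume that format (worth confirming in the layout-model-training README). The function name, the category names, and the file name `page_001.png` are all hypothetical.

```python
import json


def make_coco_dataset(pages, categories):
    """Build a COCO-style dict from simple page descriptions.

    pages: list of dicts with "file_name", "width", "height", and
           "boxes", a list of (label, x, y, w, h) tuples in pixels.
    categories: list of label names, e.g. ["Text", "Title", "Figure"].
    """
    # COCO ids conventionally start at 1.
    cat_ids = {name: i for i, name in enumerate(categories, start=1)}
    coco = {
        "images": [],
        "annotations": [],
        "categories": [{"id": i, "name": name} for name, i in cat_ids.items()],
    }
    ann_id = 1
    for image_id, page in enumerate(pages, start=1):
        coco["images"].append({
            "id": image_id,
            "file_name": page["file_name"],
            "width": page["width"],
            "height": page["height"],
        })
        for label, x, y, w, h in page["boxes"]:
            coco["annotations"].append({
                "id": ann_id,
                "image_id": image_id,
                "category_id": cat_ids[label],
                "bbox": [x, y, w, h],  # COCO boxes are [x, y, width, height]
                "area": w * h,
                "iscrowd": 0,
            })
            ann_id += 1
    return coco


if __name__ == "__main__":
    pages = [{
        "file_name": "page_001.png",  # placeholder file name
        "width": 1240,
        "height": 1754,
        "boxes": [("Title", 100, 80, 1000, 60), ("Text", 100, 180, 1000, 400)],
    }]
    # ensure_ascii=False keeps Japanese or Chinese file names readable.
    print(json.dumps(make_coco_dataset(pages, ["Text", "Title"]), ensure_ascii=False, indent=2))
```

Since the model only sees pixels and boxes, nothing in the annotations is language-specific; the effort goes into labeling enough pages of the document style you care about.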

Kuldeep-Attri commented 3 years ago

@lolipopshock Thank you very much. I was thinking along the same lines, and now I feel much more confident. I will try training it on a different style of document and see the results. Thank you for the link.