eikek / docspell

Assist in organizing your piles of documents, resulting from scanners, e-mails and other sources with minimal effort.
https://docspell.org
GNU Affero General Public License v3.0
1.51k stars, 116 forks

Please consider adding Mandarin language #2030

Open iszhi opened 1 year ago

iszhi commented 1 year ago

I also have a lot of documents written in Mandarin. Can you add this too?

eikek commented 1 year ago

I'm not against it at all, but it isn't really doable for me, since I have zero knowledge of Mandarin. As far as I know the NLP processors don't support it, but tesseract (the tool doing the OCR) supports traditional and simplified Chinese. I don't know if that would help?

For date recognition I would need a PR or at the very least all the info from here

iszhi commented 1 year ago

@eikek Since the NLP processors don't support Mandarin, can you add it via tesseract? (P.S. I'm not really familiar with either NLP or tesseract.)

eikek commented 1 year ago

I think tesseract has support for simplified and traditional Chinese. Which one is better? It is possible to add it to the docker image and add a language option to the UI.

iszhi commented 1 year ago

Simplified Chinese is used in mainland China, and traditional Chinese is used in Taiwan and Hong Kong. Simplified Chinese has the larger user base, but if possible, I recommend installing both.
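
For readers who want to try both models outside of docspell, here is a minimal sketch of a tesseract call that uses them together. It assumes the `chi_sim` and `chi_tra` language data are installed alongside tesseract. The helper names `build_tesseract_cmd` and `run_ocr` and the file names are made up for illustration; this is not how docspell itself invokes tesseract.

```python
import shutil
import subprocess

# Tesseract's language codes for simplified and traditional Chinese.
CHINESE_LANGS = ("chi_sim", "chi_tra")


def build_tesseract_cmd(image_path, output_base, langs=CHINESE_LANGS):
    """Return the argument list for a tesseract run with the given languages.

    Hypothetical helper for illustration only.
    """
    if not langs:
        raise ValueError("at least one language code is required")
    # Several language models can be combined with '+' in one -l option.
    return ["tesseract", image_path, output_base, "-l", "+".join(langs)]


def run_ocr(image_path, output_base, langs=CHINESE_LANGS):
    """Run tesseract if it is on PATH; the text lands in <output_base>.txt."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract is not installed or not on PATH")
    subprocess.run(build_tesseract_cmd(image_path, output_base, langs), check=True)
```

Calling `run_ocr("scan.png", "scan")` would then try both models on the same page. Passing `langs=("chi_sim",)` restricts it to simplified Chinese.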

kxu1988 commented 4 months ago

Stanford CoreNLP supports (mainland) Chinese.

Stanford CoreNLP: An integrated suite of natural language processing tools for English, Spanish, and (mainland) Chinese in Java, including tokenization, part-of-speech tagging, named entity recognition, parsing, and coreference.