Nougat OCR does not recognize all UTF-8 Encoding characters

facebookresearch / nougat

Implementation of Nougat Neural Optical Understanding for Academic Documents

https://facebookresearch.github.io/nougat/

MIT License


Nougat OCR does not recognize all UTF-8 Encoding characters #139

Closed: nekiee13 closed this issue 11 months ago

nekiee13 commented 12 months ago

Nougat OCR does not recognize the characters Š, š, Č, č, Ž, and ž, all of which can be encoded in UTF-8. Is there an extra argument I should pass, either to select a specific language such as Slovenian or Croatian, or to enforce UTF-8 output?

lukas-blecher commented 11 months ago

This is because Nougat cannot understand languages other than English, owing to the data it was trained on. Sorry about that.
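
For readers who want to confirm that the encoding itself is not the problem, here is a minimal sketch using only the Python standard library. It shows that each character from the report is an ordinary Unicode letter with a two-byte UTF-8 encoding, which fits the explanation that the limitation comes from the training data rather than from UTF-8. The names `CHARS`, `describe`, and `non_ascii_letters` are hypothetical helpers made up for this illustration; they are not part of Nougat.

```python
import unicodedata

# Characters reported in the issue as unrecognized.
CHARS = "ŠšČčŽž"


def describe(ch):
    """Return (code point, Unicode name, UTF-8 bytes) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), ch.encode("utf-8"))


def non_ascii_letters(text):
    """Hypothetical helper: collect the non-ASCII letters found in a text.

    Could be run on a document's expected text to get a rough idea of how
    much of it falls outside plain English letters before trying OCR.
    """
    return {ch for ch in text if ch.isalpha() and not ch.isascii()}


if __name__ == "__main__":
    # Every character has a valid code point and a UTF-8 byte sequence.
    for ch in CHARS:
        print(*describe(ch))

    # Example: which letters in a short Croatian phrase are non-ASCII?
    print(sorted(non_ascii_letters("Čaša vode")))
```

Since all six characters encode cleanly, forcing UTF-8 would presumably not change the result; the thread does not mention any language or encoding argument.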