[tika] Polish PDF recognized as Thai

dietervu commented 7 years ago

The attached file novel-pl.pdf is currently marked as Thai. This is very strange, since it clearly does not contain any Thai characters. Can we check whether language recognition works for PDFs?

claus-zinn commented 7 years ago

Hi Dieter,

I will look into the issue. Language detection works for some PDF files but not for others; the same holds for DOCX documents. I may need to convert the input to text before calling Apache Tika. I'll keep you posted.

Best,

Claus

On 17. Jul 2017, at 11:29, dietervu notifications@github.com wrote:

The attached file novel-pl.pdf (https://github.com/clarin-eric/LRSwitchboard/files/1152094/novel-pl.pdf) is currently marked as Thai. This is very strange, since it clearly does not contain any Thai characters. Can we check whether language recognition works for PDFs?


claus-zinn commented 7 years ago

The issue has been resolved. For PDF, RTF, and DOCX files, Apache Tika is used to convert the file to plain text. Language identification is then performed on the converted text.
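
For readers following along, the two-step flow described above (convert to plain text first, then identify the language on that text) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the Switchboard's actual code: `identify_language`, `detect_document_language`, the `STOPWORDS` table, and the `extractors` mapping are all hypothetical names, the stopword counter is a toy stand-in for a real language identifier, and the extractor registered here only handles plain text (in the fix described above, Apache Tika performs the conversion for PDF, RTF, and DOCX).

```python
from typing import Callable, Dict, Set

# Toy stopword lists (assumption for illustration only); a real identifier
# relies on trained language models rather than a handful of words.
STOPWORDS: Dict[str, Set[str]] = {
    "pl": {"i", "w", "nie", "na", "się", "że", "jest", "to", "do"},
    "en": {"the", "and", "of", "to", "in", "is", "that", "it"},
    "de": {"der", "die", "und", "das", "ist", "nicht", "zu", "in"},
}


def identify_language(text: str) -> str:
    """Return the language code whose stopwords occur most often, or 'unknown'."""
    words = text.lower().split()
    scores = {lang: sum(w in stop for w in words) for lang, stop in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def detect_document_language(
    raw: bytes,
    media_type: str,
    extractors: Dict[str, Callable[[bytes], str]],
) -> str:
    """Convert the raw document to plain text first, then identify its language.

    The identifier never sees the raw bytes of a binary format; it only
    sees the text produced by the extractor registered for the media type.
    """
    extract = extractors.get(media_type)
    if extract is None:
        raise ValueError(f"no text extractor registered for {media_type}")
    return identify_language(extract(raw))


# Hypothetical registry: only plain text is handled in this sketch.
extractors = {"text/plain": lambda data: data.decode("utf-8")}

sample = "To nie jest proste, że się nie da".encode("utf-8")
print(detect_document_language(sample, "text/plain", extractors))
```

The point of the design is that the identifier is only ever handed extracted text, so the container format (PDF, RTF, DOCX) cannot skew the result; an unsupported media type fails loudly instead of being guessed at.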

clarin-eric / LRSwitchboard

[tika] Polish PDF recognized as Thai #8