clarin-eric / LRSwitchboard

DEPRECATED - Please see https://github.com/clarin-eric/switchboard for latest version - Code Repository for the Language Resources Switchboard of CLARIN

[tika] Polish PDF recognized as Thai #8

Closed dietervu closed 7 years ago

dietervu commented 7 years ago

The attached file novel-pl.pdf is currently marked as Thai. This is very strange, since it clearly does not contain any Thai characters. Can we check whether language recognition works for PDFs?

claus-zinn commented 7 years ago

Hi Dieter,

I will look into the issue. For some PDF files the language detection works, for others it doesn’t. The same goes for DOCX documents. I may need to convert the input to text first before calling Apache Tika. I'll keep you posted.

Best,

Claus

On 17. Jul 2017, at 11:29, dietervu notifications@github.com wrote:

The attached file novel-pl.pdf (https://github.com/clarin-eric/LRSwitchboard/files/1152094/novel-pl.pdf) is currently marked as Thai. Very strange since it clearly does not contain any Thai characters. Can we check if language recognition works for PDFs?

Claus Zinn claus.zinn@uni-tuebingen.de

claus-zinn commented 7 years ago

The issue has been resolved. For PDF, RTF and DOCX files, Apache Tika is now used to convert the file to plain text first. Language identification is then performed on the converted text.
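For anyone reading this later, the convert-then-detect order described in this thread can be sketched roughly as follows. This is only an illustration, not the Switchboard's actual code: `identify_language`, `CONVERT_FIRST`, and the two callables `extract_text` and `detect_language` are hypothetical names, and the callables stand in for whatever Apache Tika calls the deployment really uses for text extraction and language detection.

```python
from pathlib import Path
from typing import Callable

# Formats converted to plain text before detection (the three named in this thread).
CONVERT_FIRST = {".pdf", ".rtf", ".docx"}


def identify_language(
    path: str,
    data: bytes,
    extract_text: Callable[[bytes], str],    # hypothetical: wraps a Tika text-extraction call
    detect_language: Callable[[str], str],   # hypothetical: wraps a Tika language-detection call
) -> str:
    """Return a language code, converting binary document formats to text first."""
    suffix = Path(path).suffix.lower()
    if suffix in CONVERT_FIRST:
        # Detect on the extracted text, never on the raw PDF/RTF/DOCX bytes.
        text = extract_text(data)
    else:
        # Assumption: every other upload is treated as UTF-8 text.
        text = data.decode("utf-8", errors="replace")
    return detect_language(text)
```

The point of the ordering is that the detector only ever sees readable text. Passing the extractor and detector in as parameters is just a convenience here, so the ordering can be tried out with stand-in functions and without a running Tika instance.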