documentcloud / docsplit

Break Apart Documents into Images, Text, Pages and PDFs
http://documentcloud.github.io/docsplit/

Corrupted pdf file from Chinese docx #122

Closed intellisense closed 9 years ago

intellisense commented 9 years ago

This Chinese.docx was converted to Chinese.pdf, and the resulting PDF is completely corrupted: it does not show the Chinese characters at all, only square boxes.

Environment details:

Ubuntu 14.01
docsplit 0.7.5
tesseract 3.03
LibreOffice 4.2.6.3 420m0(Build:3)

nathanstitt commented 9 years ago

As you've noted, Docsplit uses LibreOffice internally to convert non-PDF documents to PDF. The square boxes are a symptom of LibreOffice not finding an installed font that contains glyphs for the Chinese characters.

You can verify whether that's the issue by opening the documents in LibreOffice directly and checking whether they display properly.
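If the machine is headless and opening LibreOffice by hand is awkward, a rough alternative check is to ask fontconfig which installed fonts declare Chinese coverage. This is only a sketch: it assumes LibreOffice discovers fonts through fontconfig (as it normally does on Linux) and that the `fc-list` tool is installed. The helper names `parse_families` and `chinese_font_families` are made up for illustration and are not part of Docsplit.

```python
import shutil
import subprocess


def parse_families(fc_list_output: str) -> list[str]:
    """Return sorted, de-duplicated family names from `fc-list ... family` output."""
    families = set()
    for line in fc_list_output.splitlines():
        # A single line may list several comma-separated names for one font.
        for name in line.split(","):
            name = name.strip()
            if name:
                families.add(name)
    return sorted(families)


def chinese_font_families() -> list[str]:
    """Ask fontconfig for installed fonts that declare Chinese (zh) coverage."""
    if shutil.which("fc-list") is None:
        raise RuntimeError("fc-list not found; is fontconfig installed?")
    result = subprocess.run(
        ["fc-list", ":lang=zh", "family"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_families(result.stdout)
```

Calling `chinese_font_families()` and getting an empty list back would be consistent with the square boxes described above; a non-empty list suggests the cause may lie elsewhere.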

We've had a related issue, https://github.com/documentcloud/docsplit/issues/95, that was solved by installing the following Ubuntu packages: ttf-wqy-microhei ttf-wqy-zenhei ttf-kochi-gothic ttf-kochi-mincho fonts-nanum ttf-baekmuk
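For anyone scripting their server setup, here is a minimal sketch that assembles the install command for the packages listed above. The package names are copied verbatim from this thread; newer Ubuntu releases may have renamed some of them, so treat the list as an assumption to check against your release. The `install_command` helper is hypothetical, and the script only prints the command rather than running it.

```python
import shlex

# Package names exactly as quoted in this thread (Ubuntu 14.x era);
# they may differ on newer releases.
FONT_PACKAGES = [
    "ttf-wqy-microhei",
    "ttf-wqy-zenhei",
    "ttf-kochi-gothic",
    "ttf-kochi-mincho",
    "fonts-nanum",
    "ttf-baekmuk",
]


def install_command(packages=FONT_PACKAGES) -> list[str]:
    """Build the apt-get argument list for installing the given font packages."""
    return ["sudo", "apt-get", "install", "-y", *packages]


# Print the command so it can be reviewed before being run by hand.
print(shlex.join(install_command()))
```

After installing, re-running the Docsplit conversion on the same .docx should show whether the boxes are gone.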

intellisense commented 9 years ago

Thanks, this solved the problem. Sorry I didn't find issue #95 earlier.