documentcloud / docsplit

Break Apart Documents into Images, Text, Pages and PDFs
http://documentcloud.github.com/docsplit/

conversion to PDF mangles non-ASCII characters in docx on Linux #95

Closed bobmyers closed 9 years ago

bobmyers commented 10 years ago

My documentation management app involves converting a .docx file containing non-ASCII Unicode characters (Japanese) to PDF with docsplit (via the Ruby gem, if it matters). It works fine on my Mac. On my Ubuntu machine, the resulting PDF has square boxes where the characters should be, whether invoked through Ruby or directly on the command line. The odd thing is, when I open up the .docx file directly in LibreOffice and do a PDF export, it works fine. So it would seem there is some aspect to how docsplit invokes LO that causes the Unicode characters to be handled improperly. I have scoured various parts of the documentation and code for options that I might need to specify, with no luck. Any ideas of why this could be happening?

seikoudoku2000 commented 10 years ago

I had a similar problem, and this worked for me: some fonts are missing on Ubuntu by default. I'm not sure it's the same problem, though.

$ sudo aptitude install ttf-kochi-gothic ttf-kochi-mincho ttf-sazanami-gothic ttf-sazanami-mincho

nathanstitt commented 9 years ago

The "Kochi" font has been removed from the Ubuntu packages. https://launchpad.net/ubuntu/lucid/i386/ttf-kochi-gothic states that the maintainer recommends using "sazanami" instead, which @seikoudoku2000 already included in the aptitude command.

knowtheory commented 9 years ago

@nathanstitt wanna link to (or just list) the set of fonts that we identified?

nathanstitt commented 9 years ago

Sure! I added the fonts to our build script at https://github.com/documentcloud/documentcloud/commit/05ac44744232845f4ed5d0737ec1148707d26575

The added Ubuntu packages are: ttf-wqy-microhei ttf-wqy-zenhei ttf-kochi-gothic ttf-kochi-mincho fonts-nanum ttf-baekmuk

That combination enables LibreOffice to support Chinese, Japanese, and Korean documents.
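To confirm that a machine actually has fonts for all three languages before running a conversion, a check along these lines may help. This is a minimal sketch, assuming fontconfig's `fc-list` is installed and that the language codes `zh`, `ja`, and `ko` match the installed fonts. The function name `missing_font_langs` and the `FC_LIST` override variable are made up for illustration and are not part of docsplit or LibreOffice.

```shell
#!/bin/sh
# FC_LIST is a hypothetical override so the check can be pointed at another
# command; by default it uses fontconfig's fc-list.
FC_LIST=${FC_LIST:-fc-list}

# Print each language code for which fontconfig reports no installed font.
# Returns non-zero if at least one language has no font.
missing_font_langs() {
  status=0
  for lang in "$@"; do
    # fc-list prints one line per matching font; empty output means none.
    if [ -z "$("$FC_LIST" ":lang=$lang" family)" ]; then
      echo "$lang"
      status=1
    fi
  done
  return "$status"
}

# Only run the check when the font-listing command is actually available.
if command -v "$FC_LIST" >/dev/null 2>&1; then
  missing_font_langs zh ja ko || echo "No fonts found for the languages listed above." >&2
fi
```

If the function prints nothing, fontconfig sees at least one font for each language. That does not guarantee LibreOffice will pick a suitable one, but it rules out the "no font at all" case that produces square boxes.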

I also tried Google's all-in-one Noto font, but it did not enable LibreOffice to support Chinese documents. I installed it both via the Ubuntu fonts-noto package and by downloading the TTF files and installing them manually. I'm not sure why it failed, since my reading of its documentation suggested that it should have worked well.
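One way to narrow this down might be to ask fontconfig whether any installed Noto family advertises Chinese coverage at all. The following is a sketch under assumptions: `fc-list` is available, the family names contain the substring "noto", and the helper `families_matching` and the `FC_LIST` variable are hypothetical names used only for this example.

```shell
#!/bin/sh
# FC_LIST is a hypothetical override; by default it uses fontconfig's fc-list.
FC_LIST=${FC_LIST:-fc-list}

# Print the unique font families that advertise coverage of language $1
# and whose name contains the substring $2 (case-insensitive).
# Returns non-zero when no family matches.
families_matching() {
  matches=$("$FC_LIST" ":lang=$1" family | grep -i -- "$2" | sort -u)
  [ -n "$matches" ] && printf '%s\n' "$matches"
}

# Only run the check when the font-listing command is actually available.
if command -v "$FC_LIST" >/dev/null 2>&1; then
  families_matching zh noto || echo "No Noto family advertises zh coverage." >&2
fi
```

An empty result would suggest that the installed Noto files do not include (or do not declare) Chinese glyphs, which would point at the package contents rather than at LibreOffice or docsplit.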