coolwanglu / pdf2htmlEX

Convert PDF to HTML without losing text or format.
http://coolwanglu.github.com/pdf2htmlEX/

Transliteration #611

Open rubi-l opened 8 years ago

rubi-l commented 8 years ago

This is not strictly a pdf2htmlEX issue, but it might be considered a feature request...

I'm working on a large number of .pdf files, many of which are in Serbian, written in Cyrillic script. I need the documents transliterated to Latin script with the same formatting; HTML is a perfectly acceptable output format. For simple .pdf documents, the following works perfectly:

pdf2htmlEX ifile.pdf ofile.html
translit -i FILE -o FILE -t 'ISO/R 9'

But for more complex .pdf files, the transliteration step (done by liblingua-translit-perl) messes up the formatting.
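One way to rule out the markup itself being altered by a post-processing step is to apply the character mapping to text nodes only and re-emit every tag, attribute, and style block unchanged. The following is a minimal sketch using Python's standard html.parser. The names TextOnlyRewriter, rewrite_html_text, and demo_rewrite are hypothetical, the three-letter mapping is only a stand-in for a real table, and nothing in this thread establishes that altered markup is the cause of the problem described above.

```python
from html.parser import HTMLParser


class TextOnlyRewriter(HTMLParser):
    """Re-emit HTML as parsed, applying `rewrite` to text nodes only."""

    def __init__(self, rewrite):
        # Keep entities such as &amp; as-is instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.rewrite = rewrite
        self.out = []
        self.skip_depth = 0  # > 0 while inside <style> or <script>

    def handle_starttag(self, tag, attrs):
        if tag in ("style", "script"):
            self.skip_depth += 1
        # Raw source text of the tag, so attributes are copied untouched.
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in ("style", "script") and self.skip_depth:
            self.skip_depth -= 1
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # CSS and JavaScript bodies are passed through unchanged.
        self.out.append(data if self.skip_depth else self.rewrite(data))

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def rewrite_html_text(html, rewrite):
    parser = TextOnlyRewriter(rewrite)
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)


# Stand-in mapping for illustration only (assumption, not a full table).
DEMO_TABLE = {ord("т"): "t", ord("е"): "e", ord("с"): "s"}


def demo_rewrite(text):
    return text.translate(DEMO_TABLE)


if __name__ == "__main__":
    sample = '<div class="t1 x2">тест</div><style>.т{}</style>'
    print(rewrite_html_text(sample, demo_rewrite))
```

Comparing the output of such a text-only pass with the output of the whole-file pass could show whether anything outside the visible text differs between the two.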

Is there a way to instruct pdf2htmlEX to use a specific Unicode/UTF-8 transliteration table, so the transliteration can be done in one step? In variable-width fonts, Cyrillic and Latin letters differ slightly in width. To make matters a bit more complicated, some Cyrillic letters are replaced by two Latin letters.

Also, is there a way to automatically strip invisible characters and soft hyphens from the text? '--space-as-offset 1' seemed to do the trick in the files I have encountered so far, but I'm not sure if it's a universal solution.

coolwanglu commented 8 years ago

I'm not familiar with how transliteration works. You might want to take a look at HTMLRenderer/text.cc or HTMLTextLine.cc, where the characters are collected and rendered.

And do you know why translit messes up the formatting?

rubi-l commented 8 years ago

On 02/14/2016 04:18 PM, Lu Wang wrote:

> And do you know why translit messes up the formatting?

Can't tell... The text areas are excessively wide after transliteration, and it doesn't seem to have anything to do with the width of the glyphs themselves. I'll try to reproduce the issue with a publicly available .pdf and upload the example.