coolwanglu / pdf2htmlEX

Convert PDF to HTML without losing text or format.
http://coolwanglu.github.com/pdf2htmlEX/

Colored boxes in background image and font doesn't fit sometimes #532

Open JensHH opened 9 years ago

JensHH commented 9 years ago

When I convert my PDF, I sometimes cannot see the text because the background image contains a box of the same color: http://liedtke.it/pdf2htmlEX/out2/AM.html. When I convert it with mediafire.com http://www4.mediafire.com/conversion_server.php?9745&quickkey=2uge979qtidxawm&output=html&doc_type=d&metadata=0&page=131&initial=0&timestamp=1432986621&version=113354&domain=mediafire.com the text is in the background image, sometimes with "real" text and sometimes without. Does this depend on the version I use or on a parameter? Is there a way to get a clean background image? In my case the text itself is OK.

There is a second problem with the last page. The PDF has 132 pages. If I convert just the last 2 pages, the last page is OK. If I convert the last 3 pages, the font does not fit correctly.

2 pages: http://liedtke.it/pdf2htmlEX/out2/AM.html
3 pages: http://liedtke.it/pdf2htmlEX/out3/AM.html

You can find the original PDF here: http://liedtke.it/pdf2htmlEX/AM.pdf

I convert with:

pdf2htmlEX.exe -f 130 -l 132 --fit-width 600 AM.pdf AM3.htm --bg-format jpg

I am using a Windows version from http://soft.rubypdf.com/software/pdf2htmlex-windows-verion

pdf2htmlEX version 0.12
Copyright 2012-2014 Lu Wang
Libraries:
  poppler 0.26.3
  libfontforge 20140516
  cairo 1.12.14

coolwanglu commented 9 years ago

Could you try the latest git commit? It might have been fixed since 0.12.

JensHH commented 9 years ago

I would really like to try it. My problem is that I am on Windows and don't know how to compile the source or which libraries I need. Sorry, the last time I did this was 20 years ago (yes, I read the chapter about building). I also have some servers, but I am not sure whether my providers let me do this or what kind of *nix they use, and I have no knowledge of Linux itself. I have SSH access, I can install GNU C on Windows, and I know PHP and MySQL, so I'm not a beginner. I would really like to support this project because it is amazing, and it gives us the chance to work without PDF, which would be huge progress in website programming, which is my business. I just need a bit of help to get started.

coolwanglu commented 9 years ago

@JensHH I see. I'll try to reproduce it on my machine.

JensHH commented 9 years ago

When you try it, could you please test whether you can convert all 132 pages of my PDF at once? The Windows version crashes without any message. I can convert the first 131 pages and the last 3, but not all 132 or the last 4.
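
To narrow down which page ranges trigger the crash, something like the following Python sketch could help. It only uses the `-f` and `-l` options from the command above. The helper names (`build_command`, `suffix_ranges`, `find_failing_ranges`) and the output file naming are made up for illustration, and it assumes that a crash shows up as a non-zero exit code, which may not hold for every build.

```python
import subprocess


def build_command(pdf, out, first, last, exe="pdf2htmlEX"):
    """Return the argument list for converting pages first..last (inclusive)."""
    return [exe, "-f", str(first), "-l", str(last), pdf, out]


def suffix_ranges(total_pages, max_len):
    """Yield (first, last) for the last 1, 2, ..., max_len pages."""
    for n in range(1, max_len + 1):
        yield (total_pages - n + 1, total_pages)


def find_failing_ranges(pdf, total_pages, max_len, run=subprocess.run):
    """Convert each suffix range and collect those with a non-zero exit code.

    Assumption: a crash is reported through the process exit code.
    """
    failing = []
    for first, last in suffix_ranges(total_pages, max_len):
        out = f"pages_{first}_{last}.html"  # hypothetical output name
        result = run(build_command(pdf, out, first, last))
        if result.returncode != 0:
            failing.append((first, last))
    return failing


if __name__ == "__main__":
    # Tries the last 1 to 4 pages of the 132-page PDF from this report.
    print(find_failing_ranges("AM.pdf", total_pages=132, max_len=4))
```

On Windows, `exe="pdf2htmlEX.exe"` can be passed to `build_command` if the executable is not found under the shorter name.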

JensHH commented 9 years ago

A friend compiled the program for a Raspberry Pi, and it worked without crashing. Instead of the blocks there is now text (sometimes only a few letters), sometimes in the background. Is it possible to get only the background image without text?

duanyao commented 9 years ago

Try --correct-text-visibility 1, and read the man page for more info.
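
As a rough sketch, the command from the original report with this option added could be assembled as below. Only options already mentioned in this thread are used. The function name `conversion_args` and its defaults are hypothetical, and whether the option removes the colored boxes for this particular PDF is untested here.

```python
import shlex


def conversion_args(pdf, out, first, last, fit_width=600, bg_format="jpg",
                    correct_text_visibility=True, exe="pdf2htmlEX"):
    """Build the pdf2htmlEX argument list for a page range.

    The options mirror the command posted above; --correct-text-visibility 1
    is appended when correct_text_visibility is True.
    """
    args = [
        exe,
        "-f", str(first),
        "-l", str(last),
        "--fit-width", str(fit_width),
        "--bg-format", bg_format,
    ]
    if correct_text_visibility:
        args += ["--correct-text-visibility", "1"]
    args += [pdf, out]
    return args


if __name__ == "__main__":
    # Print the command line for the last three pages of the reported PDF.
    print(shlex.join(conversion_args("AM.pdf", "AM3.htm", 130, 132)))
```

The resulting list can be passed directly to `subprocess.run` so that no shell quoting is needed.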