coolwanglu / pdf2htmlEX

Convert PDF to HTML without losing text or format.
http://coolwanglu.github.com/pdf2htmlEX/

converted html contains duplicate text #613

Open thaliemuk opened 8 years ago

thaliemuk commented 8 years ago

Hello, I have searched the issues and could not find a similar one.

I have a PDF in which, after conversion, all of the text is duplicated: there are two DIV elements in the DOM that have the same node hierarchy, and all of the text inside them is duplicated. The converted HTML looks good, but searching for a word in the HTML (Ctrl+F) returns two hits, even though only one occurrence is visible in the rendered HTML.

I tried all kinds of flags, but none of them helped: tounicode, fallback, optimize-text, etc.

I have attached the PDF, a screenshot of the DOM after conversion, and a screenshot of the search I did in the browser.

Environment: Windows (this also happens on CentOS 7), Chrome. This is the version I am using:

    pdf2htmlEX version 0.14.6
    Copyright 2012-2015 Lu Wang coolwanglu@gmail.com and other contributors
    Libraries:
      poppler 0.33.0
      libfontforge 20150621
      cairo 1.12.18
    Default data-dir: C:\t2k\cgs-data\utils\pdf2htmlex_dir/data
    Supported image format: png jpg svg

Attachments: duplicatesearch (screenshot), duplicatetext (screenshot), pdfWithDuplicateDiv.pdf
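
For anyone who wants to confirm the same symptom without stepping through the DOM by hand, here is a minimal sketch that scans converted HTML for text DIVs that occur more than once with the same class attribute and the same text. It assumes that a text line's position is encoded in the DIV's class attribute (class names such as `x1` and `y3` in the sample are illustrative), and `find_duplicate_lines` is a hypothetical helper name, not part of pdf2htmlEX.

```python
from collections import Counter
from html.parser import HTMLParser


class TextLineCollector(HTMLParser):
    """Collect (class attribute, text) for every <div> that directly holds text."""

    def __init__(self):
        super().__init__()
        self.stack = []  # one [class_attr, text_parts] entry per open <div>
        self.lines = []  # (class_attr, text) for each closed, non-empty <div>

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.stack.append([dict(attrs).get("class") or "", []])

    def handle_data(self, data):
        # Text (including text inside nested <span>s) goes to the innermost open <div>.
        if self.stack:
            self.stack[-1][1].append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self.stack:
            cls, parts = self.stack.pop()
            text = "".join(parts).strip()
            if text:
                self.lines.append((cls, text))


def find_duplicate_lines(html):
    """Return {(class_attr, text): count} for text DIVs seen more than once.

    Assumption: identical class attribute plus identical text means the same
    text was emitted twice at the same position.
    """
    parser = TextLineCollector()
    parser.feed(html)
    parser.close()
    counts = Counter(parser.lines)
    return {key: n for key, n in counts.items() if n > 1}


if __name__ == "__main__":
    # Illustrative markup only; real converter output will differ.
    sample = """
    <div class="pc">
      <div class="t m0 x1 h2 y3 ff1 fs0">Hello world</div>
      <div class="t m0 x1 h2 y4 ff1 fs0">Second line</div>
      <div class="t m0 x1 h2 y3 ff1 fs0">Hello world</div>
    </div>
    """
    for (cls, text), n in find_duplicate_lines(sample).items():
        print(f"{n}x {text!r} (class={cls!r})")
```

A non-empty result suggests the duplication is already present in the generated markup rather than being a browser search quirk.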

coolwanglu commented 8 years ago

I checked the PDF: the text is rendered twice, at exactly the same position. Maybe we can detect and remove identical text at the same position, as pdftohtml does.
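
The idea of dropping identical text at the same position can be sketched as follows. This is only an illustration under stated assumptions, not pdftohtml's or pdf2htmlEX's actual code: `TextRun`, `drop_overlapping_duplicates`, and the `tolerance` parameter are hypothetical names, and a real implementation would work on the renderer's own text state rather than on plain tuples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextRun:
    """A piece of text and the position where it is drawn (hypothetical type)."""
    text: str
    x: float
    y: float


def drop_overlapping_duplicates(runs, tolerance=0.5):
    """Keep the first of any runs that have the same text at (nearly) the same position.

    `tolerance` is an assumed threshold in page units; two runs count as
    duplicates when their text is equal and both coordinates differ by at
    most `tolerance`. Order of the remaining runs is preserved.
    """
    kept = []
    for run in runs:
        is_duplicate = any(
            run.text == other.text
            and abs(run.x - other.x) <= tolerance
            and abs(run.y - other.y) <= tolerance
            for other in kept
        )
        if not is_duplicate:
            kept.append(run)
    return kept


if __name__ == "__main__":
    runs = [
        TextRun("Hello", 10.0, 20.0),
        TextRun("Hello", 10.2, 20.1),  # same text, same place: dropped
        TextRun("Hello", 10.0, 40.0),  # same text, different line: kept
        TextRun("World", 10.0, 20.0),  # different text, same place: kept
    ]
    for run in drop_overlapping_duplicates(runs):
        print(run)
```

The pairwise comparison is quadratic in the number of runs, which keeps the sketch short; bucketing runs by rounded coordinates would be one way to scale it to large pages.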

asinning commented 8 years ago

Coolwanglu: How did you determine that the text is rendered twice?

I am having this same problem, but I do not believe that the text is rendered twice in the pdf. I have taken my original pdf and stripped it down to a single page containing a single text-field (using Acrobat).

Command used:

    pdf2htmlEX --embed cfijo --split-pages 1

Version:

    pdf2htmlEX version 0.14.6
    Copyright 2012-2015 Lu Wang coolwanglu@gmail.com and other contributors
    Libraries:
      poppler 0.33.0
      libfontforge 20151218
      cairo 1.13.1
    Default data-dir: /usr/local/share/pdf2htmlEX
    Supported image format: png jpg svg

I've attached the PDF and the resulting output: pdf2HtmlEx-duplicate-text.zip

Thanks!

asinning commented 8 years ago

Update: There is definitely duplicate text in the original PDF. In Acrobat, I right-clicked the text field and selected Edit in Illustrator. In Illustrator I can see that each line of text is duplicated.

I would love to see duplicate text removed a la pdftohtml.