Open thaliemuk opened 8 years ago
I checked the PDF: the text is rendered twice, at exactly the same position.
Maybe we can detect and remove identical text at the same position, as pdftohtml
does.
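As a rough illustration of the idea above (not how pdftohtml or pdf2htmlEX actually implement it), duplicate detection could compare each text item's string and position against the items already kept. The names TextItem, drop_overlapping_duplicates, and the tolerance value are hypothetical and chosen only for this sketch.

```python
from typing import List, NamedTuple


class TextItem(NamedTuple):
    """Hypothetical stand-in for one piece of positioned text on a page."""
    text: str
    x: float
    y: float


def drop_overlapping_duplicates(items: List[TextItem], tol: float = 0.5) -> List[TextItem]:
    """Keep the first of any items that share the same text and
    whose x and y coordinates each differ by at most `tol`."""
    kept: List[TextItem] = []
    for item in items:
        is_duplicate = any(
            item.text == k.text
            and abs(item.x - k.x) <= tol
            and abs(item.y - k.y) <= tol
            for k in kept
        )
        if not is_duplicate:
            kept.append(item)
    return kept
```

The pairwise comparison is quadratic in the number of items, which is probably acceptable per page; a real implementation would also need to decide whether font and size must match.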
Coolwanglu: How did you determine that the text is rendered twice?
I am having this same problem, but I do not believe that the text is rendered twice in the pdf. I have taken my original pdf and stripped it down to a single page containing a single text-field (using Acrobat).
Using: pdf2htmlEX --embed cfijo --split-pages 1
with:
pdf2htmlEX version 0.14.6
Copyright 2012-2015 Lu Wang coolwanglu@gmail.com and other contributors
Libraries:
  poppler 0.33.0
  libfontforge 20151218
  cairo 1.13.1
Default data-dir: /usr/local/share/pdf2htmlEX
Supported image format: png jpg svg
I've attached the PDF and the resulting output: pdf2HtmlEx-duplicate-text.zip
Thanks!
Update: There is definitely duplicate text in the original pdf. From Acrobat, I right-clicked the text field and selected Edit in Illustrator. In Ai I can see that each line of text is duplicated.
I would love to see duplicate text removed a la pdftohtml.
Hello, I have searched the issues and could not find a similar one:
I have a PDF in which, after conversion, all of the text is duplicated: there are two DIV elements in the DOM that have the same node hierarchy, and all of the text inside them is duplicated. The converted HTML looks good, but any attempt to search for a word in the HTML (Ctrl+F) results in two hits, even though there is only one visible occurrence in the rendered HTML.
I tried all kinds of flags, but none of them helped: tounicode, fallback, optimize-text, etc.
I have attached the PDF and a screenshot of the DOM after conversion and of the search I did in the browser.
Using Windows (it also happens on CentOS 7) and Chrome. This is the version I am using:
pdf2htmlEX version 0.14.6
Copyright 2012-2015 Lu Wang coolwanglu@gmail.com and other contributors
Libraries:
  poppler 0.33.0
  libfontforge 20150621
  cairo 1.12.18
Default data-dir: C:\t2k\cgs-data\utils\pdf2htmlex_dir/data
Supported image format: png jpg svg
pdfWithDuplicateDiv.pdf
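To check whether a converted file shows the symptom described above, a small script could look for consecutive text DIVs with identical attributes and content. This is only a sketch: it assumes that text lines are DIV elements whose class list contains the token "t" and that the remaining class tokens encode position, which should be verified against the actual output. TextLineCollector and find_adjacent_duplicates are hypothetical names.

```python
from html.parser import HTMLParser


class TextLineCollector(HTMLParser):
    """Collect (class attribute, text) for every div assumed to be a text line."""

    def __init__(self):
        super().__init__()
        self.lines = []    # one (class attribute, text) tuple per captured div
        self._depth = 0    # greater than 0 while inside a captured div
        self._cls = ""
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth:
            # Nested div inside a captured one: only track the depth.
            self._depth += 1
            return
        cls = dict(attrs).get("class") or ""
        # Assumption: a "t" class token marks a text-line div.
        if "t" in cls.split():
            self._depth, self._cls, self._buf = 1, cls, []

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.lines.append((self._cls, "".join(self._buf)))

    def handle_data(self, data):
        if self._depth:
            self._buf.append(data)


def find_adjacent_duplicates(html):
    """Return indexes of text lines identical to the line just before them."""
    parser = TextLineCollector()
    parser.feed(html)
    parser.close()
    lines = parser.lines
    return [i for i in range(1, len(lines)) if lines[i] == lines[i - 1]]
```

A non-empty result from find_adjacent_duplicates would suggest that the duplication is already present in the generated markup rather than being introduced by the browser.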