coolwanglu / pdf2htmlEX

Convert PDF to HTML without losing text or format.
http://coolwanglu.github.com/pdf2htmlEX/

Separate characters converted to single word with huge letter-spacing #453

Open dbdr opened 9 years ago

dbdr commented 9 years ago

Using v0.12 built from git, testcase: https://shared.chemaxon.com/users/dbonniot/pdf2htmlEx/2014_Mahalingam_eIF4_inhibitors_p7.pdf

(There is a visual artifact on line 6 of the table in Chromium, but it looks suspiciously like #416 and is not the issue here. It renders correctly in Firefox.)

The issue is with the elements generated for line 6 of the table (bottom half of the page), in particular columns 3 and 4 ("O" and "Cl"). The generated HTML is:

<span class="lse">
OC[...]
</span>

with CSS:

.lse {
    letter-spacing: 162.078px;
}

This renders correctly, but it is very surprising to have "OC" treated as a single word, with only the letter-spacing separating the letters visually. In particular, selection does not work as expected: double-clicking ("select word") on O selects OC instead of just O. It is also semantically wrong: given the large space between them, the letters clearly belong to separate words.

A further consequence is that the other elements in that span need large negative margins, such as margin-left: -162.078px, presumably to counterbalance this effect. This seems counterproductive for file size, and it may be what triggers #416 in this case.

What about setting a reasonable upper limit on letter-spacing, and using separate elements when the spacing exceeds that limit?
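To make the suggestion concrete, here is a rough sketch in Python of the splitting rule. It is not pdf2htmlEX code. The `Glyph` type, the `split_runs` and `to_html` helpers, and the 20px default limit are all hypothetical, and ordinary small letter-spacing inside a run is ignored for brevity.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Glyph:
    text: str         # the character drawn
    gap_after: float  # extra horizontal gap to the next glyph, in px (assumed input)


def split_runs(glyphs, max_spacing):
    """Group glyphs into runs, closing a run after any gap above max_spacing."""
    runs, current = [], []
    for g in glyphs:
        current.append(g)
        if g.gap_after > max_spacing:
            runs.append(current)
            current = []
    if current:
        runs.append(current)
    return runs


def to_html(glyphs, max_spacing=20.0):
    """Emit one <span> per run; a large gap becomes a margin, not letter-spacing."""
    parts = []
    for run in split_runs(glyphs, max_spacing):
        text = escape("".join(g.text for g in run))
        gap = run[-1].gap_after
        if gap > max_spacing:
            # The oversized gap sits between elements, so "select word" stops here.
            parts.append(f'<span style="margin-right:{gap}px">{text}</span>')
        else:
            parts.append(f"<span>{text}</span>")
    return "".join(parts)


# Illustrative input modelled on the "O" and "Cl" columns described above.
row = [Glyph("O", 162.078), Glyph("C", 0.0), Glyph("l", 0.0)]
html = to_html(row)
```

With this rule "O" and "Cl" end up in separate elements, so double-click selects only "O", and no negative margin is needed to undo a huge letter-spacing on the following text.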

coolwanglu commented 9 years ago

I cannot access the link. I got a "403 Forbidden" error.

Usually letter spacing comes from the original PDF file, unless text optimization is turned on.