coolwanglu / pdf2htmlEX

Convert PDF to HTML without losing text or format.
http://coolwanglu.github.com/pdf2htmlEX/

Copy & pasted text does not match displayed text #659

Open JRaspass opened 8 years ago

JRaspass commented 8 years ago

The attached PDF is a minimal test case. bug.pdf

When run with no flags, this warning is produced:

ToUnicode CMap is not valid and got dropped for font: 1

and the output HTML looks correct, but the text pastes with wrong values, e.g.:

 for the string Lorem ipsum dolor

When run with the --tounicode 1 flag, this warning is produced:

Warning: encoding confliction detected in font: 1

and the output HTML looks and pastes fine, with the exception of the last three words, which contain the ligatures ff, fi, and fl. They paste as:

Eective condent air for the string Effective confident flair

Evince has no problems rendering and copy/pasting from this PDF.

The PDF was exported from LibreOffice 5.1 if that helps.

timretout commented 8 years ago

Started to debug this - here's the mapping from the CMap for that test file:

<01> <0020>
[...snip...]
<1D> <006600660069>
<1E> <00660066>
<1F> <00660069>
<20> <0066006C>

The problem occurs with the final char, which is the ligature "fl", i.e. http://graphemica.com/%EF%AC%82

When getCharName is called in unicode_from_font, it returns "space" as the name of this char. That name is then mapped back to 0x20, which collides with the very first character in the CMap.

Unicode unicode_from_font (CharCode code, GfxFont * font)
{
    if(!font->isCIDFont())
    {
        char * cname = dynamic_cast<Gfx8BitFont*>(font)->getCharName(code);
        if(cname)
        {
            Unicode ou = globalParams->mapNameToUnicodeText(cname);
            if(!is_illegal_unicode(ou))
                return ou;
        }
    }

    return map_to_private(code);
}

All the other ligatures fell through the "if(cname)" block and reached "map_to_private(code)".
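For reference, the map_to_private fallback can be sketched roughly like this (a minimal illustration, assuming a Private Use Area base of 0xE000; the actual implementation in src/util/unicode.cc may differ in details such as overflow handling):

```cpp
#include <cassert>
#include <cstdint>

using Unicode  = uint32_t;
using CharCode = uint32_t;

// Sketch: shift the raw char code into the BMP Private Use Area
// (U+E000..U+F8FF), so the resulting codepoint can never collide
// with a real mapping such as 0x20 ("space"). This is why the
// ligatures that reach this fallback paste incorrectly but still
// render: the embedded font carries glyphs at these PUA slots.
Unicode map_to_private_sketch(CharCode code)
{
    return 0xE000 + code;
}
```

Under this scheme a collision is impossible between fallback codepoints and ordinary ASCII, which is exactly what the name-based path above fails to guarantee.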

I'm not yet clear whether this is therefore a bug in Poppler.

timretout commented 8 years ago

I manually created some pdfs in LibreOffice, and I only see the problems with the font "DejaVu Sans". There are two issues:

Edit: DejaVu fonts happen to be on github and are a freedesktop.org project. The test pdf used version 2.35 of that font, but I can reproduce these issues with version 2.37. My copy of evince (i.e. poppler) renders all my test PDFs fine, and can search/copy the text correctly, so I think there must legitimately be a pdf2htmlEX bug with how ligatures are expanded.

timretout commented 8 years ago

This hack avoids the first problem, but doesn't fix the second:

--- a/src/util/unicode.cc
+++ b/src/util/unicode.cc
@@ -38,7 +38,7 @@ Unicode map_to_private(CharCode code)

 Unicode unicode_from_font (CharCode code, GfxFont * font)
 {
-    if(!font->isCIDFont())
+    if(!font->isCIDFont() && font->getName()->getChar(6) != '+')
     {
         char * cname = dynamic_cast<Gfx8BitFont*>(font)->getCharName(code);
         // may be untranslated ligature

This works on the basis that it doesn't make sense to map glyph names for subsetted fonts, and subsetted fonts always have a name consisting of six random characters followed by a plus sign. There's probably a better way to do that...
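A slightly more robust check is possible: per the PDF specification (ISO 32000-1, §9.6.4, "Font Subsets"), a subset font's BaseFont name is the original name prefixed by a tag of exactly six uppercase letters followed by '+'. A hypothetical helper (is_subset_font_name is my name, not part of the codebase) could validate all six letters rather than only testing position 6:

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper: returns true if the font name carries a PDF
// subset tag, e.g. "ABCDEF+DejaVuSans". The spec requires exactly
// six uppercase letters followed by '+'.
bool is_subset_font_name(const std::string & name)
{
    if (name.size() < 7 || name[6] != '+')
        return false;
    for (int i = 0; i < 6; ++i)
        if (!std::isupper(static_cast<unsigned char>(name[i])))
            return false;
    return true;
}
```

This would avoid false positives on non-subset fonts that merely happen to have a '+' at index 6.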

timretout commented 8 years ago

@coolwanglu This PDF and the derived HTML demonstrate the second problem more precisely, sidestepping the first bug (note that the "ToUnicode CMap is not valid and got dropped" warning seems to appear only when a ligature gets mapped to 0x20):

659-deja-vu-sans.pdf 659-deja-vu-sans.html.txt

The generated HTML contains: e4ective con5dent e6cient 7air e8uent, because the ligatures ended up in the normal places for those digits. [EDIT: But when viewed in a browser, this looks fine, because the digits are mapped to the ligatures by the font. I guess the ideal would be to decompose the ligatures in the HTML, but still have the font display the ligatures in the browser, if that's possible.]

Passing "--decompose-ligature 1" does avoid this issue, but then ligature glyphs are not shown at all in the HTML output.
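The decomposition step itself is straightforward for the ligatures that do have standard codepoints. A minimal sketch (decompose_ligature_sketch is an illustrative name, not the pdf2htmlEX API) using a lookup over the Unicode Alphabetic Presentation Forms:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

using Unicode = uint32_t;

// Sketch: map Latin ligature codepoints to their ASCII constituents,
// so copy & paste yields "ff" rather than U+FB00. Ligatures with no
// standard codepoint (like Carlito's "ti") cannot be covered by such
// a table; those need font-level substitution rules instead.
std::string decompose_ligature_sketch(Unicode u)
{
    static const std::unordered_map<Unicode, std::string> table = {
        { 0xFB00, "ff"  }, // LATIN SMALL LIGATURE FF
        { 0xFB01, "fi"  }, // LATIN SMALL LIGATURE FI
        { 0xFB02, "fl"  }, // LATIN SMALL LIGATURE FL
        { 0xFB03, "ffi" }, // LATIN SMALL LIGATURE FFI
        { 0xFB04, "ffl" }, // LATIN SMALL LIGATURE FFL
    };
    auto it = table.find(u);
    return it != table.end() ? it->second : std::string();
}
```

The hard part is the other direction: keeping the ligature glyph visible in the browser after the HTML text has been decomposed, which is where the missing substitution tables come in (see below).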

timretout commented 8 years ago

This appears to be a separate issue from #459, because in the generated output on that issue I can see "fi" in the HTML - I don't think that one used a subsetted font. @lano1106 mentioned these ligature problems on that report, though.

timretout commented 8 years ago

Comparing the subsetted "f1.ttf" font left in /tmp with the original DejaVu-Sans.ttf in fontforge, I can see that the substitution tables have been left out when embedding the font in the PDF. To get ligatures to work in the WOFF version, we'd need to recreate the necessary substitutions, I think, by looking at the decomposed unicode chars provided in the CMap.

The PDF's CMap does not provide a way to recover the original Unicode codepoints ("U+FB00" etc.) for the ligatures. I've also found references online to fonts such as Carlito, which provide a "ti" ligature with no standard Unicode codepoint for the glyph. So this suggests to me that we won't have much luck trying to place the ligatures at their "correct" codepoints and using those directly in the HTML - rather, the ligatures ought always to be decomposed in the HTML, and the font itself should be upgraded to handle the substitution.

Here's an example PDF with Carlito "ti" and "tti" ligatures, displaying similar problems: carlito-ligatures.pdf