UglyToad / PdfPig

Read and extract text and other content from PDFs in C# (port of PDFBox)
https://github.com/UglyToad/PdfPig/wiki
Apache License 2.0

some characters not extracted by PdfPig #687

Open Hert79 opened 1 year ago

Hert79 commented 1 year ago

See this PDF: https://library.oapen.org/bitstream/20.500.12657/35248/1/340082.pdf

On page index 324 (page 315 in the book's own numbering), look at the REFERENCES header. I can copy the characters out just fine from Edge or Acrobat Reader (the spacing is a bit wonky, but that is due to the header layout): re f ere n ces

but when PdfPig parses the document it returns me this:   f    n   

So it looks like some information is lost here.

JansXue commented 1 year ago

This is because the missing characters are encoded in the Unicode Private Use Area, where code points have no standard character assigned. Some software copies them correctly because it applies compatibility processing, but that processing does not conform to the PDF specification.
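To check which extracted characters fall in the Private Use Area, something like the following could be used. It is a minimal sketch that relies only on .NET's CharUnicodeInfo; the names PrivateUseCheck, IsPrivateUse and Describe are made up for illustration and are not part of PdfPig.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper, not part of PdfPig.
public static class PrivateUseCheck
{
    // True when the UTF-16 code unit is categorized as Private Use.
    public static bool IsPrivateUse(char c) =>
        CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.PrivateUse;

    // Lists each code unit of the text as "U+XXXX (Category)".
    public static List<string> Describe(string text)
    {
        var result = new List<string>();
        foreach (char c in text)
        {
            result.Add($"U+{(int)c:X4} ({CharUnicodeInfo.GetUnicodeCategory(c)})");
        }
        return result;
    }
}
```

Running Describe over the text PdfPig returns for the header should show which letters ended up as Private Use code points and which came through as ordinary letters.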

BobLd commented 1 year ago

To @JansXue's point, the problematic Unicode characters are indeed categorized as "Private Use" when created in CharacterMapBuilder.CreateStringFromBytes(byte[] bytes).

I'm still not sure why, but for some characters the first byte of the 2-byte array is 247 (0xF7) instead of 0. Updating CharacterMapBuilder.CreateStringFromBytes() as follows yields the correct values (see the implementation with a test: https://github.com/UglyToad/PdfPig/tree/687-some-characters-not-extracted-by-pdfpig):

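// Requires: using System.Globalization; using System.Linq; using System.Text;
private static string CreateStringFromBytes(byte[] bytes)
{
    if (bytes.Length == 1)
    {
        return OtherEncodings.BytesAsLatin1String(bytes);
    }

    string unicode = Encoding.BigEndianUnicode.GetString(bytes);

    // Guard against an empty string: GetUnicodeCategory throws if index 0 does not exist.
    if (unicode.Length > 0 && CharUnicodeInfo.GetUnicodeCategory(unicode, 0) == UnicodeCategory.PrivateUse)
    {
        // Work on a copy so the caller's array is left untouched.
        byte[] compat = bytes.ToArray();
        compat[0] = 0; // This value is 247 (0xF7) instead of 0

        unicode = Encoding.BigEndianUnicode.GetString(compat);

        if (CharUnicodeInfo.GetUnicodeCategory(unicode, 0) == UnicodeCategory.PrivateUse)
        {
            // Still Private Use after clearing the first byte.
            // Process further if need be
        }
    }

    return unicode;
}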

@JansXue do you have more information on how compatibility processing is done in other software? I'd like PdfPig to support it even if it does not conform to the PDF specification, since Acrobat Reader itself handles these characters properly.
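As a worked example of the byte arithmetic involved: the sample bytes 0xF7 0x72 decode as big-endian UTF-16 to U+F772, and clearing the first byte leaves 0x00 0x72, which is U+0072 ('r'). The sketch below only demonstrates that step in isolation; HighByteExample and Decode are hypothetical names, not PdfPig APIs.

```csharp
using System;
using System.Text;

// Hypothetical demonstration class, not part of PdfPig.
public static class HighByteExample
{
    // Decodes big-endian UTF-16 bytes, optionally zeroing the first byte beforehand.
    public static string Decode(byte[] bytes, bool clearFirstByte)
    {
        // Clone so the caller's array is not modified.
        byte[] copy = (byte[])bytes.Clone();
        if (clearFirstByte && copy.Length > 1)
        {
            copy[0] = 0;
        }

        return Encoding.BigEndianUnicode.GetString(copy);
    }
}
```

With new byte[] { 0xF7, 0x72 } as input, Decode(..., false) gives the Private Use character U+F772 and Decode(..., true) gives "r".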

JansXue commented 1 year ago

@BobLd Sorry, I don't know the details of the compatibility processing; I am only speculating based on the results. I tested pdf.js, pdfbox, and itext, and none of them were compatible, while Edge and WPS were. I just looked at the source code of pdfium. It seems that a valid Unicode range is predefined first, and then each character is checked against that range; if it is not in range, the 1st byte is ignored. However, pdfium is written in C++, which I'm not familiar with.
Where the unicode is obtained: https://github.com/chromium/pdfium/blob/4ae353f1e22efea86262f9cdd4f0e8478f142182/core/fpdftext/cpdf_textpage.cpp#L80
Where the unicode data is defined: https://github.com/chromium/pdfium/blob/4ae353f1e22efea86262f9cdd4f0e8478f142182/core/fpdftext/unicodenormalizationdata.cpp
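The approach described above (accept characters in a known range, otherwise fall back to the low byte) could be sketched roughly as follows. This is not a port of pdfium: the "accepted" test here simply treats anything outside the Private Use category as valid, which is an assumption made for illustration, and RangeFallback, IsAccepted and Normalize are hypothetical names.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical sketch of a range check with a low-byte fallback.
public static class RangeFallback
{
    // Placeholder for a predefined "valid" range: everything that is not Private Use.
    // pdfium's actual table is different; this predicate is an assumption.
    public static bool IsAccepted(char c) =>
        CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.PrivateUse;

    // Keeps accepted characters; for the rest, retries with only the low byte.
    public static string Normalize(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            if (IsAccepted(c))
            {
                sb.Append(c);
                continue;
            }

            char low = (char)(c & 0xFF);

            // Keep the original character when the low byte is only a control code.
            sb.Append(char.IsControl(low) ? c : low);
        }

        return sb.ToString();
    }
}
```

For example, Normalize("\uF772e\uF766") returns "ref", while a Private Use character whose low byte is a control code, such as U+F700, is left unchanged.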

EliotJones commented 1 year ago

Will look into this further next week, but it seems the Adobe Glyph List has some mappings in this reserved area that correspond to the characters in the PDF: https://scripts.sil.org/cms/scripts/page.php?item_id=PUAinAdobeGlyphList

For instance, the first character is U+F772, which is mapped to Rsmall in the list. This mapping is also in the glyphlist file; I'd have to follow the mapping through to see where it should be consumed.

EliotJones commented 1 year ago

Finally got PDFBox running on my current machine. It looks like whatever correction PDFium and Adobe are doing isn't done by PDFBox or PDF.js. Here's the PDFBox content around the problematic text:

The ‘weapon graves’ are possibly those of new ancestors that in some instances 
– but not all – underline those claims. As is usually the case, new interpretations engender new questions. 
I have tried to emphasise the fact that the burial rites of Northern Gaul in late Roman times are as rich 
a source for the study of norms, values and ideas as those of other societies subjected to anthropological 
study and that to interpret them as representing primarily ethnic identities is too one-dimensional.
  f    n   
Ament, H., 1978: Romanen an Rhein und Mosel im frühen Mittelalter, Archäologische Bemühungen 

EliotJones commented 1 year ago

Rsmall exists in both glyphlist.txt and the CFF string identifiers file, but nowhere can I find a remapping to Unicode other than https://github.com/deepin-community/lcdf-typetools/blob/master/texglyphlist-g2u.txt#L178. That seems to suggest this PUA-to-Unicode remapping isn't specified anywhere; it is just added as a workaround.