UglyToad / PdfPig

Read and extract text and other content from PDFs in C# (port of PDFBox)
https://github.com/UglyToad/PdfPig/wiki
Apache License 2.0
1.72k stars, 240 forks

wrong characters #286

Open simonedd opened 3 years ago

simonedd commented 3 years ago

Hi,

When I read the attached file, I get some strange, wrong characters, such as DĂĕŽŶŶĞƌŝĞĞŶĚƵŝƚĞ. I tested it with the AdvancedTextExtraction sample. Attachment: Facade.pdf

Thanks, Simone

Poltuu commented 3 years ago

As far as I can tell, those words really are in the PDF somehow. I would suggest trying another library and comparing the output to confirm whether the same text appears there. The text could be hidden behind images, for instance.

simonedd commented 3 years ago

You're right, I get the same wrong characters with another library (tested with iText 7). But I don't see any images in the file, and when I open it with Adobe the text looks like regular text. Could something else be covering the word? The following pictures show an example of a word that looks incorrect. [image] [image]

BobLd commented 3 years ago

Out of curiosity, how did you render the second image? And what are these red boxes? (I guess the first one is a screenshot from Acrobat Reader.)

simonedd commented 3 years ago

I'm rendering it with SharpDX. The red boxes are the GlyphRectangle of the first letter of each word. As you can see, I have a problem with the text size: it is a little bigger than it should be.

EliotJones commented 3 years ago

Copying text from the PDF in both Firefox and Edge (Chromium) copies these same letters, which suggests they are genuinely in the content of the document. It may be that they are overlaid by the rectangles of the annotations.

This flags a potential need to determine which paths and glyphs appear above or below each other.

Poltuu commented 3 years ago

I might be totally wrong, but I also noticed the presence of "optional content" in the PDF, and I wondered whether it had something to do with this. The optional content itself is unrelated to the issue (it is other, correct text), but I had never come across that before.

simonedd commented 3 years ago

Yes, I get these characters when copying from the browser too, but I don't get any annotations from Page.ExperimentalAccess.GetAnnotations().

simonedd commented 3 years ago

Hi,

I have another file with the same issue of wrong characters, e.g. "ĭZ\P:". I don't see any annotations covering the text, so I guess there is something wrong with the character decoding. Attachment: JWO_IS_eksportpdf..pdf
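For anyone trying to narrow this down, it can help to look at the exact code points an extractor returns instead of the rendered glyphs. The following is only a rough diagnostic sketch in Python, not part of PdfPig; the helper name `describe_code_points` is made up for illustration, and the sample string is the one quoted at the top of this issue.

```python
import unicodedata


def describe_code_points(text):
    """Return (character, 'U+XXXX', Unicode name) for each character in text.

    Hypothetical helper for illustration only; it is not a PdfPig API.
    """
    # Normalize first so composed and decomposed forms are reported the same way.
    normalized = unicodedata.normalize("NFC", text)
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in normalized
    ]


if __name__ == "__main__":
    # Sample taken from the first comment in this thread.
    sample = "DĂĕŽŶŶĞƌŝĞĞŶĚƵŝƚĞ"
    for ch, code_point, name in describe_code_points(sample):
        print(ch, code_point, name)
```

If the listed code points are the same across libraries and viewers, that points to the mapping stored in the file rather than to one extractor's decoding.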

dkiltyhc commented 2 years ago

This also applies to form fields. It seems to be an issue with extended character sets. For example, in one of the fields,

"Formulaire de déclaration obligatoire des incidents", accented characters and apostrophes are rendered strangely. When I output the characters to JSON in my app, they somehow get escaped to Unicode like this: "Formulaire de d\u00E9claration obligatoire des incidents"

md-mm_form-fra-test1.pdf
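A note on the JSON output quoted above: a `\u00E9` escape is a standard JSON way of writing "é", and many serializers emit it by default for non-ASCII characters, so on its own it does not show that the text was extracted incorrectly. The sketch below uses Python's standard `json` module purely as an illustration (the app in question may use a different serializer); the field text is the one from this comment.

```python
import json

original = "Formulaire de déclaration obligatoire des incidents"

# By default json.dumps escapes every non-ASCII character as \uXXXX.
escaped = json.dumps(original)

# ensure_ascii=False keeps the accented character as-is in the output.
unescaped = json.dumps(original, ensure_ascii=False)

# Parsing the escaped form gives back the original string unchanged.
round_tripped = json.loads(escaped)

# The hex digits of an escape are case-insensitive, so \u00E9 parses the same way.
from_uppercase = json.loads('"d\\u00E9claration"')

print(escaped)
print(unescaped)
print(round_tripped == original)
print(from_uppercase)
```

If the value read back from the JSON matches the field text, the escaping is only a serialization detail and any strange rendering would have to come from a later step.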