Open simonedd opened 3 years ago
As far as I can tell, those words really are in the PDF somehow. I would suggest trying another library and comparing the output to confirm whether that text appears there too. The text could be hidden behind the images, for instance.
You're right, I get the same wrong characters with another library (tested with iText 7). But I don't see any images in the file, and when I open it with Adobe the text looks like normal text. Could something else be covering the word? In the following pictures, you can see an example of a word that looks incorrect.
Out of curiosity, how did you render the second image? And what are these red boxes? (I guess the first one is a screenshot from Acrobat reader)
I'm rendering it with SharpDX. The red boxes are the GlyphRectangle of the first letter of each word. As you can see, I have a problem with the text size: it is a little too big.
Copying text from the PDF in both Firefox and Edge (Chromium) also copies these letters, which suggests they are genuinely in the content of the document. They may be overlaid by the rectangles of the annotations.
This flags a potential need to determine which paths and glyphs are drawn above or below each other.
I might be totally wrong, but I also noticed the presence of "optional content" in the PDF, and I wondered whether it has something to do with this. The optional content itself is unrelated to the issue (it is other, correct text), but I had never come across it before.
Yes, copying from the browser I get these characters too, but I don't get any annotations from Page.ExperimentalAccess.GetAnnotations().
Hi,
I have another file with the same issue of wrong characters, e.g. "ĭZ\P:". I don't see any annotations covering the text, so I guess something is wrong with the character decoding. JWO_IS_eksportpdf..pdf
This also applies to form fields. It seems to be an issue with extended character sets. For example, in one of the fields, "Formulaire de déclaration obligatoire des incidents", accented characters and apostrophes are rendered strangely. When I output the characters to JSON in my app, they somehow get escaped to Unicode like this: "Formulaire de d\u00E9claration obligatoire des incidents"
Hi,
When I read the attached file I get some strange, wrong characters, like DĂĕŽŶŶĞƌŝĞĞŶĚƵŝƚĞ. I tested it with the AdvancedTextExtraction sample. Facade.pdf
Thanks, Simone
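To compare what different libraries return for the same word, it can help to look at the actual code points instead of the rendered glyphs. Below is a minimal sketch in Python, independent of any PDF library; `describe_chars` is a hypothetical helper name, and the sample is the start of the string quoted above.

```python
import unicodedata


def describe_chars(text):
    """Return (character, code point, Unicode name) for each character in text."""
    rows = []
    for ch in text:
        code_point = f"U+{ord(ch):04X}"
        # Fall back to a placeholder for characters without a Unicode name.
        name = unicodedata.name(ch, "<unnamed>")
        rows.append((ch, code_point, name))
    return rows


if __name__ == "__main__":
    # Beginning of the extracted text reported in this thread.
    sample = "DĂĕŽŶŶĞ"
    for ch, code_point, name in describe_chars(sample):
        print(ch, code_point, name)
```

If two libraries report the same code points, the values most likely come from the document itself. If most of them fall outside the range expected for the document's language, that might point to the font's character codes being mapped to the wrong Unicode values rather than to hidden content, but that is only a guess and has not been confirmed here.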