UglyToad / PdfPig

Read and extract text and other content from PDFs in C# (port of PDFBox)
https://github.com/UglyToad/PdfPig/wiki
Apache License 2.0

Missing character #688

Closed JansXue closed 9 months ago

JansXue commented 1 year ago

https://github.com/UglyToad/PdfPig/blob/4a480ffd7f30b35e42fc9d664c0891a5344bfae4/src/UglyToad.PdfPig/Graphics/ContentStreamProcessor.cs#L443

Because StringTokenizer decides the encoding of the string, there is a special case: when the string contains only the two codes 0xFE and 0xFF, StringToken's Data will be empty, and if the font is Type1C, the characters are lost here. So maybe this should use StringToken.GetBytes() directly instead of OtherEncodings.StringAsLatin1Bytes(((StringToken)token).Data).
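The failure mode described above can be modelled in a few lines. This is only a sketch, written in Python as a language-neutral illustration and not taken from PdfPig: `decode_string_token`, `bytes_via_text`, and `bytes_direct` are hypothetical names, and the model assumes the tokenizer treats a leading 0xFE 0xFF as the UTF-16BE byte order mark and drops it before decoding the rest.

```python
UTF16_BE_BOM = b"\xfe\xff"


def decode_string_token(raw: bytes) -> str:
    """Hypothetical stand-in for a tokenizer that guesses the text encoding."""
    if raw.startswith(UTF16_BE_BOM):
        # Assumption: a leading FE FF is read as a byte order mark and dropped,
        # and the remaining bytes are decoded as UTF-16BE.
        return raw[2:].decode("utf-16-be", errors="replace")
    return raw.decode("latin-1")


def bytes_via_text(raw: bytes) -> bytes:
    """Lossy path: decode to text first, then re-encode the text as Latin-1."""
    return decode_string_token(raw).encode("latin-1", errors="replace")


def bytes_direct(raw: bytes) -> bytes:
    """Lossless path: hand the raw string bytes to the font unchanged."""
    return raw


# A show-text operand made of exactly the two character codes 0xFE and 0xFF.
operand = b"\xfe\xff"
lost = bytes_via_text(operand)   # b"" (both character codes vanish)
kept = bytes_direct(operand)     # b"\xfe\xff" (both character codes survive)
```

For a font that maps single-byte codes to glyphs, the character codes are data rather than text, so under these assumptions only the direct path keeps codes 0xFE and 0xFF available for glyph lookup.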

BobLd commented 1 year ago

Hi @JansXue do you have a document example for that?

JansXue commented 1 year ago

Indexed-DeviceRGB-JPXDecode-0-[0.0,255.0]-Font-F1_1_missing_char_255.pdf

[image: missing_char]

The characters in the red box in the picture are the missing characters.

JansXue commented 1 year ago

@BobLd FontName: T1_1

[image: missing_char2]

BobLd commented 1 year ago

@JansXue thanks a lot for the document and details, will look into that shortly.

EDIT: Also happy for you to create a PR with the change (unit tests would be amazing).

JansXue commented 1 year ago

@BobLd Sorry, my current network environment temporarily prevents me from pushing code to GitHub. If I can push code later, I will be happy to submit a PR.

BobLd commented 1 year ago

@JansXue no worries, I'll try to look into that when I have time (happy for someone else to look into that too)

BobLd commented 9 months ago

Closing issue as fixed in #763