UglyToad / PdfPig

Read and extract text and other content from PDFs in C# (port of PDFBox)
https://github.com/UglyToad/PdfPig/wiki
Apache License 2.0

Unable to read the file. Throwing error "Value was either too large or too small for an Int32" #405

Closed securedoccheck closed 2 years ago

securedoccheck commented 2 years ago

Hi,

Thanks for providing us with this great tool.

I'm currently using v0.1.5 and have an issue when opening certain PDF files. Most PDFs work fine, but a small number throw the stack trace below.

at System.Decimal.ToInt32(Decimal d)
at UglyToad.PdfPig.Tokens.NumericToken.get_Int()
at UglyToad.PdfPig.Encryption.EncryptionDictionaryFactory.Read(DictionaryToken encryptionDictionary, IPdfTokenScanner tokenScanner)
at UglyToad.PdfPig.Parser.PdfDocumentFactory.ParseTrailer(CrossReferenceTable crossReferenceTable, Boolean isLenientParsing, IPdfTokenScanner pdfTokenScanner, EncryptionDictionary& encryptionDictionary)
at UglyToad.PdfPig.Parser.PdfDocumentFactory.OpenDocument(IInputBytes inputBytes, ISeekableTokenScanner scanner, ILog log, Boolean isLenientParsing, IReadOnlyList`1 passwords, Boolean clipPaths)
at UglyToad.PdfPig.Parser.PdfDocumentFactory.Open(IInputBytes inputBytes, ParsingOptions options)
at UglyToad.PdfPig.PdfDocument.Open(Byte[] fileBytes, ParsingOptions options)

The error occurs on the line of code below, which opens the file from a byte array or a Stream.

PdfDocument document = PdfDocument.Open(verifyFile.fileByte, parOpt);

Could you please help look into this? I really appreciate your help!

Thanks, Andy

EliotJones commented 2 years ago

Hi Andy,

Are you able to share a file that causes this problem, please? I'm assuming there's some token that we don't support properly yet, but it will be file-specific and so hard to diagnose without the source file.

If not, you can open the PDF in Notepad++ or similar and find the contents of a line like:

<</Root 22 0 R/Info 1 0 R/Encrypt 55 0 R/ID[<462216FABF3B2DFEA6DBF82A292C4BDB><CECB22F1C4A0E419EC8F65D520C078DE>]/Size 56>>

The most important parts here are /Root and /Encrypt XX YY R. Next, find the corresponding line containing XX YY obj (55 0 obj in the example above), then let me know the content between the double angle brackets << and >>:

55 0 obj<</Filter/Standard/U(tºln¼éú8äÄŸVË?YrA¢A͐ÙªÎï^Aˆõ̳¢`<Åùa½âæø‡)/O(ípCeœ:qÂ*w¸>Ê÷žXýMcqOÏ[„¡jA!Ë>R2DÐ W4Y\)Ô\) )/P -1028/Length 256/R 6/EncryptMetadata true/UE(ãŸp”ÓÕݝã5y\)”¶u‡éŽ=Þ2èêÛæ"Ÿ†)/OE(ÿƒ„pïãwCƒeQŸ`µEyq¯÷A§Uß,[Û£¥)/Perms(ÚKÇ©;OWŽQHXåPI,)/CF<</StdCF<</Length 32/AuthEvent/DocOpen/CFM/AESV3>>>>/StmF/StdCF/StrF/StdCF/V 5>>
endobj

securedoccheck commented 2 years ago

Hi Eliot,

Below are the lines from the PDF. I'll also try to attach the file within a day. It contains some PII that cannot be shared as it is.

trailer << /Size 18 /Info 2 0 R /Root 17 0 R /ID[ <30383134353431332d303046462d343645462d423846462d453046383732454241373438> <30383134353431332d303046462d343645462d423846462d453046383732454241373438> ] /Encrypt 1 0 R

Second part:

1 0 obj << /Filter /Standard /V 1 /Length 40 /R 2 /O <2055c756c72e1ad702608e8196acad447ad32d17cff583235f6dd15fed7dab67> /U /P 4294967292 >> endobj
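
The /P 4294967292 entry above is the only number in this dictionary that is larger than Int32.MaxValue (2147483647), which is consistent with Decimal.ToInt32 throwing in the stack trace. Reading the same 32 bits as a signed integer gives -4. The sketch below illustrates that arithmetic only. `PermissionFlags` and `ToSigned` are hypothetical names, not part of PdfPig, and this is not necessarily how the fix in the library was implemented.

```csharp
using System;

// Hypothetical helper for illustration; not a PdfPig API.
public static class PermissionFlags
{
    // Converts a /P value to a signed 32-bit integer, accepting values
    // that a PDF writer emitted as an unsigned 32-bit number.
    public static int ToSigned(decimal raw)
    {
        // Values already in the Int32 range convert directly.
        if (raw >= int.MinValue && raw <= int.MaxValue)
        {
            return (int)raw;
        }

        // Values in (Int32.MaxValue, UInt32.MaxValue] are reinterpreted
        // bit-for-bit as signed, e.g. 4294967292 (0xFFFFFFFC) becomes -4.
        if (raw > int.MaxValue && raw <= uint.MaxValue)
        {
            return unchecked((int)(uint)raw);
        }

        throw new OverflowException("Value does not fit in 32 bits: " + raw);
    }
}
```

With this helper, `PermissionFlags.ToSigned(4294967292m)` returns -4, whereas `decimal.ToInt32(4294967292m)` throws an OverflowException.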

securedoccheck commented 2 years ago

Second example:

<</DecodeParms<</Columns 3/Predictor 12>>/Encrypt 8 0 R/Filter/FlateDecode/ID[<43383843373044462D464130422D343836322D424546422D313645383545414139334442><849BD7D0EF733C4583465A492531B962>]/Info 6 0 R/Length 37/Root 9 0 R/Size 7/Type/XRef/W[1 2 0]>>stream hÞbb``bœû†‰ßŽ‰¡—‰ñ;cpÍ :™ endstream endobj startxref 116 %%EOF

Second part:

8 0 obj <</Filter/Standard/Length 40/O( UÇVÇ.×`Ž–¬­DzÓ-Ïõƒ#_mÑ_í}«g)/P 4294967292/R 2/U(¼ÑYÍ~7…œž¾pðþƒ8ÐÿkñB²J¦ºqÈ ‰¬)/V 1>> endobj

securedoccheck commented 2 years ago

Attaching the sample file Sample_file.pdf

EliotJones commented 2 years ago

@securedoccheck thanks for providing that, can you give this version a go and see if it resolves the problem? https://www.nuget.org/packages/PdfPig/0.1.6-alpha-20220111-41bfa

securedoccheck commented 2 years ago

@EliotJones, thanks for the fix. I tested it and it has resolved the original issue.

While validating a few other files for testing purposes, I ran into two other issues.

First Issue

Unrecognized encryption token in trailer: null.

at UglyToad.PdfPig.Parser.PdfDocumentFactory.ParseTrailer(CrossReferenceTable crossReferenceTable, Boolean isLenientParsing, IPdfTokenScanner pdfTokenScanner, EncryptionDictionary& encryptionDictionary)
at UglyToad.PdfPig.Parser.PdfDocumentFactory.OpenDocument(IInputBytes inputBytes, ISeekableTokenScanner scanner, ILog log, Boolean isLenientParsing, IReadOnlyList`1 passwords, Boolean clipPaths)
at UglyToad.PdfPig.Parser.PdfDocumentFactory.Open(IInputBytes inputBytes, ParsingOptions options)
at UglyToad.PdfPig.PdfDocument.Open(Byte[] fileBytes, ParsingOptions options)

Second Issue

Expected name as dictionary key, instead got: {ì%,©¹ŽêjXåvFVÕXðiÙrGÍbyçŸ3Þò

at UglyToad.PdfPig.Tokenization.DictionaryTokenizer.ConvertToDictionary(List`1 tokens)
at UglyToad.PdfPig.Tokenization.DictionaryTokenizer.TryTokenize(Byte currentByte, IInputBytes inputBytes, IToken& token)
at UglyToad.PdfPig.Tokenization.Scanner.CoreTokenScanner.MoveNext()
at UglyToad.PdfPig.Tokenization.ArrayTokenizer.TryTokenize(Byte currentByte, IInputBytes inputBytes, IToken& token)
at UglyToad.PdfPig.Tokenization.Scanner.CoreTokenScanner.MoveNext()
at UglyToad.PdfPig.Tokenization.ArrayTokenizer.TryTokenize(Byte currentByte, IInputBytes inputBytes, IToken& token)
at UglyToad.PdfPig.Tokenization.Scanner.CoreTokenScanner.MoveNext()
at UglyToad.PdfPig.Tokenization.ArrayTokenizer.TryTokenize(Byte currentByte, IInputBytes inputBytes, IToken& token)
at UglyToad.PdfPig.Tokenization.Scanner.CoreTokenScanner.MoveNext()
at UglyToad.PdfPig.Parser.FileStructure.FileHeaderParser.Parse(ISeekableTokenScanner scanner, Boolean isLenientParsing, ILog log)
at UglyToad.PdfPig.Parser.PdfDocumentFactory.OpenDocument(IInputBytes inputBytes, ISeekableTokenScanner scanner, ILog log, Boolean isLenientParsing, IReadOnlyList`1 passwords, Boolean clipPaths)
at UglyToad.PdfPig.Parser.PdfDocumentFactory.Open(IInputBytes inputBytes, ParsingOptions options)
at UglyToad.PdfPig.PdfDocument.Open(Byte[] fileBytes, ParsingOptions options)

securedoccheck commented 2 years ago

For the first error, below is the data/stream from the PDF:

/Type /XRef /Root 8 0 R /Prev 116 /Length 84 /Size 35 /W [1 3 2] /Index [0 1 6 1 8 2 25 10] /ID [ (ù¸7ãAםžòÜ4Š•)] /Info 6 0 R /Encrypt null

stream ÿÿ ‘J  ‘ô  ¢&  ’Ô  ¡÷  ¢  ¡µ  £~  £—  ¢  “  ¤  ž½
endstream endobj startxref 695997 %%EOF

securedoccheck commented 2 years ago

For the second error:

trailer <</Info 11 0 R/ID [<12c3f89648e55841eb5ff9a221b07f4b>]/Root 10 0 R/Size 12>> startxref 109774 %%EOF

It looks like there is no /Encrypt entry or <</DecodeParms section within the PDF.

EliotJones commented 2 years ago

@securedoccheck there should be a new NuGet package shipping at midnight UTC with the fix for the first of the two issues. For the second, it looks like the error occurs at the start of the file. Can you copy a few lines from the start of the file? They will probably look like:

%PDF-1.6
%âãÏÓ
7 0 obj
<</Linearized 1/L 7259/O 10/E 2883/N 1/T 6916/H [ 504 132]>>
endobj

13 0 obj

Please copy down as far as the first dictionary (<< ... >>) occurrence, since that looks to be where the problem is.

securedoccheck commented 2 years ago

¬í ur [Ljava.lang.Object;ÎXŸs)l xp sr java.util.HashMapÚÁÃ`Ñ F loadFactorI thresholdxp?@ w  t ETagt %W/"1ae56-SgR20izELlBssSALl/6suyfyfqw"t Access-Control-Allow-Credentialst truet Connectiont keep-alivet Content-Lengtht 110166t Access-Control-Allow-Headerst .Origin, X-Requested-With, Content-Type, Acceptt Datet Wed, 05 Jan 2022 10:50:21 GMTt X-Powered-Byt Expresst Content-Typet application/pdfxur [B¬óøTà xp ®V%PDF-1.4 %âãÏÓ

securedoccheck commented 2 years ago

I see what you mean; it may be an issue with the way the PDF was created. I was able to read the document once I removed the extra content before %PDF-1.4.

Thanks Eliot. Really appreciate the help!
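
For files like the one above, where unrelated bytes precede the %PDF- header (here what appears to be a serialized Java object wrapping HTTP response headers), the manual workaround can be sketched in code as follows. `PdfHeaderTrimmer` and its methods are hypothetical names for illustration, not part of PdfPig, and the sketch assumes the first occurrence of `%PDF-` in the data is the real header.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; not a PdfPig API.
public static class PdfHeaderTrimmer
{
    private static readonly byte[] Marker = Encoding.ASCII.GetBytes("%PDF-");

    // Returns the index of the first "%PDF-" marker, or -1 if none is found.
    public static int IndexOfHeader(byte[] data)
    {
        for (int i = 0; i <= data.Length - Marker.Length; i++)
        {
            bool match = true;
            for (int j = 0; j < Marker.Length; j++)
            {
                if (data[i + j] != Marker[j])
                {
                    match = false;
                    break;
                }
            }

            if (match)
            {
                return i;
            }
        }

        return -1;
    }

    // Drops any bytes before the header. The input is returned unchanged
    // when the header is already at offset 0 or is not found at all.
    public static byte[] TrimToHeader(byte[] data)
    {
        int start = IndexOfHeader(data);
        if (start <= 0)
        {
            return data;
        }

        byte[] trimmed = new byte[data.Length - start];
        Buffer.BlockCopy(data, start, trimmed, 0, trimmed.Length);
        return trimmed;
    }
}
```

The trimmed array could then be passed to PdfDocument.Open in place of the original bytes. Fixing whatever produced the extra prefix is the more robust option, since byte offsets stored inside the PDF may assume the file starts at the header.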

EliotJones commented 2 years ago

Closing this since I think it was resolved, let me know if you encounter any issues.