UglyToad / PdfPig

Read and extract text and other content from PDFs in C# (port of PDFBox)
https://github.com/UglyToad/PdfPig/wiki
Apache License 2.0

File exception: 'UglyToad.PdfPig.Core.PdfDocumentFormatException' was thrown. #860

Open FinalFrontierPrototyping opened 6 days ago

FinalFrontierPrototyping commented 6 days ago

Hello,

I found this really nice project because I need to read and process many PDF files. (At the moment I am using V0.19-Alpha, but I also tested V0.18.) The PDF file can be opened with Adobe; however, when I try to read it with PdfPig, an error is thrown.

Once in a while I get the following exception while reading a file: var document = PdfDocument.Open(fileEntry);

'Exception of type 'UglyToad.PdfPig.Core.PdfDocumentFormatException' was thrown.'

UglyToad.PdfPig.Core.PdfDocumentFormatException
HResult=0x80131500
Message=Exception of type 'UglyToad.PdfPig.Core.PdfDocumentFormatException' was thrown.
Source=UglyToad.PdfPig
StackTrace:
   at UglyToad.PdfPig.Parser.FileStructure.CrossReferenceParser.Parse(IInputBytes bytes, Boolean isLenientParsing, Int64 crossReferenceLocation, Int64 offsetCorrection, IPdfTokenScanner pdfScanner, ISeekableTokenScanner tokenScanner)
   at UglyToad.PdfPig.Parser.PdfDocumentFactory.OpenDocument(IInputBytes inputBytes, ISeekableTokenScanner scanner, InternalParsingOptions parsingOptions)

Since the PDF files are confidential, I cannot share them. What can be the cause?

Thanks.

FinalFrontierPrototyping commented 6 days ago

I noticed that when I open the file, add one character to a field, save it, and reprocess it, no error is thrown.

FinalFrontierPrototyping commented 2 days ago

Is there anything I can provide to support you as efficiently as possible? This issue is making my current tool non-functional because 5% of the PDF files cannot be processed.

EliotJones commented 2 days ago

Unfortunately this error can be due to basically any unexpected formatting in the source file. Without the source file it is very difficult to tell.

The stack trace suggests the error happens while parsing the cross-reference (xref) table near the end of the document, which looks like:

xref
0 103
0000000000 65535 f 
0000058002 00000 n 
0000000019 00000 n 
0000001903 00000 n 
0000058273 00000 n
...
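
For a file that cannot be shared, one rough way to inspect this part locally is to check what the trailing startxref offset points at. The following is only a sketch under the assumption that the file follows the usual layout (a startxref keyword, a byte offset, then %%EOF); XrefProbe and Describe are hypothetical names, not part of PdfPig, and the code uses only the .NET base class library.

```csharp
using System;
using System.Text;

// Hypothetical helper (not part of PdfPig): reports what the last
// "startxref" offset in a PDF points at.
public static class XrefProbe
{
    public static string Describe(byte[] pdf)
    {
        // ISO-8859-1 maps every byte to exactly one char,
        // so string indexes are equal to byte offsets.
        string text = Encoding.GetEncoding("iso-8859-1").GetString(pdf);

        int marker = text.LastIndexOf("startxref", StringComparison.Ordinal);
        if (marker < 0)
            return "no startxref keyword found";

        // The byte offset of the cross-reference section follows the keyword.
        int pos = marker + "startxref".Length;
        while (pos < text.Length && char.IsWhiteSpace(text[pos])) pos++;
        int start = pos;
        while (pos < text.Length && text[pos] >= '0' && text[pos] <= '9') pos++;
        if (pos == start || !long.TryParse(text.Substring(start, pos - start), out long offset))
            return "startxref is not followed by a number";

        if (offset >= text.Length)
            return $"startxref offset {offset} is past the end of the file ({text.Length} bytes)";

        // A classic table starts with "xref"; a cross-reference stream
        // starts with an object header such as "12 0 obj".
        string window = text.Substring((int)offset, (int)Math.Min(20, text.Length - offset));
        if (window.StartsWith("xref", StringComparison.Ordinal))
            return $"offset {offset}: classic xref table";
        if (char.IsDigit(window[0]) && window.IndexOf(" obj", StringComparison.Ordinal) > 0)
            return $"offset {offset}: object header (possibly an xref stream)";
        return $"offset {offset}: unexpected content";
    }
}
```

It could be called as Console.WriteLine(XrefProbe.Describe(File.ReadAllBytes(path))). Running it on both a failing file and a copy that was re-saved may show whether the two differ at this point, without revealing any of the document's content.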

It might be possible to get more information about the error locally by debugging the PdfPig code. You can clone this repository and set it to the version of .NET you have available locally with this script: https://github.com/UglyToad/PdfPig/blob/master/tools/set-dotnet-version.ps1

Then you can load the file in a test and see what is going wrong: https://github.com/UglyToad/PdfPig/blob/master/src/UglyToad.PdfPig.Tests/Integration/LocalTests.cs
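
Before stepping through the parser, it may help to narrow down which files fail. The sketch below is not taken from the PdfPig test project; it assumes the UglyToad.PdfPig package is referenced, and OpenReport and FindFailures are hypothetical names. It tries to open every PDF in a folder and collects the ones that throw PdfDocumentFormatException, so the rest of a batch can still be processed.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Core;

// Hypothetical helper: lists the PDFs in a folder that PdfPig cannot open.
public static class OpenReport
{
    public static List<(string Path, PdfDocumentFormatException Error)> FindFailures(string folder)
    {
        var failures = new List<(string Path, PdfDocumentFormatException Error)>();

        foreach (string path in Directory.EnumerateFiles(folder, "*.pdf"))
        {
            try
            {
                using (PdfDocument document = PdfDocument.Open(path))
                {
                    // Read the page count so the document structure is actually used.
                    _ = document.NumberOfPages;
                }
            }
            catch (PdfDocumentFormatException ex)
            {
                // Keep the exception so its stack trace can be compared across files.
                failures.Add((path, ex));
            }
        }

        return failures;
    }
}
```

Each entry's Error.StackTrace can then be printed next to its path; if all failing files stop in CrossReferenceParser.Parse, that would suggest they share the same kind of formatting problem.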