UglyToad / PdfPig

Read and extract text and other content from PDFs in C# (port of PDFBox)
https://github.com/UglyToad/PdfPig/wiki
Apache License 2.0

PdfDocument.GetPage "Could not find the object number 7 0 with type StreamToken." #317

Closed M4nju closed 1 year ago

M4nju commented 3 years ago

Hello there,

I get an exception when trying to get the first page of a specific document. The error message is "Could not find the object number 7 0 with type StreamToken.". I updated to the newest pre-release of PdfPig (0.1.5-alpha001), but sadly the issue still exists.

I am not able to upload the document because it contains sensitive information. I redacted the file, but then it no longer throws the exception. If I can give you any other information that would help you track down the issue, please ask. It's a very specific case affecting only one PDF file (a Microsoft invoice).

Stack-Trace:

   at UglyToad.PdfPig.Parser.Parts.DirectObjectFinder.Get[T](IndirectReference reference, IPdfTokenScanner scanner)
   at UglyToad.PdfPig.Parser.Parts.DirectObjectFinder.Get[T](IToken token, IPdfTokenScanner scanner)
   at UglyToad.PdfPig.Parser.PageFactory.Create(Int32 number, DictionaryToken dictionary, PageTreeMembers pageTreeMembers, Boolean clipPaths)
   at UglyToad.PdfPig.Content.Pages.GetPage(Int32 pageNumber, Boolean clipPaths)
   at UglyToad.PdfPig.PdfDocument.GetPage(Int32 pageNumber)
   at ConsoleApp14.Program.Main(String[] args) in E:\VisualStudio Projekte\ConsoleApp14\ConsoleApp14\Program.cs:line 25

Used Source-Code:

using (UglyToad.PdfPig.PdfDocument document =
    UglyToad.PdfPig.PdfDocument.Open(@"E:\Desktop\test.pdf"))
{
    // PdfPig page numbers are 1-based, so this requests the first page.
    UglyToad.PdfPig.Content.Page page = document.GetPage(0 + 1);
}

Any information would help, thanks.

Kind Regards, Manju

InusualZ commented 3 years ago

The exception that you are getting seems to suggest that your PDF document is malformed. Since you can't share the PDF, could you test the document with this build: UglyToad.PdfPig.0.1.5-alpha001.zip

I should clarify that it should still throw an exception. I'm more interested in the error message.

M4nju commented 3 years ago

I am really sorry that I didn't respond; I somehow missed the GitHub email. I will test it today and come back to you with the error message.

M4nju commented 3 years ago

@InusualZ Okay, I wasn't able to import the NuGet package you attached here, as it is missing the reference to PdfPig.Core. So I downloaded the source code from this repository and built it. I now get some additional error messages that may help. I will also try to investigate whether I can find the issue myself.

[image]

[image]

The Error Message stays the same with this StackTrace:

   at UglyToad.PdfPig.Parser.Parts.DirectObjectFinder.Get[T](IndirectReference reference, IPdfTokenScanner scanner) in E:\Desktop\PdfPig-0.1.5-alpha002\src\UglyToad.PdfPig\Parser\Parts\DirectObjectFinder.cs:line 79
   at UglyToad.PdfPig.Parser.Parts.DirectObjectFinder.Get[T](IToken token, IPdfTokenScanner scanner) in E:\Desktop\PdfPig-0.1.5-alpha002\src\UglyToad.PdfPig\Parser\Parts\DirectObjectFinder.cs:line 91
   at UglyToad.PdfPig.Parser.PageFactory.Create(Int32 number, DictionaryToken dictionary, PageTreeMembers pageTreeMembers, Boolean clipPaths) in E:\Desktop\PdfPig-0.1.5-alpha002\src\UglyToad.PdfPig\Parser\PageFactory.cs:line 137
   at UglyToad.PdfPig.Content.Pages.GetPage(Int32 pageNumber, Boolean clipPaths) in E:\Desktop\PdfPig-0.1.5-alpha002\src\UglyToad.PdfPig\Content\Pages.cs:line 66
   at UglyToad.PdfPig.PdfDocument.GetPage(Int32 pageNumber) in E:\Desktop\PdfPig-0.1.5-alpha002\src\UglyToad.PdfPig\PdfDocument.cs:line 169
   at ConsoleApp9.Program.Main(String[] args) in C:\Users\david\source\repos\ConsoleApp9\ConsoleApp9\Program.cs:line 17

M4nju commented 3 years ago

OK, so I had a more in-depth look at where the first error occurs. Inside the PdfTokenScanner, in the method MoveNext, this else branch is entered: [image]

The start of readTokens looks like this: [image]

And the end looks like this (I cannot show the full data as it may contain sensitive information, but it looks like everything else is just the Flate-encoded stream data, with no PDF tags in between): [image]

It seems a bit strange to me that it reads the whole stream in as tokens.

M4nju commented 3 years ago

OK, so the method TryReadStream fails to read the stream at this point: [image]

The actual byte at the start of the stream is 32, i.e. a space character. If I add this whitespace to the if condition, the error goes away, but the resulting PdfPig page has no text in it. So I guess the stream inside the PDF is broken?
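For context: the PDF specification requires the `stream` keyword to be followed by either CR LF or a lone LF, so a space (byte 32) directly after the keyword is non-conformant, and a strict reader will not recognise the stream. The following is only a minimal, self-contained sketch of how a lenient reader could tolerate that byte. It is not PdfPig's actual `TryReadStream` code; the names `StreamStartSketch` and `FindDataStart` are hypothetical, and it assumes the stray space is still followed by a normal end-of-line marker.

```csharp
using System;
using System.Text;

// Hypothetical helper, NOT part of PdfPig: locates the first data byte of a
// PDF stream, optionally tolerating stray spaces after the "stream" keyword.
public static class StreamStartSketch
{
    private static readonly byte[] Keyword = Encoding.ASCII.GetBytes("stream");

    // Returns the index of the first stream data byte, or -1 if the bytes at
    // keywordOffset are not "stream" followed by an end-of-line marker.
    public static int FindDataStart(byte[] buffer, int keywordOffset, bool lenient)
    {
        if (keywordOffset < 0 || keywordOffset + Keyword.Length > buffer.Length)
        {
            return -1;
        }

        for (int i = 0; i < Keyword.Length; i++)
        {
            if (buffer[keywordOffset + i] != Keyword[i])
            {
                return -1;
            }
        }

        int pos = keywordOffset + Keyword.Length;

        // Assumption: in lenient mode, spaces (byte 32) between the keyword
        // and the end-of-line marker are skipped instead of rejected.
        if (lenient)
        {
            while (pos < buffer.Length && buffer[pos] == (byte)' ')
            {
                pos++;
            }
        }

        // A lone LF is a valid end-of-line marker after "stream".
        if (pos < buffer.Length && buffer[pos] == (byte)'\n')
        {
            return pos + 1;
        }

        // CR LF is valid too; a CR on its own is not.
        if (pos + 1 < buffer.Length && buffer[pos] == (byte)'\r' && buffer[pos + 1] == (byte)'\n')
        {
            return pos + 2;
        }

        return -1;
    }
}
```

With the bytes of `"stream \r\nABC"` and `keywordOffset` 0, the strict call returns -1 and the lenient call returns 9, the index of `A`. Skipping the space only fixes where the data starts; if the stream's `/Length` or the compressed data itself is off, the page can still come back without text.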

InusualZ commented 3 years ago

In this picture, the exception that you are getting is not the same as the first one. Are you sure that you tested the same document?

Could you please try again with this one: UglyToad.PdfPig.0.1.5-alpha001.zip

If the package doesn't work again, could you try setting a breakpoint here and printing the type (name) of temp?

If that doesn't work, could you try setting a breakpoint here and printing the type (name) of token?

Also, which Microsoft service is that invoice from? I tested one from a test subscription that I have in Azure and it seems to work. G000000000.pdf

M4nju commented 3 years ago

@InusualZ The resulting error message stays the same. The windows with the red X are just the visualisation of warnings, which I hadn't seen before because I was using the release build.

The NuGet package didn't work again because of the missing dependencies.

So I set a breakpoint. Execution did not go into the method "T Get(IToken token, IPdfTokenScanner scanner)" but instead into "T Get(IndirectReference reference, IPdfTokenScanner scanner)", where the exception was thrown.

Here are the values:

reference = {7 0}
typeof(T).Name = "StreamToken"

M4nju commented 3 years ago

My guess is that because the stream isn't read properly, it cannot find the object.
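When the file itself cannot be shared, one way to give maintainers something concrete is to report just the few bytes that follow the `stream` keyword of the failing object. The sketch below is a naive, self-contained probe that does not use PdfPig at all. `ObjectProbe` and `BytesAfterStreamKeyword` are hypothetical names, and it assumes the object is stored uncompressed in the file body (not inside an object stream) and that the first `stream` keyword after the `7 0 obj` header belongs to that object.

```csharp
using System;
using System.Text;

// Hypothetical diagnostic, NOT part of PdfPig: reports the raw bytes that
// follow the "stream" keyword of a given indirect object.
public static class ObjectProbe
{
    // Naive byte-pattern search; returns -1 when the pattern is not found.
    private static int IndexOf(byte[] haystack, byte[] needle, int start)
    {
        for (int i = Math.Max(start, 0); i <= haystack.Length - needle.Length; i++)
        {
            int j = 0;
            while (j < needle.Length && haystack[i + j] == needle[j])
            {
                j++;
            }

            if (j == needle.Length)
            {
                return i;
            }
        }

        return -1;
    }

    // Returns up to `count` bytes after the first "stream" keyword that
    // follows "<objectNumber> <generation> obj", or an empty array if the
    // object header or the keyword cannot be found.
    public static byte[] BytesAfterStreamKeyword(byte[] pdf, int objectNumber, int generation, int count)
    {
        byte[] header = Encoding.ASCII.GetBytes($"{objectNumber} {generation} obj");
        byte[] keyword = Encoding.ASCII.GetBytes("stream");

        int obj = -1;
        int search = 0;
        while ((search = IndexOf(pdf, header, search)) >= 0)
        {
            // Reject matches such as "17 0 obj" when looking for "7 0 obj".
            bool digitBefore = search > 0 && pdf[search - 1] >= (byte)'0' && pdf[search - 1] <= (byte)'9';
            if (!digitBefore)
            {
                obj = search;
                break;
            }

            search++;
        }

        if (obj < 0)
        {
            return Array.Empty<byte>();
        }

        int kw = IndexOf(pdf, keyword, obj + header.Length);
        if (kw < 0)
        {
            return Array.Empty<byte>();
        }

        int from = kw + keyword.Length;
        byte[] result = new byte[Math.Max(0, Math.Min(count, pdf.Length - from))];
        Array.Copy(pdf, from, result, 0, result.Length);
        return result;
    }
}
```

A possible use is `string.Join(" ", ObjectProbe.BytesAfterStreamKeyword(File.ReadAllBytes(path), 7, 0, 4))`, where `path` is a placeholder for the affected file. An output beginning with `32` would confirm the stray space described above, while `13 10` or `10` would indicate a conformant end-of-line marker. Only a handful of byte values are printed, so nothing sensitive from the invoice has to be shared.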

EliotJones commented 1 year ago

Too difficult to fix without a file