UglyToad / PdfPig

Read and extract text and other content from PDFs in C# (port of PDFBox)
https://github.com/UglyToad/PdfPig/wiki
Apache License 2.0

"Object reference not set to an instance of an object." #874

Closed rklec closed 1 month ago

rklec commented 3 months ago

Steps to reproduce (STR)

Call PdfDocument.Open(pdfBytes) with a particular PDF file. As it contains sensitive data, I unfortunately cannot attach it here, and I was unable to create a minimal example, but here are some hints:

The file looks much like this, and I tried to reproduce the error with this example, but it does not work: (image)

Thus, I am only attaching this image, because the problem is not reproducible with the PDF I created.

What happens

System.NullReferenceException
  HResult=0x80004003
  Message = Object reference not set to an instance of an object.
  Source = UglyToad.PdfPig
  StackTrace:
   at UglyToad.PdfPig.PdfExtensions.TryGet[T](DictionaryToken dictionary, NameToken name, IPdfTokenScanner tokenScanner, T& token)

Apparently, this is the line of failure: https://github.com/UglyToad/PdfPig/blob/a99c0d25bfe76e4e7a919a42c52c99022ac769d3/src/UglyToad.PdfPig/PdfExtensions.cs#L24

What should happen

At least a PdfDocumentFormatException should be thrown if you consider the file invalid.

However, IMHO, the file is valid and can be opened with both Adobe Acrobat Reader and Firefox. Thus, actually parsing it would be good.

Also, when opening it with Adobe Acrobat Reader and re-saving it, it can be parsed!

System

PdfPig 0.1.8, reproducible on Windows 10

Internal reference: 2118

BobLd commented 3 months ago

Hi @rklec, it's going to be complicated to help you without the document...

Can you try with the latest version of PdfPig (pre-release 1.9.0, available via NuGet packages)?

jmjohnson05 commented 2 months ago

I'm running into this issue as well with the attached document (ErcotFacts.pdf). If I set SkipMissingFonts to true, the above exception gets thrown. When that option is not specified, I get the following exception instead:

   at UglyToad.PdfPig.Util.DictionaryTokenExtensions.GetNameOrDefault(DictionaryToken dictionaryToken, NameToken name)
   at UglyToad.PdfPig.PdfFonts.Parser.Handlers.Type0FontHandler.ParseDescendant(DictionaryToken dictionary)
   at UglyToad.PdfPig.PdfFonts.Parser.Handlers.Type0FontHandler.Generate(DictionaryToken dictionary)
   at UglyToad.PdfPig.PdfFonts.FontFactory.Get(DictionaryToken dictionary)
   at UglyToad.PdfPig.Content.ResourceStore.LoadFontDictionary(DictionaryToken fontDictionary)
   at UglyToad.PdfPig.Content.ResourceStore.LoadResourceDictionary(DictionaryToken resourceDictionary)
   at UglyToad.PdfPig.Content.BasePageFactory`1.Create(Int32 number, DictionaryToken dictionary, PageTreeMembers pageTreeMembers, NamedDestinations namedDestinations)
   at UglyToad.PdfPig.Content.Pages.GetPage[TPage](IPageFactory`1 pageFactory, Int32 pageNumber, NamedDestinations namedDestinations, ParsingOptions parsingOptions)
   at UglyToad.PdfPig.Content.Pages.GetPage(Int32 pageNumber, NamedDestinations namedDestinations, ParsingOptions parsingOptions)
   at UglyToad.PdfPig.PdfDocument.GetPage(Int32 pageNumber)
   at UglyToad.PdfPig.PdfDocument.<GetPages>d__34.MoveNext()
   at System.Collections.Generic.LargeArrayBuilder`1.AddRange(IEnumerable`1 items)
   at System.Collections.Generic.EnumerableHelpers.ToArray[T](IEnumerable`1 source)
   at System.Linq.SystemCore_EnumerableDebugView`1.get_Items()

Any help with a fix for this would be greatly appreciated!

rklec commented 2 months ago

Surprisingly, the linked ErcotFacts.pdf does not throw for me. (It was encoded and decoded in an email, though.)

jmjohnson05 commented 2 months ago

Hi @rklec, I should have clarified: the exception I'm seeing occurs when calling the GetPages() method.

For example:

using PdfDocument? document = PdfDocument.Open( stream );

if ( document is null )
{
    _logger.LogWarning( "Failed to open PDF document" );

    return result;
}

foreach ( var pg in document.GetPages() ) 
{
    _logger.LogInformation( "Processing page {PageNumber}", pg.Number );
}

BobLd commented 2 months ago

Thanks for sharing the document. I've created a PR that fixes the issue when SkipMissingFonts = true.

jmjohnson05 commented 2 months ago

Much appreciated @BobLd