empira / PDFsharp-1.5

A .NET library for processing PDF
MIT License

Many PDF docs from MSWord do not open (file does not appear corrupted) #135

Closed Brandon2255p closed 1 year ago

Brandon2255p commented 3 years ago

Reporting an Issue Here

The attached PDF was exported from Microsoft Word. I ran it through several online validators, and they all report it as valid PDF 1.3.

Example.pdf

When doing

using (var pdfDocument = PdfReader.Open(pdfStream, PdfDocumentOpenMode.Import))
{
    CopyPages(pdfDocument, outPdf);
}

Opening the file throws an exception.

Expected Behavior

The file should open, because it is not corrupted.

Actual Behavior

"Invalid entry in XRef table, ID=8, Generation=0, Position=0, ID of referenced object=4, Generation of referenced object=0"

Steps to Reproduce the Behavior

using (var pdfDocument = PdfReader.Open(pdfStream, PdfDocumentOpenMode.Import))
{
    CopyPages(pdfDocument, outPdf);
}

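
The snippet above leaves out `CopyPages`, `outPdf`, and `pdfStream`. A minimal self-contained sketch of the repro might look like the following. It assumes `CopyPages` simply appends every imported page to the output document (the real helper is not shown in this report), and the `Repro` class and the command-line arguments are hypothetical names chosen for illustration. The catch block is deliberately broad, since the report does not say which exception type is thrown.

```csharp
using System;
using System.IO;
using PdfSharp.Pdf;
using PdfSharp.Pdf.IO;

static class Repro
{
    // Hypothetical stand-in for the CopyPages helper, which the report does not show.
    // Assumption: it appends each page of the imported document to the output document.
    static void CopyPages(PdfDocument from, PdfDocument to)
    {
        for (int i = 0; i < from.PageCount; i++)
        {
            to.AddPage(from.Pages[i]);
        }
    }

    // Usage: Repro <input.pdf> <output.pdf>
    static void Main(string[] args)
    {
        using (var outPdf = new PdfDocument())
        using (var pdfStream = File.OpenRead(args[0]))
        {
            try
            {
                // Import mode is required when pages are copied into another document.
                using (var pdfDocument = PdfReader.Open(pdfStream, PdfDocumentOpenMode.Import))
                {
                    CopyPages(pdfDocument, outPdf);
                }
                outPdf.Save(args[1]);
            }
            catch (Exception ex)
            {
                // Print the exception type and message so the XRef error can be quoted exactly.
                Console.Error.WriteLine(ex.GetType().FullName + ": " + ex.Message);
            }
        }
    }
}
```

Running this against Example.pdf should surface the same "Invalid entry in XRef table" message if the problem lies in the file rather than in the surrounding application code.
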
Stealcase commented 3 years ago

We are having the same problem. Are you generating the PDFs in Word on macOS? This appears to be the culprit in our case.

Brandon2255p commented 3 years ago

The PDFs could very well have been generated in Word on a Mac. I did not create the file, nor can I trace who did, but we have run into this a few times so far. Good observation, thanks!

jsauvain commented 3 years ago

On a Mac, Word gives you two options when creating the PDF: best for printing, or best for online usage. If you select the best-for-printing option, you will not be able to open the file in PDFsharp. It is indeed unfortunate that Word creates invalid PDFs in that case, but the library must tolerate those issues anyway; otherwise it is not really usable.

ThomasHoevel commented 1 year ago

> It is indeed unfortunate that Word creates invalid PDF in that case but anyways, the library must ignore those issues otherwise it is not really usable

The files contain contradictory information. PDFsharp 6.0.0 now ignores one piece of that information and relies on the other, so files from Word should work.