empira / PDFsharp-1.5

A .NET library for processing PDF
MIT License
1.28k stars 588 forks

PDFSharp Fixes #39

Open mlaukala opened 6 years ago

mlaukala commented 6 years ago

This is my Release branch that I will be working from going forward. I tried to keep the fixes modular, in the hope that the changes in each previous pull request would be easy to identify.

As I work with PDFs from many PDF producers, I have been able to make numerous fixes for out-of-spec PDFs. If Adobe can read the file, my app must be able to read it as well. As a result, dealing with merge conflicts on my Release branch is becoming a major issue.

TH-Soft commented 6 years ago

Thanks for all your code changes. I plan to incorporate them after a stable version of PDFsharp 1.50 has been published. It would be nice if you could provide non-confidential PDFs that allow us to evaluate the changes.

mlaukala commented 6 years ago

I'll need to confirm that I can provide the PDFs. Could be a couple of days.

mlaukala commented 6 years ago

I've been extremely busy over the past few months and things have finally slowed down a bit. I cannot provide PDFs at this time that will reproduce the errors I was getting. I do hope to create example PDFs that reproduce them, which will also make my own testing easier when updates and new fixes are implemented. Before that happens, I'll try to find the time to revisit the fixes I have made and write better comments for them, describing the exact issue and the PDF producer that caused it, along with a link to the relevant section of the Adobe spec.

mlaukala commented 6 years ago

This latest fix is for an invalid startxref byte offset. If the xref table cannot be found at the specified byte offset, it is assumed that all byte offsets are incorrect, and the xref table and trailer are rebuilt.
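For readers who want a feel for this kind of recovery, here is a minimal Python sketch. It is not the code from this PR (PDFsharp is C#); the function names `xref_offset_is_valid` and `rebuild_xref` are hypothetical, and the sketch assumes the whole file is in memory and that object headers can be found with a plain byte scan. A real parser would also need to skip text inside streams that happens to look like an object header.

```python
import re

# Matches an indirect object header such as "12 0 obj" at a token boundary.
OBJ_HEADER = re.compile(rb"(?<![0-9A-Za-z])(\d+)\s+(\d+)\s+obj\b")
STARTXREF = re.compile(rb"startxref\s+(\d+)")


def xref_offset_is_valid(data: bytes) -> bool:
    """True if the last startxref offset points at an xref table or an object header."""
    matches = STARTXREF.findall(data)
    if not matches:
        return False
    offset = int(matches[-1])
    if offset >= len(data):
        return False
    # Tolerate a little leading whitespace at the target offset.
    tail = data[offset:offset + 32].lstrip(b"\x00\t\n\x0c\r ")
    # A classic table starts with "xref"; an xref stream starts with "N G obj".
    return tail.startswith(b"xref") or OBJ_HEADER.match(tail) is not None


def rebuild_xref(data: bytes) -> dict:
    """Map object number -> (generation, byte offset) by scanning for object headers."""
    table = {}
    for m in OBJ_HEADER.finditer(data):
        num, gen = int(m.group(1)), int(m.group(2))
        prev = table.get(num)
        # Prefer higher generations; on ties the later definition in the file wins.
        if prev is None or gen >= prev[0]:
            table[num] = (gen, m.start())
    return table
```

A caller would run `rebuild_xref` only when `xref_offset_is_valid` returns `False`, and would then still need to locate or synthesize a trailer.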

mlaukala commented 6 years ago

Made an amendment that makes sure the latest-generation root/catalog is used.
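One way to picture that selection, as a rough Python sketch rather than the PR's actual C# change: scan every object, keep those whose dictionary declares `/Type /Catalog`, and pick the one with the highest generation number, breaking ties by the later position in the file. The name `find_latest_catalog` and the tie-breaking rule are assumptions made for illustration.

```python
import re

# An indirect object: "N G obj ... endobj" (body captured non-greedily).
OBJ = re.compile(rb"(?<![0-9A-Za-z])(\d+)\s+(\d+)\s+obj\b(.*?)endobj", re.DOTALL)
CATALOG = re.compile(rb"/Type\s*/Catalog\b")


def find_latest_catalog(data: bytes):
    """Return (object number, generation) of the catalog to use, or None."""
    best = None
    for m in OBJ.finditer(data):
        if not CATALOG.search(m.group(3)):
            continue
        # Compare by (generation, offset) so newer generations and later copies win.
        candidate = (int(m.group(2)), m.start(), int(m.group(1)))
        if best is None or candidate[:2] > best[:2]:
            best = candidate
    return None if best is None else (best[2], best[0])
```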

mlaukala commented 5 years ago

The endstream checks were not looking for an EOL character before the endstream keyword, which caused massive slowdowns when reading huge PDF files with a lot of streams.
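A minimal Python sketch of the idea, under the assumption that the check should accept an `endstream` keyword only when the byte before it is a CR or LF. This is one reading of the fix, not the PR's C# code, and `find_endstream` is a hypothetical name.

```python
EOL = b"\r\n"  # the two end-of-line bytes: carriage return and line feed


def find_endstream(data: bytes, start: int) -> int:
    """Offset of the first 'endstream' at or after start that follows CR or LF, else -1."""
    pos = data.find(b"endstream", start)
    while pos != -1:
        # Accept the keyword only when an EOL byte comes right before it.
        if pos > 0 and data[pos - 1] in EOL:
            return pos
        pos = data.find(b"endstream", pos + 1)
    return -1
```

Using `bytes.find` to jump between candidates, instead of testing the keyword at every byte, keeps the scan cheap on large streams; the EOL test then rejects occurrences of the word inside stream data that are not at the start of a line.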

aggsol commented 5 years ago

Will this ever be merged?

mlaukala commented 5 years ago

> Will this ever be merged?

Sadly, probably not. I am not able to supply them with PDFs that reproduce the errors, nor do I have the time to create sample PDFs that duplicate them. It's on my very long list of things to do, but it's not a high priority, so it keeps getting pushed back.

leonardobaggio commented 5 years ago

Hi @MLaukala, amazing work on this PR, thank you! Corrupted PDFs have been a long-standing headache for me. I'm planning to use your forked version of PDFSharp, but I don't know whether there is another way to consume it besides referencing the project directly in my solution and building it locally. Do you have any suggestions for getting an integration similar to what NuGet provides, but using this fork?

mlaukala commented 5 years ago

I do not, sorry.

ken-sands commented 5 years ago

Applying these fixes actually caused PDFs to become corrupted on saving and reopening for me. With these in place, opening and saving a PDF, then opening and saving it again, would end up with elements missing, colours inverted, all sorts of stuff. If the PDF is opened and saved with pdftk or similar after each save, it can be brought back from the dead. While it looks like a great effort towards handling PDFs with issues, it currently causes more issues than it solves for us.

mlaukala commented 5 years ago

Care to provide a sample PDF and code? I would love to attempt to work out what is causing the problem. We work with thousands of PDFs a day. 99% of the time, none of our PDFs have any issues. Of the ones that do fail, it's usually a result of the PDF producer not following the PDF specification.

ken-sands commented 5 years ago

Yes, but I'll have to edit one to remove details, and I will need you to agree to use it for testing only, delete it after testing, and not pass it on at all. Is there a way to directly message you with a PDF?

ken-sands commented 5 years ago

I've chopped the personalised pages from my PDF so I can share it with you (though it's still a customer document, so I can't make it public, unfortunately). My Discord tag is captain_ken#8332. I'm UK based (GMT timezone). I've just run a test on the chopped PDF with a fresh build of your release (just in case it was other tweaks I have that were causing it). The same issues persist.

mlaukala commented 5 years ago

I just sent you a request on Discord; I am MJ#2945. I'll be out of town until Sunday evening. I should be able to take a look at some point next week.

ken-sands commented 5 years ago

Yep, cool. I've sent you PDFs and rambling details on what happens; that should be enough for you to get the same results.

alshezawi commented 4 years ago

Thank you. Your fork helped me solve the "Unexpected character '0xffff' in PDF stream" and "PDF corrupted" errors.