UglyToad / PdfPig

Read and extract text and other content from PDFs in C# (port of PDFBox)
https://github.com/UglyToad/PdfPig/wiki
Apache License 2.0
1.67k stars, 238 forks

Issue reading from Pubmed (different from first one) #293

Closed. christopher5106 closed this issue 3 years ago

christopher5106 commented 3 years ago

Opening this document with PdfPig throws the exception below: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2835191/pdf/mt200953a.pdf

Unhandled exception. System.IndexOutOfRangeException: Index was outside the bounds of the array.
   at UglyToad.PdfPig.Core.ByteArrayInputBytes.Seek(Int64 position)
   at UglyToad.PdfPig.Tokenization.Scanner.CoreTokenScanner.Seek(Int64 position)
   at UglyToad.PdfPig.Parser.FileStructure.XrefOffsetValidator.CheckXRefOffset(Int64 startXRefOffset, ISeekableTokenScanner scanner, IInputBytes inputBytes, Boolean isLenientParsing)
   at UglyToad.PdfPig.Parser.FileStructure.CrossReferenceOffsetValidator.Validate(Int64 crossReferenceOffset, ISeekableTokenScanner scanner, IInputBytes bytes, Boolean isLenientParsing)
   at UglyToad.PdfPig.Parser.PdfDocumentFactory.OpenDocument(IInputBytes inputBytes, ISeekableTokenScanner scanner, ILog log, Boolean isLenientParsing, IReadOnlyList`1 passwords, Boolean clipPaths)
   at UglyToad.PdfPig.Parser.PdfDocumentFactory.Open(IInputBytes inputBytes, ParsingOptions options)
   at UglyToad.PdfPig.Parser.PdfDocumentFactory.Open(Byte[] fileBytes, ParsingOptions options)
   at UglyToad.PdfPig.Parser.PdfDocumentFactory.Open(String filename, ParsingOptions options)
   at UglyToad.PdfPig.PdfDocument.Open(String filePath, ParsingOptions options)
   at PdfToJsonExporter.Application.Process(FileInfo fileInfo) in /mnt/c/Users/cbo/apps/document_structure_comprehension_dataset/PdfPig/Application.cs:line 86
   at PdfToJsonExporter.Application.Process(DirectoryInfo directoryInfo) in /mnt/c/Users/cbo/apps/document_structure_comprehension_dataset/PdfPig/Application.cs:line 64
   at PdfToJsonExporter.Application.Run() in /mnt/c/Users/cbo/apps/document_structure_comprehension_dataset/PdfPig/Application.cs:line 42
   at PdfToJsonExporter.Program.<>c.<Main>b__1_0(CommandLineOptions o) in /mnt/c/Users/cbo/apps/document_structure_comprehension_dataset/PdfPig/Program.cs:line 18
   at CommandLine.ParserResultExtensions.WithParsed[T](ParserResult`1 result, Action`1 action)
   at PdfToJsonExporter.Program.Main(String[] args) in /mnt/c/Users/cbo/apps/document_structure_comprehension_dataset/PdfPig/Program.cs:line 15

plaisted commented 3 years ago

Just taking a quick peek, it looks like the PDF has a bad startxref offset:

startxref
1226474
%%EOF

The byte offset 1226474 is larger than the total size of the document. The document is linearized, so I'm guessing that readers which open it without trouble are using the linearized xref data instead of the trailing one. I know PdfPig tries to repair this sort of thing in lenient parsing mode, but it must not be able to in this case.
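
To make the diagnosis above concrete, here is a minimal sketch of how one could check a file for this condition before handing it to a parser. It uses only the .NET standard library; `StartXrefCheck`, `FindLastStartXref` and `IsOffsetInsideFile` are hypothetical names made up for this illustration and are not part of PdfPig, and the check is not a description of what PdfPig does internally. It assumes the trailing `startxref` keyword sits in the last 1024 bytes of the file, which is where readers conventionally look for it.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only; not part of PdfPig.
public static class StartXrefCheck
{
    // Returns the number following the last "startxref" keyword found in the
    // final 1024 bytes of the data, or null if no keyword or number is found.
    public static long? FindLastStartXref(byte[] data)
    {
        int tailLength = Math.Min(data.Length, 1024);
        // ASCII decoding yields exactly one char per byte, which is all we need
        // to locate the keyword and the digits that follow it.
        string tail = Encoding.ASCII.GetString(data, data.Length - tailLength, tailLength);

        int keyword = tail.LastIndexOf("startxref", StringComparison.Ordinal);
        if (keyword < 0)
        {
            return null;
        }

        // Skip the whitespace (usually a line break) after the keyword.
        int start = keyword + "startxref".Length;
        while (start < tail.Length && char.IsWhiteSpace(tail[start]))
        {
            start++;
        }

        // Collect the run of decimal digits that forms the offset.
        int end = start;
        while (end < tail.Length && tail[end] >= '0' && tail[end] <= '9')
        {
            end++;
        }

        if (end == start)
        {
            return null;
        }

        return long.TryParse(tail.Substring(start, end - start), out long offset)
            ? offset
            : (long?)null;
    }

    // True only when an offset was found and it points inside the data.
    public static bool IsOffsetInsideFile(byte[] data)
    {
        long? offset = FindLastStartXref(data);
        return offset.HasValue && offset.Value < data.Length;
    }
}

public static class Demo
{
    public static void Main()
    {
        // A tiny stand-in for the end of the problem file: the offset from the
        // issue, followed by %%EOF. A real file would be read with File.ReadAllBytes.
        byte[] sample = Encoding.ASCII.GetBytes(
            "%PDF-1.4\n% truncated sample\nstartxref\n1226474\n%%EOF\n");

        long? offset = StartXrefCheck.FindLastStartXref(sample);
        bool inside = StartXrefCheck.IsOffsetInsideFile(sample);

        // inside is false here: 1226474 is far larger than the sample's length.
        Console.WriteLine($"startxref offset: {offset}, length: {sample.Length}, inside file: {inside}");
    }
}
```

A check like this only flags the symptom. A file that fails it may still be readable if the parser can fall back to another cross-reference source, such as the linearized xref data mentioned above.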

christopher5106 commented 3 years ago

Thank you. I think this library is worthwhile because of your responsiveness, and because it is open source and written in C#. So don't worry if I post more issues that are less critical; it is because the library looks good.