Open romanimm opened 6 months ago
I am getting the same error as [romanimm], plus a null reference exception and "Invalid ColorSpace token encountered". Would you please investigate and push a fix?
No font descriptor indirect reference found in the TrueType font: <BaseFont, /KVGATS+GNElliot-Bold>, <Encoding, /WinAnsiEncoding>, <FirstChar, 32>, <FontDescriptor, 48 0>, <LastChar, 117>, <Subtype, /Type1>, <ToUnicode, 71 0>, <Type, /Font>, <Widths, [ 240, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 265, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 681, 0, 0, 0, 0, 0, 0, 777, 302, 0, 0, 0, 865, 777, 777, 0, 0, 660, 588, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 551, 0, 0, 601, 551, 383, 601, 601, 259, 0, 0, 0, 0, 601, 601, 0, 0, 401, 463, 415, 601 ]>.
[08/23/2024 16:50:29 > d0b8e7: INFO] at UglyToad.PdfPig.PdfFonts.Parser.FontDictionaryAccessHelper.GetFontDescriptor(IPdfTokenScanner pdfScanner, DictionaryToken dictionary)
Invalid ColorSpace token encountered in page resource dictionary: 655 0.
[08/23/2024 16:39:56 > d0b8e7: INFO] at UglyToad.PdfPig.Content.ResourceStore.LoadResourceDictionary(DictionaryToken resourceDictionary, InternalParsingOptions parsingOptions)
[08/23/2024 16:39:56 > d0b8e7: INFO] at UglyToad.PdfPig.Graphics.ContentStreamProcessor.ProcessFormXObject(StreamToken formStream)
[08/23/2024 16:39:56 > d0b8e7: INFO] at UglyToad.PdfPig.Graphics.ContentStreamProcessor.ApplyXObject(NameToken xObjectName)
[08/23/2024 16:39:56 > d0b8e7: INFO] at UglyToad.PdfPig.Graphics.Operations.InvokeNamedXObject.Run(IOperationContext operationContext)
[08/23/2024 16:39:56 > d0b8e7: INFO] at UglyToad.PdfPig.Graphics.ContentStreamProcessor.ProcessOperations(IReadOnlyList`1 operations)
[08/23/2024 16:39:56 > d0b8e7: INFO] at UglyToad.PdfPig.Graphics.ContentStreamProcessor.Process(Int32 pageNumberCurrent, IReadOnlyList`1 operations)
[08/23/2024 16:39:56 > d0b8e7: INFO] at UglyToad.PdfPig.Parser.PageFactory.GetContent(Int32 pageNumber, IReadOnlyList`1 contentBytes, CropBox cropBox, UserSpaceUnit userSpaceUnit, PageRotationDegrees rotation, MediaBox mediaBox, InternalParsingOptions parsingOptions)
[08/23/2024 16:39:56 > d0b8e7: INFO] at UglyToad.PdfPig.Parser.PageFactory.Create(Int32 number, DictionaryToken dictionary, PageTreeMembers pageTreeMembers, NamedDestinations namedDestinations, InternalParsingOptions parsingOptions)
[08/23/2024 16:39:56 > d0b8e7: INFO] at UglyToad.PdfPig.Content.Pages.GetPage(Int32 pageNumber, NamedDestinations namedDestinations, InternalParsingOptions parsingOptions)
[08/23/2024 16:39:56 > d0b8e7: INFO] at UglyToad.PdfPig.PdfDocument.GetPage(Int32 pageNumber)
[08/23/2024 16:39:56 > d0b8e7: INFO] at UglyToad.PdfPig.PdfDocument.GetPages()+MoveNext()
@mahmoodali31 can you share the problematic pdf file?
@BobLd I cannot share the PDF as it is confidential. I am using a PDF stream to extract content. I double-checked the PDF. It contains an image and text at the bottom.
StringBuilder sb = new();
using var document = UglyToad.PdfPig.PdfDocument.Open(stream, new UglyToad.PdfPig.ParsingOptions() { UseLenientParsing = true, SkipMissingFonts = true });
int count = document.NumberOfPages;
for (int i = 1; i <= count; i++)
{
    var page = document.GetPage(i);
    var letters = page.Letters;
    var wordExtractor = NearestNeighbourWordExtractor.Instance;
    var words = wordExtractor.GetWords(letters);
    sb.AppendLine(string.Join(" ", words.Select(w => w.Text)));
}
return sb.ToString();
I've stumbled upon this pdf, which throws an InvalidFontFormatException upon calling document.GetPages() or GetPage(x). I've tried with different ParsingOptions:
SkipMissingFonts = true gives a null pointer exception
UseLenientParsing = true has no effect
Log:
Tested with 0.1.8 and 0.1.9-alpha-20240419-1ef2e on Windows and Linux (Alpine).