UglyToad / PdfPig

Read and extract text and other content from PDFs in C# (port of PDFBox)
https://github.com/UglyToad/PdfPig/wiki
Apache License 2.0

Unable to retrieve text from some PDF documents #796

Open: securigy opened this issue 3 months ago

securigy commented 3 months ago

I have been using the library for a while now. However, today I noticed that if I save web content as a PDF using the Microsoft PDF driver (that is, printing to PDF), the code is unable to retrieve the text. Here is one such example that I printed to PDF: https://healingthebody.ca/4-natural-proven-cancer-remedies/

and here is the code:

    using (PdfDocument document = PdfDocument.Open(fileStream))
    {
        PdfDocInfo pdfDocInfo = new PdfDocInfo()
        {
            DocFilePath = fileName,
            TotalPages = document.NumberOfPages,
            Version = document.Version,
            Title = document.Information.Title,
            Subject = document.Information.Subject,
            Author = document.Information.Author,
            DateCreated = dateCreated,
            DateModified = dateModified,
        };

        string docText = "";
        string pattern = @"(?<=['""A-Za-z0-9][\.\!\?])\s+(?=[A-Z])";

        foreach (Page page in document.GetPages())
        {
            docText += ContentOrderTextExtractor.GetText(page, true);
        }

        // At this point docText is empty because GetText returns an empty string for every page.
    }

Any remedy for this?

BobLd commented 2 months ago

@securigy, can you provide the exact PDF you used (generated from the HTML page, I assume)?