UglyToad / PdfPig

Read and extract text and other content from PDFs in C# (port of PDFBox)
https://github.com/UglyToad/PdfPig/wiki
Apache License 2.0

New lines #726

Open Lesaje opened 1 year ago

Lesaje commented 1 year ago

    var result = new List<Invoice>();
    foreach (var file in directoryInfo.GetFiles("*.pdf"))
    {
        var text = "";
        using (PdfDocument document = PdfDocument.Open(file.FullName))
        {
            foreach (Page page in document.GetPages()) { text += page.Text; }
        }
        result.Add(new Invoice(file.Name, text));
    }
    return result;

    // Separate snippet: write the extracted text to a .txt file
    var result = new Invoice(file.Name.Replace(".pdf", ".txt"), text);
    File.WriteAllText(result.Path, result.Content);

When using this code, I get the following result: $10.00 USD due October 23, 2023Page 1 of 1Date of issueOctober 23, 2023Date dueOctober 23

So the line breaks are clearly being lost. Can that be fixed somehow?

HuwSy commented 11 months ago

I had to work around both the missing line breaks and the missing spaces to mirror PDFBox, as follows. This appears to work fine 99% of the time and only breaks down when a page has a very unusual layout order. The result ends up in the variable strPDFTXTOut.

    // 1px tolerance, as PDF positioning accuracy is not ideal
    var deviation = 1;
    // string to build the output into
    var strPDFTXTOut = string.Empty;
    var letters = page.Letters;
    // get the 1st letter for coordinates, sizes etc.
    // (assumes the page contains at least one letter)
    var lastLetter = letters[0];
    // each letter
    foreach (var letter in letters)
    {
        // calc difference in vertical and horizontal position to last letter
        var difY = letter.Location.Y - lastLetter.Location.Y;
        var difX = letter.Location.X - lastLetter.Location.X - lastLetter.Width;
        if (difY < -deviation || difY > deviation)
        {
            // more than 1px vertically away from the last letter: carriage return
            strPDFTXTOut += "\r\n";
        }
        else if (difX > deviation)
        {
            // more than 1px horizontally away from the end of the last letter: space
            strPDFTXTOut += " ";
        }
        // add this letter
        strPDFTXTOut += letter.Value;
        // save this letter as last letter
        lastLetter = letter;
    }
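The same gap heuristic can be tried out without a PDF at hand. The following is only a minimal sketch under stated assumptions: `Glyph` and `LineRebuilder` are hypothetical names invented for this illustration and are not part of PdfPig, and `Glyph` only carries the four values the workaround reads from each letter (value, X, Y, width). It also assumes the glyphs arrive in reading order, which is exactly the assumption that fails on oddly ordered layouts. A `StringBuilder` is swapped in for the repeated string concatenation.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for a positioned letter; not a PdfPig type.
public sealed record Glyph(string Value, double X, double Y, double Width);

public static class LineRebuilder
{
    // Rebuilds text from positioned glyphs. A vertical jump larger than
    // `tolerance` starts a new line; a horizontal gap larger than
    // `tolerance` after the previous glyph inserts a space.
    // Assumes the glyphs are already in reading order.
    public static string Rebuild(IReadOnlyList<Glyph> glyphs, double tolerance = 1.0)
    {
        if (glyphs.Count == 0)
        {
            return string.Empty; // empty page: nothing to rebuild
        }

        var sb = new StringBuilder();
        var last = glyphs[0];

        foreach (var glyph in glyphs)
        {
            var dy = glyph.Y - last.Y;                  // vertical offset from previous glyph
            var gap = glyph.X - (last.X + last.Width);  // space between previous glyph's end and this one

            if (Math.Abs(dy) > tolerance)
            {
                sb.Append("\r\n");
            }
            else if (gap > tolerance)
            {
                sb.Append(' ');
            }

            sb.Append(glyph.Value);
            last = glyph;
        }

        return sb.ToString();
    }
}
```

With real PdfPig letters, one possible mapping would be `new Glyph(l.Value, l.Location.X, l.Location.Y, l.Width)` for each `l` in `page.Letters`, matching the properties used in the workaround above. The tolerance of 1 is taken from that workaround and may need tuning for other font sizes.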

mayurjansari commented 11 months ago

If you want new lines, try:

    // ContentOrderTextExtractor lives in the
    // UglyToad.PdfPig.DocumentLayoutAnalysis.TextExtractor namespace
    var text = "";
    using (PdfDocument document = PdfDocument.Open(file.FullName))
    {
        foreach (Page page in document.GetPages())
        {
            text += ContentOrderTextExtractor.GetText(page);
        }
    }
    result.Add(new Invoice(file.Name, text));

EliotJones commented 9 months ago

See also