Open Lesaje opened 1 year ago
I had to work around both the missing line breaks and the missing spaces to mirror PDFBox's output, as follows. This appears to work fine 99% of the time, failing only when the page has a very unusual layout order, and it builds the result in the variable strPDFTXTOut:
// 1px tolerance, as PDF positioning is not exact
var deviation = 1;
// string to build the output in
var strPDFTXTOut = string.Empty;
var letters = page.Letters;
// get the 1st letter for coordinates, sizes etc
var lastLetter = letters[0];
// each letter
foreach (var letter in letters)
{
    // calc difference in vertical and horizontal position to the last letter
    var difY = letter.Location.Y - lastLetter.Location.Y;
    var difX = letter.Location.X - lastLetter.Location.X - lastLetter.Width;
    if (difY < -deviation || difY > deviation)
    {
        // if the letter is more than
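A minimal, self-contained sketch of the position-based idea described above, for readers who want something they can run. It does not use PdfPig directly: `Glyph` is a hypothetical stand-in for the few letter members the snippet reads (`Value`, `Location.X`, `Location.Y`, `Width`), and `PositionalText.Build` is an illustrative name. How the original handles large gaps is not shown in this thread, so the rules here are assumptions: a newline when the vertical position shifts by more than `deviation`, a space when the horizontal gap exceeds it.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for the letter data used above; not a PdfPig type.
public record Glyph(string Value, double X, double Y, double Width);

public static class PositionalText
{
    // Rebuilds text from glyphs given in reading order.
    // Assumed rules: "\n" when the vertical position moves by more than
    // `deviation`, " " when the horizontal gap to the previous glyph
    // exceeds `deviation`.
    public static string Build(IReadOnlyList<Glyph> glyphs, double deviation = 1.0)
    {
        // Guard against empty pages; indexing [0] would throw otherwise.
        if (glyphs.Count == 0) return string.Empty;

        var sb = new StringBuilder();
        var last = glyphs[0];
        foreach (var g in glyphs)
        {
            var difY = g.Y - last.Y;
            var difX = g.X - last.X - last.Width;

            if (Math.Abs(difY) > deviation)
                sb.Append('\n');   // moved to a different line
            else if (difX > deviation)
                sb.Append(' ');    // visible gap on the same line

            sb.Append(g.Value);
            last = g;
        }
        return sb.ToString();
    }
}
```

With real PdfPig data, each `Glyph` would presumably be filled from `letter.Value`, `letter.Location.X`, `letter.Location.Y` and `letter.Width`. Like the original, this sketch depends on the letters arriving in reading order, so it is expected to misbehave on pages with an unusual layout order.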
If you want a new line, try.
var text = "";
using (PdfDocument document = PdfDocument.Open(file.FullName))
{
    foreach (Page page in document.GetPages())
    {
        text += ContentOrderTextExtractor.GetText(page);
    }
}
result.Add(new Invoice(file.Name, text));
When using this code, I get the following result:
$10.00 USD due October 23, 2023Page 1 of 1Date of issueOctober 23, 2023Date dueOctober 23
So there is clearly some problem with new lines. Could that be fixed somehow?