Open vincenzo84 opened 2 years ago
You need a coordinate viewer to see why that value may be wrong. The sample at https://github.com/christian-vigh-phpclasses/PdfToText/blob/master/examples/text-capture/sample-report.pdf supposedly produces this result:
[Page : 1, width = 596, height = 843]
[x:248.76, y:760.4, w: 79.895, h:12]REPORT HEADER
[x:70.695, y:746.6, w: 2.381, h:12]
[x:84.495, y:722.6, w: 2.381, h:12]
[x:84.495, y:734.6, w: 2.381, h:12]
[x:0, y:708.08, w: 124.619, h:12]Column1 Column2 Column3
[x:70.8, y:690.32, w: 76.99, h:12]L1C1 L1C2 L1C3
Note that in this example the second-to-last row reports x:0, which does not look correct, especially if w:124.619 is supposed to be right as well.
The PDF file reports MediaBox[0 0 595.28 841.89]/Rotate 0, so there appears to be some odd rounding up to integers: the page should be width 595, height 842, not 596 by 843.
The first text block is supposedly at this point, and we can agree it is 12 points high (h:12):
/R8 12 Tf 1.00055 0 0 1 248.76 760.24 Tm [<01>-2.64015<02>1.17611<03>-3.54053<04>2.56451<01>-2.64015<05>1.17611<06>.136644<07>2.56451<02>1.17611(\b)2.56451(\t)2.56451<02>1.17611<01>-2.64014<06>] TJ
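As a quick check of the page-size arithmetic, here is a minimal Python sketch. It only shows what ordinary rounding and ceiling give for the MediaBox in this file; how the library actually arrives at 596 by 843 is not shown in this thread, and the variable names are illustrative.

```python
import math

# MediaBox as reported by the PDF file: [x0 y0 x1 y1]
media_box = (0, 0, 595.28, 841.89)

width = media_box[2] - media_box[0]
height = media_box[3] - media_box[1]

# Two common ways of turning the size into integers
rounded = (round(width), round(height))
ceiled = (math.ceil(width), math.ceil(height))

print("rounded:", rounded)
print("ceiled: ", ceiled)
```

Neither result matches the reported width = 596, height = 843, so the tool seems to be doing something beyond a plain round or ceiling.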
We can also see an odd scale factor (1.00055) that is going to upset scaled calculations, and the text is not stored as normal characters. It is mapped thus: <01>=R <02>=E <03>=P <04>=O <05>=T <06>=" " (and, reading on in the same way, <07>=H (\b)=A (\t)=D), so the string spells out REPORT" "HEADER" ".
We also see that each letter carries a tweaking factor that jiggles its position (kerning), but those values are too erratic to be used for the length of the string. The best we can accept are the start values and the height; the width is unlikely to be of much value, especially with the white space included after each sub-part of the string.
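To make the mapping concrete, here is a minimal Python sketch that decodes the TJ array from the header line. The glyph table is read off the example by hand (the entries for H, A and D are inferred from the spelled-out result); the font's real encoding table is not shown in this thread, so treat the table and the helper name `decode_tj` as assumptions.

```python
import re

# Glyph codes read off the example; an assumption, not the font's real table.
GLYPHS = {0x01: "R", 0x02: "E", 0x03: "P", 0x04: "O", 0x05: "T",
          0x06: " ", 0x07: "H", 0x08: "A", 0x09: "D"}

# Escapes used in PDF literal strings, e.g. (\b) is the single byte 0x08.
ESCAPES = {"b": 0x08, "t": 0x09, "n": 0x0A, "f": 0x0C, "r": 0x0D}

# A token is a hex string <..>, a one-escape literal (\x), or a number.
TOKEN = re.compile(r"<([0-9A-Fa-f]+)>|\(\\([btnfr])\)|(-?\d*\.?\d+)")

def decode_tj(array_body, glyphs):
    """Return the text of a TJ array body, ignoring the position numbers."""
    text = []
    for hex_code, escape, _number in TOKEN.findall(array_body):
        if hex_code:
            # Assumes an even number of hex digits, one byte per glyph.
            text.extend(glyphs[b] for b in bytes.fromhex(hex_code))
        elif escape:
            text.append(glyphs[ESCAPES[escape]])
        # Numbers only adjust position; they carry no characters.
    return "".join(text)

HEADER_TJ = (r"<01>-2.64015<02>1.17611<03>-3.54053<04>2.56451<01>-2.64015"
             r"<05>1.17611<06>.136644<07>2.56451<02>1.17611(\b)2.56451"
             r"(\t)2.56451<02>1.17611<01>-2.64014<06>")

decoded = decode_tj(HEADER_TJ, GLYPHS)
print(repr(decoded))
```

The sketch only handles the token shapes that occur in this example; a general parser would also need multi-character literal strings, octal escapes and nested parentheses.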
So why is the odd row in that block of text at x:0, when it should be as well defined as the others? The answer is that its position is relative to all the previous widths, so it has no absolute value for x. But why does the next row show a reasonable x:70.8? That is because the string starts at a fresh absolute (not relative) location: x:70.8 y:690.16 is where we find L1C1 L1C2 L1C3,
where <06> is still " " and (\n)=L <0B>=1 (\f)=C (\r)=2 <0E>=3.
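The difference between absolute and relative placement can be sketched in Python as follows. This is a simplified tracker, not the library's code: it handles only the Tm operator (absolute) and the Td operator (relative to the start of the current line), and only unrotated matrices. The operator sequence is hypothetical, since the operators for the Column1 row are not shown in this thread.

```python
def track_line_origins(ops):
    """Return (operator, raw x/y operands, absolute origin) per positioning op.

    Simplified: only Tm and Td, and only matrices with b = c = 0.
    """
    a, d, e, f = 1.0, 1.0, 0.0, 0.0   # scale and origin of the line matrix
    out = []
    for operator, operands in ops:
        if operator == "Tm":
            # Tm replaces the matrix: the operands ARE the absolute origin.
            a, _b, _c, d, e, f = operands
        elif operator == "Td":
            # Td moves relative to the start of the current line.
            tx, ty = operands
            e, f = e + tx * a, f + ty * d
        else:
            continue
        out.append((operator, tuple(operands[-2:]), (e, f)))
    return out

# Hypothetical sequence: one absolute placement, then a relative line move.
ops = [
    ("Tm", (1.00055, 0, 0, 1, 70.8, 720.0)),
    ("Td", (0, -12.0)),
]
result = track_line_origins(ops)
for operator, raw, absolute in result:
    print(operator, "raw operands:", raw, "absolute origin:", absolute)
```

A tool that reports the raw operands of the relative move would print x = 0 for the second line even though the text actually starts at x = 70.8, which is consistent with the x:0 row seen above.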
That's right: there is very little to gain from trying to reverse a PDF with a logical methodology, since its content is a stack-based language.
BT
/R8 12 Tf
1.00055 0 0 1 70.8 690.16 Tm
[(\n)11.1694<0B>.274507(\f)-2.63954<0B>.274792<06>-10264.2(\n)11.1694<0B>.274507(\f)-2.64015(\r).274792<06>-10274.2(\n)11.1706<0B>.275727(\f)-2.64015<0E>.274792<06>] TJ
ET
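The numbers inside the TJ array above can be turned into page distances with a short Python sketch. It relies on the PDF specification's rule that a TJ number is expressed in thousandths of a text-space unit and is subtracted from the current position, so negative values move to the right. It assumes default horizontal scaling (no Tz operator appears in the excerpt) and leaves out glyph widths, which live in the font and are not shown here; the function name `tj_shift` is illustrative.

```python
FONT_SIZE = 12.0    # from "/R8 12 Tf"
MATRIX_A = 1.00055  # horizontal scale, first entry of the Tm matrix

def tj_shift(adjustment):
    """Horizontal shift in page units caused by one TJ number.

    The number is in thousandths of text space and is subtracted,
    so a negative adjustment moves the next glyph to the right.
    """
    return -adjustment / 1000.0 * FONT_SIZE * MATRIX_A

# The two large values that follow the <06> space glyphs
column_gaps = [tj_shift(v) for v in (-10264.2, -10274.2)]

# A few of the small per-letter (kerning) values from the same array
kerning = [tj_shift(v) for v in (11.1694, 0.274507, -2.63954, 0.274792)]

print("column gaps:", [round(g, 2) for g in column_gaps])
print("kerning:    ", [round(k, 3) for k in kerning])
```

Read this way, the per-letter values shift glyphs by small fractions of a point, while each of the two large values moves the next column roughly 123 points to the right. Those two gaps alone already exceed the reported w:76.99, which supports treating the reported width with caution.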
So, in summary, PDF is not the easiest way to define blobs of ink (glyphs), and trying to measure relative offsets in poorly defined strings of text content is prone to errors.
I need help, I'm going crazy with this problem.
explains how to get the coordinates from a PDF document, I ran this code:
This code returns me a series of lines with their coordinates, such as:
Now I would like to consider the line with the "TEST 5"
To do this I created the following xml file (test.xml):
What is not clear to me is how to obtain the value of the "right" attribute. In any case, when I run the script below together with the XML file specified above, I do not get any results.
I cannot understand where I am wrong. Thanks for any invaluable help. Greetings