Open jos1337 opened 7 years ago
The latest post on pdfhummus.com covers text extraction, and it includes figuring out the absolute position of each
thank you, works really nice :)
I'm trying to use the text-extraction code to find the absolute positions of text blobs. For some PDFs the logic works as expected, but for documents like this PDF I get nonsensical BBox data that places all the text way off the page, e.g.:
{
"text": "US ",
"matrix": [
7636.114266623998,
0,
0,
11217.188543385599,
33274.36791681407,
615403.0079999999
],
"localBBox": [
0,
-0.157,
0,
0.629
],
"globalBBox": [
33274.36791681407,
613641.9093986884,
33274.36791681407,
622458.6195937895
]
}
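For reference, the numbers above look internally consistent: the globalBBox is what you get by applying the matrix to the localBBox corners as an ordinary PDF affine transform `[a, b, c, d, e, f]`. That suggests the box arithmetic is fine and the matrix itself (scale factors in the thousands) is where things go wrong. Here is a minimal sketch of that check; `transformPoint` and `transformBox` are hypothetical helper names written for this illustration, not part of the hummus API, and the box is assumed to be `[left, bottom, right, top]`.

```javascript
// Apply a PDF-style affine matrix [a, b, c, d, e, f] to the point (x, y):
//   x' = a*x + c*y + e
//   y' = b*x + d*y + f
function transformPoint(m, x, y) {
  const [a, b, c, d, e, f] = m;
  return [a * x + c * y + e, b * x + d * y + f];
}

// Transform all four corners of [left, bottom, right, top] and take the
// extents, so the result stays axis-aligned even if the matrix rotates.
function transformBox(m, box) {
  const [left, bottom, right, top] = box;
  const corners = [
    [left, bottom],
    [right, bottom],
    [right, top],
    [left, top],
  ].map(([x, y]) => transformPoint(m, x, y));
  const xs = corners.map((p) => p[0]);
  const ys = corners.map((p) => p[1]);
  return [Math.min(...xs), Math.min(...ys), Math.max(...xs), Math.max(...ys)];
}

// Values copied from the extraction output above.
const matrix = [
  7636.114266623998, 0, 0, 11217.188543385599,
  33274.36791681407, 615403.0079999999,
];
const localBBox = [0, -0.157, 0, 0.629];

const globalBBox = transformBox(matrix, localBBox);
console.log(globalBBox);
```

The result agrees with the reported globalBBox up to floating-point rounding. Note also that the localBBox has zero width (left and right are both 0), which is a separate oddity from the oversized matrix.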
If I analyze the same document using PDF.js to extract text objects, I get sensible position data, e.g.:
{
"x": 59.759999850599996,
"y": 65.51999999999998,
"str": "US ",
"dir": "ltr",
"width": 18.2399999544,
"height": 188.08163171265306,
"fontName": "Courier"
},
Do you have any idea why the hummus text-extraction logic is reporting incorrect position data in documents like the attached?
I'm trying to find a string in a PDF and locate its absolute position on a page. Is this possible?