locate absolute position of item

jos1337 commented 7 years ago

I'm trying to find a string in a PDF and locate its absolute position on the page. Is this possible?

galkahana commented 7 years ago

The latest post on pdfhummus.com covers text extraction, including how to work out the absolute position of each piece of extracted text.

jos1337 commented 7 years ago

thank you, works really nice :)

chadkirby commented 6 years ago

I'm trying to use the text-extraction code to find the absolute positions of text blobs. In some PDFs the logic works as expected, but in documents like this PDF I'm getting nonsensical BBox data that positions all the text way off the page, e.g.:

{
  "text": "US ",
  "matrix": [
    7636.114266623998,
    0,
    0,
    11217.188543385599,
    33274.36791681407,
    615403.0079999999
  ],
  "localBBox": [
    0,
    -0.157,
    0,
    0.629
  ],
  "globalBBox": [
    33274.36791681407,
    613641.9093986884,
    33274.36791681407,
    622458.6195937895
  ]
}
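
For reference, the globalBBox above looks like nothing more than the localBBox pushed through the reported matrix, read as a standard PDF affine matrix [a, b, c, d, e, f]. The sketch below is not the text-extraction sample's own code; `applyMatrix` and `transformBox` are hypothetical helper names used only to reproduce the arithmetic from the numbers in this dump.

```javascript
// Apply a PDF-style affine matrix [a, b, c, d, e, f] to a point (x, y):
//   x' = a*x + c*y + e
//   y' = b*x + d*y + f
function applyMatrix(m, [x, y]) {
  const [a, b, c, d, e, f] = m;
  return [a * x + c * y + e, b * x + d * y + f];
}

// Transform all four corners of a [x1, y1, x2, y2] box and take the
// axis-aligned bounds of the result (hypothetical helper, for illustration).
function transformBox(box, m) {
  const [x1, y1, x2, y2] = box;
  const corners = [
    [x1, y1],
    [x2, y1],
    [x2, y2],
    [x1, y2],
  ].map((p) => applyMatrix(m, p));
  const xs = corners.map((p) => p[0]);
  const ys = corners.map((p) => p[1]);
  return [Math.min(...xs), Math.min(...ys), Math.max(...xs), Math.max(...ys)];
}

// Values copied from the dump above.
const matrix = [
  7636.114266623998, 0, 0, 11217.188543385599,
  33274.36791681407, 615403.0079999999,
];
const localBBox = [0, -0.157, 0, 0.629];

const globalBBox = transformBox(localBBox, matrix);
console.log(globalBBox);
```

If that reading is right, the box arithmetic is consistent with its inputs, and the off-page coordinates come from the matrix itself: the scale factors (about 7636 and 11217) and the translation are already far larger than any page dimension before the box is applied. The local box also has zero width (both x values are 0), which may be a separate symptom.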

If I analyze the same document using PDF.js to extract text objects, I get sensible position data, e.g.:

        {
          "x": 59.759999850599996,
          "y": 65.51999999999998,
          "str": "US ",
          "dir": "ltr",
          "width": 18.2399999544,
          "height": 188.08163171265306,
          "fontName": "Courier"
        },

Do you have any idea why the hummus text-extraction logic is reporting incorrect position data in documents like the attached?

galkahana / HummusJS

locate absolute position of item #161