galkahana / HummusJS

Node.js module for high performance creation, modification and parsing of PDF files and streams
http://www.pdfhummus.com

locate absolute position of item #161

Open jos1337 opened 7 years ago

jos1337 commented 7 years ago

I'm trying to find a string in a PDF and locate its absolute position on a page. Is this possible?

galkahana commented 7 years ago

The latest post on pdfhummus.com covers text extraction, including how to figure out the absolute position of each item.

jos1337 commented 7 years ago

thank you, works really nice :)

chadkirby commented 6 years ago

I'm trying to use the text-extraction code to find the absolute positions of text blobs. In some PDFs the logic works as expected, but in documents like this PDF I'm getting nonsensical BBox data that positions all the text far off the page, e.g.:

{
  "text": "US ",
  "matrix": [
    7636.114266623998,
    0,
    0,
    11217.188543385599,
    33274.36791681407,
    615403.0079999999
  ],
  "localBBox": [
    0,
    -0.157,
    0,
    0.629
  ],
  "globalBBox": [
    33274.36791681407,
    613641.9093986884,
    33274.36791681407,
    622458.6195937895
  ]
}

If I analyze the same document using PDF.js to extract text objects, I get sensible position data, e.g.:

{
  "x": 59.759999850599996,
  "y": 65.51999999999998,
  "str": "US ",
  "dir": "ltr",
  "width": 18.2399999544,
  "height": 188.08163171265306,
  "fontName": "Courier"
}

Do you have any idea why the hummus text-extraction logic is reporting incorrect position data in documents like the attached?
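
For anyone comparing the numbers, the hummus output quoted above appears to be internally consistent: applying the reported `matrix` to `localBBox` reproduces `globalBBox`. The sketch below checks this. It assumes `matrix` uses the usual PDF `[a, b, c, d, e, f]` layout and that the boxes are `[left, bottom, right, top]`. The helper names `transformPoint` and `transformBox` are hypothetical and are not part of HummusJS or the blog post's extraction code.

```javascript
// Apply a PDF-style affine matrix [a, b, c, d, e, f] to the point (x, y).
function transformPoint(matrix, x, y) {
  const [a, b, c, d, e, f] = matrix;
  return [a * x + c * y + e, b * x + d * y + f];
}

// Transform a [left, bottom, right, top] box and return the axis-aligned
// bounds of its four transformed corners.
function transformBox(matrix, box) {
  const [x1, y1, x2, y2] = box;
  const corners = [
    transformPoint(matrix, x1, y1),
    transformPoint(matrix, x2, y1),
    transformPoint(matrix, x2, y2),
    transformPoint(matrix, x1, y2),
  ];
  const xs = corners.map((p) => p[0]);
  const ys = corners.map((p) => p[1]);
  return [Math.min(...xs), Math.min(...ys), Math.max(...xs), Math.max(...ys)];
}

// Values copied from the hummus output quoted in this thread.
const matrix = [
  7636.114266623998, 0, 0, 11217.188543385599,
  33274.36791681407, 615403.0079999999,
];
const localBBox = [0, -0.157, 0, 0.629];

const globalBBox = transformBox(matrix, localBBox);
console.log(globalBBox);
```

If this reading is right, the box arithmetic itself is not what goes wrong. The inputs already look unusual: the matrix has scale factors in the thousands and the local box has zero width (both x values are 0).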