frescobaldi / python-poppler-qt5

Python binding to libpoppler-qt5

Extract highlighted text #36

Closed joelostblom closed 4 years ago

joelostblom commented 4 years ago

I am trying to extract highlighted text from a PDF (so text that is underneath a highlight annotation). When I pass the annotation boundary rectangle to page.text(), no text is returned. Am I missing something?

import popplerqt5

doc = popplerqt5.Poppler.Document.load('./test1.pdf')
page = doc.page(0)

for annot in page.annotations():
    if annot.contents():
        print(annot.contents())
        rect = annot.boundary()
        rect_txt = page.text(rect)
        if rect_txt == '':
            print('---')
        else:
            print(rect_txt)

Out:

highlight note 1
---
highlight note 2
---
note on page
---

test1.pdf

wbsoft commented 4 years ago

As the Python binding simply calls the C++ methods, you may be better off asking on the poppler mailing list. Maybe the rect uses another coordinate system?

Page.text() uses point coordinates: https://people.freedesktop.org/~aacid/docs/qt5/classPoppler_1_1Page.html#a6a9b966d69e2f1adc6654f42388a1e74

It's not clear to me what coordinates Annotation.boundary() is using. Some trial and error is needed, I'd guess :-)

wbsoft commented 4 years ago

See also: https://stackoverflow.com/questions/21050551/extracting-text-from-highlighted-annotations-in-a-pdf-file, whose first answer gives a solution. Indeed, the coordinate scaling is the issue.
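To illustrate the scaling issue, here is a minimal sketch in plain Python. It assumes the annotation coordinates are fractions of the page size (0 to 1), which is what the linked answer relies on, and that Page.text() wants the same rectangle in points. The function name `scale_to_points` and the page size used are hypothetical, chosen only for the example.

```python
def scale_to_points(norm_rect, page_width, page_height):
    """Scale a (left, top, right, bottom) rectangle from normalized
    0..1 page coordinates to point coordinates."""
    left, top, right, bottom = norm_rect
    # x values scale with the page width, y values with the page height.
    return (left * page_width, top * page_height,
            right * page_width, bottom * page_height)


# Hypothetical values: a 600 x 800 pt page and a normalized rectangle.
print(scale_to_points((0.25, 0.5, 0.75, 0.625), 600, 800))
```

With poppler-qt5, the page width and height would come from page.pageSize(), and the scaled values could be used to build the QRectF that is passed to page.text().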

joelostblom commented 4 years ago

Thank you for replying and finding that SO answer! Modifying the code accordingly worked:

import popplerqt5
from PyQt5 import QtCore

doc = popplerqt5.Poppler.Document.load('./test1.pdf')
page = doc.page(0)
pwidth = page.pageSize().width()
pheight = page.pageSize().height()

for annot in page.annotations():
    if annot.contents():
        print(annot.contents())
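    # Only highlight annotations carry quads marking the highlighted text.
    if isinstance(annot, popplerqt5.Poppler.HighlightAnnotation):
        quads = annot.highlightQuads()
        for quad in quads:
            # Quad points appear to be relative to the page size, so scale
            # them to the point coordinates that page.text() expects.
            rect_coords = (quad.points[0].x() * pwidth,
                           quad.points[0].y() * pheight,
                           quad.points[2].x() * pwidth,
                           quad.points[2].y() * pheight)
            rect = QtCore.QRectF()
            rect.setCoords(*rect_coords)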
            rect_txt = page.text(rect)
            if rect_txt == '':
                print('---')
            else:
                print(f'== highlighted text: {rect_txt}')

Out:

highlight note 1
== highlighted text:  working
highlight note 2
== highlighted text:  not working
== highlighted text: Some more text
note on page
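The code above treats points[0] and points[2] as opposite corners of each quad. A slightly more defensive variant, sketched below in plain Python, takes the minimum and maximum over all four corners so that the corner order does not matter. It assumes each quad is given as four (x, y) pairs in normalized page coordinates. The helper name `quad_to_rect` and the sample values are hypothetical.

```python
def quad_to_rect(corners, page_width, page_height):
    """Return (left, top, right, bottom) in points for a quad given as
    four (x, y) corners in normalized 0..1 page coordinates."""
    xs = [x * page_width for x, _ in corners]
    ys = [y * page_height for _, y in corners]
    # The bounding box does not depend on the order of the corners.
    return (min(xs), min(ys), max(xs), max(ys))


# Hypothetical quad on a 600 x 800 pt page, corners in arbitrary order.
corners = [(0.75, 0.625), (0.25, 0.625), (0.25, 0.5), (0.75, 0.5)]
print(quad_to_rect(corners, 600, 800))
```

The resulting tuple can be unpacked into rect.setCoords() in the same way as rect_coords in the code above.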