joelostblom closed this 4 years ago
As the Python binding simply calls the C++ methods, you may be better off asking on the poppler mailing list. Maybe the rect uses another coordinate system?
Page.text() uses point coordinates: https://people.freedesktop.org/~aacid/docs/qt5/classPoppler_1_1Page.html#a6a9b966d69e2f1adc6654f42388a1e74
It's not clear to me which coordinates Annotation.boundary() is using. It will probably take some trial and error, I'd guess :-)
See also: https://stackoverflow.com/questions/21050551/extracting-text-from-highlighted-annotations-in-a-pdf-file where the first answer gives a solution. Indeed, the coordinate scaling is the issue.
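To make the scaling step concrete, here is a minimal sketch in plain Python, assuming the quad corners are normalized to the 0..1 range and the page size is given in points. The helper name `quad_to_point_rect` and the sample page size of 612 x 792 are made up for illustration; they are not part of poppler or its bindings.

```python
def quad_to_point_rect(quad_points, page_width, page_height):
    """Scale normalized (0..1) quad corners to point coordinates.

    Returns (left, top, right, bottom). Taking min/max over all four
    corners means the result does not depend on the corner order.
    """
    xs = [x * page_width for x, _ in quad_points]
    ys = [y * page_height for _, y in quad_points]
    return (min(xs), min(ys), max(xs), max(ys))


# Hypothetical highlight covering the middle of a 612 x 792 pt page.
corners = [(0.25, 0.5), (0.75, 0.5), (0.75, 0.625), (0.25, 0.625)]
print(quad_to_point_rect(corners, 612, 792))
# (153.0, 396.0, 459.0, 495.0)
```

With popplerqt5 the corners could presumably be built as `[(quad.points[i].x(), quad.points[i].y()) for i in range(4)]`, and the resulting tuple unpacked into `QRectF.setCoords()` before calling `page.text()`.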
Thank you for replying and finding that SO answer! Modifying accordingly worked:
import popplerqt5
from PyQt5 import QtCore

doc = popplerqt5.Poppler.Document.load('./test1.pdf')
page = doc.page(0)
# Page size in points, used to scale the normalized quad coordinates.
pwidth = page.pageSize().width()
pheight = page.pageSize().height()

for annot in page.annotations():
    if annot.contents():
        print(annot.contents())
    if isinstance(annot, popplerqt5.Poppler.HighlightAnnotation):
        quads = annot.highlightQuads()
        for quad in quads:
            # Scale two opposite corners of the quad to point coordinates,
            # which is what page.text() expects.
            rect_coords = (quad.points[0].x() * pwidth,
                           quad.points[0].y() * pheight,
                           quad.points[2].x() * pwidth,
                           quad.points[2].y() * pheight)
            rect = QtCore.QRectF()
            rect.setCoords(*rect_coords)
            rect_txt = page.text(rect)
            if rect_txt == '':
                print('---')
            else:
                print(f'== highlighted text: {rect_txt}')
Out:
highlight note 1
== highlighted text: working
highlight note 2
== highlighted text: not working
== highlighted text: Some more text
note on page
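In the output above, "highlight note 2" yields two separate fragments, one per quad. If a single string per annotation is preferred, the per-quad results could be collected and joined afterwards. This is only a sketch; `join_fragments` is a hypothetical helper, not something poppler provides, and joining with a single space is an assumption about the desired result.

```python
def join_fragments(fragments):
    """Join per-quad text fragments into one string, skipping empty ones."""
    cleaned = [f.strip() for f in fragments if f and f.strip()]
    return ' '.join(cleaned)


# Fragments as they might come back from page.text() for each quad.
print(join_fragments(['not working', '', ' Some more text ']))
# not working Some more text
```

In the loop, this would mean appending each `rect_txt` to a list per annotation and printing `join_fragments(...)` once after the inner loop finishes.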
I am trying to extract highlighted text from a PDF (that is, text that sits underneath a highlight annotation). When I pass the annotation boundary rectangle to page.text(), no text is returned. Am I missing something?
test1.pdf