euske / pdfminer

Python PDF Parser (Not actively maintained). Check out pdfminer.six.
https://github.com/pdfminer/pdfminer.six
MIT License

LTPage _objs empty #295

Open fulganta opened 4 years ago

fulganta commented 4 years ago

Hi,

I have an issue with a specific kind of PDF that returns no data: the LTPage returned by device.get_result() has an empty _objs list. I have been using the code below for quite a while, but for this kind of PDF the result is empty. The PDF does not seem to be image-only, since I can use Acrobat Reader to convert it to text. I was wondering whether there is something I can change in LAParams or the PDFResourceManager. Please let me know your email and I can send you the PDF; I don't want to post it here.

from pdfminer.layout import LAParams, LTTextBoxHorizontal
from pdfminer.converter import PDFPageAggregator
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage

pdf_name = 'MyPdf.pdf'

pdf_str = ''
# Create the resource manager shared by the device and the interpreter.
rsrcmgr = PDFResourceManager()
# Set parameters for layout analysis.
laparams = LAParams(detect_vertical=True, all_texts=True)
# Create a PDF page aggregator object.
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)

# Open the file in a with-block so it is closed after processing.
with open(pdf_name, 'rb') as document:
    for page in PDFPage.get_pages(document):
        interpreter.process_page(page)
        # Receive the LTPage object for the page.
        layout = device.get_result()
        # Keep only the text of top-level horizontal text boxes.
        pdf_str += ''.join(
            element.get_text().lower()
            for element in layout
            if isinstance(element, LTTextBoxHorizontal))

print(pdf_str)

fulganta commented 4 years ago

Sorry, a few more points: I tried different combinations of LAParams values, and none of them returns anything. These are the values I currently get in device.result, which is my LTPage object:

_objs = <class 'list'>: []
bbox = <class 'tuple'>: (0, 0, 595, 842)
groups = None
height = 842
pageid = 1
rotate = 0
width = 595
x0 = 0
x1 = 595
y0 = 0
y1 = 842
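
To double-check that the page really is empty, and not just missing top-level LTTextBoxHorizontal objects, a small recursive counter over the layout tree may help. This is only a sketch: count_layout_types is a hypothetical helper name, and it assumes that layout containers can be iterated while leaf objects cannot. It is duck-typed, so the demo below uses stand-in classes (Page, Figure, Char are made up for illustration) rather than real pdfminer objects.

```python
from collections import Counter


def count_layout_types(obj, counts=None):
    """Recursively count objects in a layout tree by class name.

    Hypothetical helper: assumes containers are iterable and leaves are not.
    """
    if counts is None:
        counts = Counter()
    counts[type(obj).__name__] += 1
    # Never descend into strings: iterating them yields more strings forever.
    if isinstance(obj, (str, bytes)):
        return counts
    try:
        children = iter(obj)
    except TypeError:
        # Not iterable, so treat it as a leaf.
        return counts
    for child in children:
        count_layout_types(child, counts)
    return counts


# Stand-in classes for the demo only; they are not pdfminer classes.
class Page:
    def __init__(self, *objs):
        self._objs = list(objs)

    def __iter__(self):
        return iter(self._objs)


class Figure(Page):
    pass


class Char:
    pass


empty_counts = count_layout_types(Page())
nested_counts = count_layout_types(Page(Figure(Char(), Char())))
print(dict(empty_counts))
print(dict(nested_counts))
```

In the script above, the call would be count_layout_types(layout) right after device.get_result(). If the only entry is the page itself, as the empty _objs list suggests, then the interpreter probably produced no objects at all for that page, and changing LAParams is unlikely to help, since layout analysis only groups objects that already exist.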