euske / pdfminer

Python PDF Parser (Not actively maintained). Check out pdfminer.six.
https://github.com/pdfminer/pdfminer.six
MIT License
5.25k stars 1.13k forks

Can't read PDF due to this warning: WARNING:pdfminer.layout:Too many boxes (102) to group, skipping. #202

Open lawofearth opened 6 years ago

lawofearth commented 6 years ago

Hello, I am using this code:

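    # Imports assume the pdfminer3k module layout, which matches the calls used below.
    from pdfminer.pdfparser import PDFParser, PDFDocument
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams, LTTextBox, LTTextLine

    def pdfread(fp):
        # Connect the parser and the document, then open it with an empty password.
        parser = PDFParser(fp)
        doc = PDFDocument()
        parser.set_document(doc)
        doc.set_parser(parser)
        doc.initialize('')

        # Set up layout analysis with custom character and word margins.
        rsrcmgr = PDFResourceManager()
        laparams = LAParams()
        laparams.char_margin = 1.0
        laparams.word_margin = 1.0
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        extracted_text = ''

        for page in doc.get_pages():
            interpreter.process_page(page)
            layout = device.get_result()

            # Collect text from the top-level text boxes and text lines on the page.
            for lt_obj in layout:
                if isinstance(lt_obj, (LTTextBox, LTTextLine)):
                    extracted_text += lt_obj.get_text()
        fp.close()

        return extracted_text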

It keeps showing this warning: WARNING:pdfminer.layout:Too many boxes (102) to group, skipping. The file is 10200112008r.pdf.

PS. I'm new to Python.

I think it is a layout issue, so I want to turn automatic layout analysis off, which is what the '-n' option does on the command line. However, I am not running this from the command line; I am running it in Jupyter. What should I do?

Best regards, Lyga

ChenBing-ML commented 4 years ago

https://github.com/jaepil/pdfminer3k/blob/master/pdfminer/layout.py

ChenBing-ML commented 4 years ago

[image] Just modify the source code yourself.

moun3imy commented 3 years ago

> [image] Just modify the source code yourself.

How did you solve this issue? I'm getting the same problem using slate3k. Thanks in advance ^^
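
For anyone who only needs the message out of their notebook output, here is a minimal sketch that uses nothing but the standard `logging` module. The logger name `pdfminer.layout` is taken from the `WARNING:pdfminer.layout:` prefix quoted in the original post; the helper name `quiet_layout_warnings` is made up for illustration. Note that this only hides the warning. It does not change the box-grouping limit or what the library does when the limit is hit, so it is not a replacement for the source change suggested above.

```python
import logging

# Logger name assumed from the "WARNING:pdfminer.layout:..." prefix in the report.
LAYOUT_LOGGER_NAME = "pdfminer.layout"


def quiet_layout_warnings(level=logging.ERROR):
    """Raise the threshold of the layout logger so WARNING records are dropped.

    Hypothetical helper: it hides the message only and leaves the
    library's grouping behavior untouched.
    """
    logger = logging.getLogger(LAYOUT_LOGGER_NAME)
    logger.setLevel(level)
    return logger


if __name__ == "__main__":
    quiet_layout_warnings()
    # From here on, warnings sent to this logger are suppressed,
    # while errors and above still get through.
    logging.getLogger(LAYOUT_LOGGER_NAME).warning("this is not shown")
```

Call `quiet_layout_warnings()` once in a notebook cell before running `pdfread`. If the installed package logs under a different name, the call has no effect and the name has to be adjusted.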