euske / pdfminer

Python PDF Parser (Not actively maintained). Check out pdfminer.six.
https://github.com/pdfminer/pdfminer.six
MIT License
5.24k stars 1.13k forks

Is there a way to read a page without reading the whole file? #215

Open mphilli opened 6 years ago

mphilli commented 6 years ago

I have a 225 MB PDF file with lots of charts, graphs, and images, and it takes a very long time to parse the entire file. The thing is, I only need data from the first 5 pages of the file. In PyPDF2, this is accomplished like this:

import PyPDF2

pdf_file = open(file, 'rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)

for i in range(0, 5):
    page = read_pdf.getPage(i)
    page_content = page.extractText()

This gets the page content of the first 5 pages in under a second. I tried to replicate this with pdfminer by doing:

for page in PDFPage.get_pages(pdf_file, pagenos=[0, 1, 2, 3, 4]):

But the program stops here for what feels like a decade. This tells me it isn't just getting the first 5 pages, but is first parsing the entire file, and then getting the pages I need. Is there any way to just get the pages as I would like to do here? If not, I would suggest it as a feature.

I would just use PyPDF2, but pdfminer is much more accurate at extracting text.

abhishek-jain-infrrd commented 6 years ago

One option is to combine pdfminer and PyPDF2: use PyPDF2 to copy the first five pages into a new PDF, then parse that smaller PDF with pdfminer.
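A minimal sketch of that approach, under stated assumptions: it uses the PyPDF2 API shown earlier in this thread (PyPDF2 before 3.0) and `extract_text` from pdfminer.six. The names `pages_to_copy` and `text_of_first_pages` are made up for illustration, and nothing here has been timed against the 225 MB file.

```python
def pages_to_copy(total_pages, wanted):
    """Return the page indices to copy, clamped to the document length."""
    return range(min(wanted, total_pages))


def text_of_first_pages(src_path, dst_path, wanted=5):
    """Copy the first `wanted` pages to dst_path, then extract their text."""
    # Third-party imports are local so pages_to_copy works without them.
    # Assumptions: PyPDF2 < 3.0 (the API used in this thread) and pdfminer.six.
    from PyPDF2 import PdfFileReader, PdfFileWriter
    from pdfminer.high_level import extract_text

    with open(src_path, 'rb') as src:
        reader = PdfFileReader(src)
        writer = PdfFileWriter()
        for i in pages_to_copy(reader.getNumPages(), wanted):
            writer.addPage(reader.getPage(i))
        # Write while src is still open: PyPDF2 reads page data lazily.
        with open(dst_path, 'wb') as dst:
            writer.write(dst)

    # pdfminer now only has to parse the small file.
    return extract_text(dst_path)
```

Usage would look like `text = text_of_first_pages('big.pdf', 'first5.pdf')`, where both file names are placeholders.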


TomOfHelatrobus commented 5 years ago

This is for pdfminer (I never found PyPDF2 to take too much time). Try something like this, with maxpages set to 5:

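from io import StringIO
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

# Setup not shown in the original snippet; assumes pdfminer.six on Python 3.
rsrcmgr = PDFResourceManager()
output = StringIO()
device = TextConverter(rsrcmgr, output, laparams=LAParams())
interpreter = PDFPageInterpreter(rsrcmgr, device)

with open(filepath, 'rb') as filein:
    password = ""
    maxpages = 5      # stop after the fifth page
    caching = True
    pagenos = set()   # empty set: no page filter
    for page in PDFPage.get_pages(filein, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)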
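Why maxpages might help where pagenos alone did not: as far as I can tell from reading pdfminer's `PDFPage.get_pages`, pages outside pagenos are skipped but the loop keeps walking the rest of the document, and only maxpages breaks out early. The toy model below imitates that loop with plain integers. It is not pdfminer code, `get_pages_model` and `visited` are made-up names, and it says nothing about the time spent opening the document in the first place.

```python
def get_pages_model(total_pages, pagenos=None, maxpages=0, visited=None):
    """Toy stand-in for the page loop; yields page numbers, not page objects."""
    for pageno in range(total_pages):
        if visited is not None:
            visited.append(pageno)  # record every page the loop touches
        if pagenos and pageno not in pagenos:
            continue  # skipped, but the loop moves on to the next page
        yield pageno
        if maxpages and maxpages <= pageno + 1:
            break  # early exit: later pages are never touched


seen_filter, seen_max = [], []
first_by_filter = list(get_pages_model(1000, pagenos={0, 1, 2, 3, 4}, visited=seen_filter))
first_by_max = list(get_pages_model(1000, maxpages=5, visited=seen_max))

# Same five pages either way, but very different amounts of work.
print(len(seen_filter), len(seen_max))  # prints: 1000 5
```

If that reading is right, passing both pagenos and a matching maxpages would give the filter plus the early exit.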