colemana / PyPDF2

A utility to read and write pdfs with Python. Superseded: see https://github.com/knowah/PyPDF2

extractText() not extracting page number text from particular document sections #10

Closed msb236 closed 10 years ago

msb236 commented 10 years ago

I'm using extractText() to create "text copies" of some online PDFs. In parts of the document, it appears that all of the text EXCEPT the page numbers is extracted.

This unexpected behavior appears to occur consistently in (and I believe only in) sections made up of older PDF documents inserted as appendices. These sections are pages from older documents, with new pagination added to identify each page's location in the appendix.

Here's sample code. In the extracted text, pages 24, 26, and 123 show their page numbers (S-25, A-1, and C-2, respectively) at the end of the page. Pages 28 and 53 (i.e. B-2 and B-27) do not show the page numbers from the PDF.

import urllib.request 
import PyPDF2

# page containing PDF
url = "https://structuredginniemaes.ginnienet.com/RemicDB/deal/2014/097/GNMA-2014-097-@OCS.PDF"

# select the following pages: [S-25, A-1, B-2, B-27, C-2]
samplePages = [24, 26, 28, 53, 123]

# pull PDF
urllib.request.urlretrieve(url, "remoteFile")
pdfFile = PyPDF2.PdfFileReader("remoteFile", strict=False)

# begin extracting selected pages
print('-----------\n')
for pageNumber in samplePages:
    print('PG #:', pageNumber, '\n')
    pgN = pdfFile.getPage(pageNumber).extractText()
    pgN = pgN.encode('ascii', 'ignore').decode('ascii')
    print(pgN, '\n-----------\n')
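
The manual comparison described above could also be automated. The following is only a minimal sketch: the helper name find_missing_labels and the EXPECTED_LABELS mapping are hypothetical and not part of PyPDF2, and the labels are simply the ones listed for the sample pages. The text extractor is passed in as a callable, so the check does not depend on any particular PyPDF2 version.

```python
# Expected page labels, keyed by the page index passed to getPage() in the
# sample code. Hypothetical mapping built from the pages listed in this report.
EXPECTED_LABELS = {24: "S-25", 26: "A-1", 28: "B-2", 53: "B-27", 123: "C-2"}


def find_missing_labels(extract_page_text, expected=EXPECTED_LABELS):
    """Return the page indices whose extracted text lacks the expected label.

    extract_page_text is any callable mapping a page index to a string,
    for example: lambda n: pdfFile.getPage(n).extractText()
    """
    missing = []
    for page_index, label in sorted(expected.items()):
        text = extract_page_text(page_index) or ""
        # Drop whitespace so a label split by spaces or line breaks still matches.
        compact = "".join(text.split())
        if label not in compact:
            missing.append(page_index)
    return missing


if __name__ == "__main__":
    # Stand-in text for illustration only; real text would come from PyPDF2.
    fake_pages = {24: "... S-25", 26: "... A-1", 28: "...", 53: "...", 123: "... C-\n2"}
    print(find_missing_labels(fake_pages.get))
```

With the stand-in text, the call reports indices 28 and 53, matching the behavior described in this report. Note that this is a plain substring check, so a short label such as B-2 would also match unrelated text like B-20; treat a "found" result as a hint rather than proof.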

Here is some system information.

PyPDF2 Version: 1.23
Python Version: '3.4.1 |Anaconda 2.0.1 (64-bit)| (default, Jun 11 2014, 17:27:11) [MSC v.1600 64 bit (AMD64)]'
OS: Windows 7

Thanks!

msb236 commented 10 years ago

Apologies! I just noticed that PyPDF2 is now maintained by mstamy2. Should I close the issue here and repost it on mstamy2's fork? Thanks!

msb236 commented 10 years ago

I'll close this issue here and post it to the current fork.