I'm using extractText() to create "text copies" of some online PDFs. In parts of the document, it appears that all of the text EXCEPT the page numbers is pulled.
This unexpected behavior appears to occur consistently in (and, I believe, only in) sections made up of older PDF documents inserted as appendices. These sections are pages from older documents, but with new pagination added to identify each page's location in the appendix.
Here's sample code. Pages 24, 26, and 123 show their page numbers -- S-25, A-1, and C-2, respectively -- at the end of the page. Pages 28 and 53 (i.e. B-2 and B-27) do not show the page numbers from the PDF.
import urllib.request
import PyPDF2
# page containing PDF
url = "https://structuredginniemaes.ginnienet.com/RemicDB/deal/2014/097/GNMA-2014-097-@OCS.PDF"
# select the following pages: [S-25, A-1, B-2, B-27, C-2]
samplePages = [24, 26, 28, 53, 123]
# pull PDF
urllib.request.urlretrieve(url, "remoteFile")
pdfFile = PyPDF2.PdfFileReader("remoteFile", strict = False)
# begin extracting selected pages
print('-----------\n')
for pageNumber in samplePages:
    print('PG #:', pageNumber, '\n')
    pgN = pdfFile.getPage(pageNumber).extractText()
    # drop non-ASCII characters so the text prints cleanly on the Windows console
    pgN = pgN.encode('ascii', 'ignore').decode('ascii')
    print(pgN, '\n-----------\n')
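To make the symptom easier to check than eyeballing the printed output, here is a minimal sketch of a helper that reports which pages are missing their label. The function name `pages_missing_label` and its arguments are hypothetical (not part of PyPDF2), and it assumes the page label, when extracted, is the last non-whitespace text on the page, which is what I see on the pages that work.

```python
def pages_missing_label(extracted, expected_labels):
    """Return the page indices whose extracted text does not end with the expected label.

    extracted:       dict mapping page index -> text returned by extractText()
    expected_labels: dict mapping page index -> label printed on that page (e.g. "B-2")
    Assumption: the label is the last non-whitespace text on the page.
    """
    missing = []
    for index, label in expected_labels.items():
        text = extracted.get(index, "")
        # remove all whitespace so a label broken up by newlines or spaces still matches
        compact = "".join(text.split())
        if not compact.endswith(label):
            missing.append(index)
    return sorted(missing)
```

With the sample pages above, `expected_labels` would be `{24: "S-25", 26: "A-1", 28: "B-2", 53: "B-27", 123: "C-2"}` and `extracted` would hold the `pgN` string for each page index.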
Here is some system information.
PyPDF2 Version: 1.23
Python Version: '3.4.1 |Anaconda 2.0.1 (64-bit)| (default, Jun 11 2014, 17:27:11) [MSC v.1600 64 bit (AMD64)]'
OS: Windows 7
Thanks!