claird / PyPDF4

A utility to read and write PDFs with Python

pdf.getDocumentInfo().title sometimes None #59

Open clach04 opened 5 years ago

clach04 commented 5 years ago

Just found this fork/project after logging https://github.com/mstamy2/PyPDF3/issues/13. The test case below is for PyPDF4.

I've seen a number of PDF files where the title attribute/property is reported as None, yet accessing /Title directly returns content. I've no idea whether this is a problem with the PDF(s) or with PyPDF. There is a workaround (which may point to a potential change in PyPDF, but I'm unclear on what the correct fix is).

The attached PDF, title_bug.pdf, is about 5 MB and is a sample of a document that exhibits this behavior. I did not create it (nor do I know how it was created), so the only information we have is the metadata inside.

Test case, along with workaround below:

#!/usr/bin/env python
# -*- coding: windows-1252 -*-
# vim:ts=4:sw=4:softtabstop=4:smarttab:expandtab
#

import os
import sys

# Pick the library to test: 2 = PyPDF2, 3 = PyPDF3, 4 = PyPDF4.
ver_to_test = 4

if ver_to_test == 4:
    from pypdf import PdfFileReader  # https://github.com/claird/PyPDF4
elif ver_to_test == 3:
    from PyPDF3 import PdfFileReader  # https://github.com/mstamy2/PyPDF3
else:
    from PyPDF2 import PdfFileReader  # https://github.com/mstamy2/PyPDF2 / https://pythonhosted.org/PyPDF2/

print('Python %s on %s' % (sys.version, sys.platform))

filename = 'title_bug.pdf'
f = open(filename, 'rb')
pdf = PdfFileReader(f)
info = pdf.documentInfo
#print(info)
print('title attribute %r' % info.title)  # reports None
print('title getText() %r' % info.getText("/Title"))  # this is what .title property calls
print('title get() %r' % info.get("/Title"))  # this is part of what dict[] does
print('title get().getObject() %r' % info.get("/Title").getObject())  # this is what dict[] does
print('/Title dict entry %r' % info['/Title'])  # with test pdf works
print('title attribute %r' % info.title)  # Sanity check it is still None
print('title Workaround %r' % (info.title or info['/Title']))  # Workaround
f.close()
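
The workaround on the last print line can be wrapped in a small helper so callers do not repeat the fallback. This is only a sketch: `resolve_title` is a hypothetical name, and it assumes `info` behaves like the object returned by `pdf.documentInfo` (dict-like, with a `.title` property). Unlike the one-line workaround, it also returns None rather than raising KeyError when /Title is absent.

```python
def resolve_title(info):
    """Return the document title, falling back to the raw /Title entry.

    Assumption: `info` is dict-like and exposes a `.title` attribute,
    as the object returned by PdfFileReader.documentInfo does.
    """
    if info is None:
        # Some PDFs carry no document information dictionary at all.
        return None

    # Prefer the property; in the affected files it is None.
    title = getattr(info, "title", None)
    if title:
        return title

    # Fall back to the dictionary entry, which resolved correctly
    # for the sample PDF in the test case above.
    try:
        return info["/Title"]
    except KeyError:
        return None
```

With the test case above, `print(resolve_title(pdf.documentInfo))` would be called before `f.close()`, since the reader may still need the open file.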
clach04 commented 2 years ago

fixed upstream