Open JSB97 opened 9 years ago
@JSB97 I have also encountered the same error. The problematic snippet in cmapdb.py seems to be -
    def _load_data(klass, name):
        filename = '%s.pickle.gz' % name
        if klass.debug:
            print >>sys.stderr, 'loading:', name
        cmap_paths = (os.environ.get('CMAP_PATH', '/usr/share/pdfminer/'),
                      os.path.join(os.path.dirname(__file__), 'cmap'),)
        for directory in cmap_paths:
            path = os.path.join(directory, filename)
Printing the variable "filename" gives me -
to-unicode-PDFXC30-Identity.pickle.gz
Printing "repr(filename)" yields -
'to-unicode-PDFXC30-Identity\x00\x00.pickle.gz'
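The difference between the two outputs comes from NUL being a non-printing character: a plain print writes it to the terminal, where it is usually invisible, while repr escapes it as \x00. A minimal sketch of how one might check a string for embedded NULs follows; the helper name find_nul_positions is hypothetical and not part of pdfminer.

```python
def find_nul_positions(text):
    """Return the index of every NUL character found in text."""
    return [i for i, ch in enumerate(text) if ch == '\x00']


# The value reported by repr(filename) above.
filename = 'to-unicode-PDFXC30-Identity\x00\x00.pickle.gz'

print(filename)                      # NULs are emitted but typically not visible
print(repr(filename))                # NULs show up escaped as \x00
print(find_nul_positions(filename))  # indices of the embedded NULs
```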
Apparently, these \x00 characters are causing the issue. One fix that solved this issue for me was -
filename = filename.replace('\0', '')
I am not sure what is causing this issue, though.
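For reference, here is a self-contained sketch of that workaround, kept outside pdfminer so it can be run on its own. The function names strip_nul and cmap_filename are hypothetical; only the '%s.pickle.gz' pattern and the replace call come from the snippet above. It strips the NULs from the CMap name before the filename is built, which should have the same effect as cleaning the filename afterwards.

```python
def strip_nul(name):
    """Return name with every NUL character removed."""
    return name.replace('\x00', '')


def cmap_filename(name):
    """Build the pickle filename as in the quoted snippet, minus any NULs."""
    return '%s.pickle.gz' % strip_nul(name)


# CMap name with trailing NULs, as seen in the repr output above.
raw_name = 'to-unicode-PDFXC30-Identity\x00\x00'
print(repr(cmap_filename(raw_name)))
```

This only hides the symptom. Where the NULs come from (possibly the font's encoding name in the PDF itself) is not established in this thread.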
@euske Is there a way to make a permanent fix for this?
A fork of the repository pdfminer.six has been created at - https://github.com/strideai/pdfminer.six . This issue has been fixed in this fork, and we will now be maintaining the forked repository.
Hi @tataganesh, I tested this and it still fails on simple1.pdf.
I am trying to convert the following PDF to text: http://www.kabupro.jp/edp/20140529/S1001UPO.pdf
I am using the following command: pdf2txt.py -o text.txt S1001UPO.pdf
The document is encrypted, so I removed the encryption first; however, even after doing this I get the error below.
I suspect the issue is the "TypeError: must be encoded string without NULL bytes, not str", for which this seems to offer a solution: http://stackoverflow.com/questions/18265084/typeerror-must-be-string-without-null-bytes-not-str
Could someone point me to a workaround? Thank you!
Traceback (most recent call last):
  File "/Users/JB1/anaconda/bin/pdf2txt.py", line 115, in <module>
    if __name__ == '__main__': sys.exit(main(sys.argv))
  File "/Users/JB1/anaconda/bin/pdf2txt.py", line 109, in main
    interpreter.process_page(page)
  File "/Users/JB1/anaconda/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 833, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/Users/JB1/anaconda/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 844, in render_contents
    self.init_resources(resources)
  File "/Users/JB1/anaconda/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 348, in init_resources
    self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
  File "/Users/JB1/anaconda/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 196, in get_font
    font = self.get_font(None, subspec)
  File "/Users/JB1/anaconda/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 187, in get_font
    font = PDFCIDFont(self, spec)
  File "/Users/JB1/anaconda/lib/python2.7/site-packages/pdfminer/pdffont.py", line 668, in __init__
    self.unicode_map = CMapDB.get_unicode_map(self.cidcoding, self.cmap.is_vertical())
  File "/Users/JB1/anaconda/lib/python2.7/site-packages/pdfminer/cmapdb.py", line 276, in get_unicode_map
    data = klass._load_data('to-unicode-%s' % name)
  File "/Users/JB1/anaconda/lib/python2.7/site-packages/pdfminer/cmapdb.py", line 247, in _load_data
    if os.path.exists(path):
  File "/Users/JB1/anaconda/lib/python2.7/genericpath.py", line 18, in exists
    os.stat(path)
TypeError: must be encoded string without NULL bytes, not str
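The last frame of the traceback can be reproduced without pdfminer or any PDF, which helps confirm that the embedded NULs alone trigger the failure. This is a rough sketch: the helper stat_error is hypothetical and the 'cmap' directory is only a placeholder. On Python 3, os.stat raises ValueError (embedded null byte) for such a path instead of the Python 2 TypeError shown above, and newer versions of os.path.exists catch that error and return False.

```python
import os


def stat_error(path):
    """Return the exception raised by os.stat(path), or None if it succeeds."""
    try:
        os.stat(path)
    except (TypeError, ValueError, OSError) as exc:
        return exc
    return None


# Path built the same way as in _load_data, with the NULs left in place.
bad_path = os.path.join('cmap', 'to-unicode-PDFXC30-Identity\x00\x00.pickle.gz')

err = stat_error(bad_path)
print(type(err).__name__, err)
```

The path is rejected before the filesystem is consulted, so the result does not depend on whether the file exists.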