euske / pdfminer

Python PDF Parser (Not actively maintained). Check out pdfminer.six.
https://github.com/pdfminer/pdfminer.six
MIT License

Problem converting pdf to txt with pdf2txt.py #104

Open JSB97 opened 9 years ago

JSB97 commented 9 years ago

I am trying to convert the following PDF to text: http://www.kabupro.jp/edp/20140529/S1001UPO.pdf

I am using the following command: pdf2txt.py -o text.txt S1001UPO.pdf

The document is encrypted, so I removed the encryption first; however, even after doing this I get the error below.

I suspect the issue is related to "TypeError: must be encoded string without NULL bytes, not str", for which this question seems to offer a solution: http://stackoverflow.com/questions/18265084/typeerror-must-be-string-without-null-bytes-not-str

Could someone point me to a workaround? Thank you!

    Traceback (most recent call last):
      File "/Users/JB1/anaconda/bin/pdf2txt.py", line 115, in <module>
        if __name__ == '__main__': sys.exit(main(sys.argv))
      File "/Users/JB1/anaconda/bin/pdf2txt.py", line 109, in main
        interpreter.process_page(page)
      File "/Users/JB1/anaconda/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 833, in process_page
        self.render_contents(page.resources, page.contents, ctm=ctm)
      File "/Users/JB1/anaconda/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 844, in render_contents
        self.init_resources(resources)
      File "/Users/JB1/anaconda/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 348, in init_resources
        self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
      File "/Users/JB1/anaconda/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 196, in get_font
        font = self.get_font(None, subspec)
      File "/Users/JB1/anaconda/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 187, in get_font
        font = PDFCIDFont(self, spec)
      File "/Users/JB1/anaconda/lib/python2.7/site-packages/pdfminer/pdffont.py", line 668, in __init__
        self.unicode_map = CMapDB.get_unicode_map(self.cidcoding, self.cmap.is_vertical())
      File "/Users/JB1/anaconda/lib/python2.7/site-packages/pdfminer/cmapdb.py", line 276, in get_unicode_map
        data = klass._load_data('to-unicode-%s' % name)
      File "/Users/JB1/anaconda/lib/python2.7/site-packages/pdfminer/cmapdb.py", line 247, in _load_data
        if os.path.exists(path):
      File "/Users/JB1/anaconda/lib/python2.7/genericpath.py", line 18, in exists
        os.stat(path)
    TypeError: must be encoded string without NULL bytes, not str
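
The last frame of the traceback can be reproduced in isolation: os.stat() rejects any path that contains a NUL character. The following is a minimal sketch, not part of pdfminer; `stat_error` is a hypothetical helper and the path is made up for illustration. On Python 2.7 the exception is the TypeError shown in the traceback, while Python 3 raises ValueError instead.

```python
import os


def stat_error(path):
    """Return the exception os.stat() raises for `path`, or None on success."""
    try:
        os.stat(path)
    except (TypeError, ValueError, OSError) as exc:
        return exc
    return None


# Illustrative path only: a NUL character embedded in the file name.
err = stat_error('some-cmap-name\x00.pickle.gz')
print(type(err).__name__, err)
```

If this reproduces the error, the PDF itself is probably fine and the problem is the string that ends up in the file path.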

tataganesh commented 7 years ago

@JSB97 I encountered the same error. The problematic snippet in cmapdb.py seems to be:

    def _load_data(klass, name):
        filename = '%s.pickle.gz' % name
        if klass.debug:
            print >>sys.stderr, 'loading:', name
        cmap_paths = (os.environ.get('CMAP_PATH', '/usr/share/pdfminer/'),
                      os.path.join(os.path.dirname(__file__), 'cmap'),)
        for directory in cmap_paths:
            path = os.path.join(directory, filename)

Printing the variable "filename" gives me: to-unicode-PDFXC30-Identity.pickle.gz

Printing "repr(filename)" yields: 'to-unicode-PDFXC30-Identity\x00\x00.pickle.gz'

Apparently, these \x00 characters are causing the issue. One fix that worked for me was: filename = filename.replace('\0', '')

I am not sure what introduces these characters, though. @euske Is there a way to make a permanent fix for this?
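
The workaround described here can be sketched as a standalone helper. This is only an illustration, not pdfminer's actual code: `cmap_filename` and `find_cmap_file` are hypothetical names, and the sketch assumes that a NUL character is never a legitimate part of a CMap name, so stripping it is safe.

```python
import os


def cmap_filename(name):
    """Build the pickle file name for a CMap name, dropping NUL characters."""
    # Assumption: '\x00' never belongs in a CMap name, so removing it
    # cannot turn one valid name into another.
    clean = name.replace('\0', '')
    return '%s.pickle.gz' % clean


def find_cmap_file(name, directories):
    """Return the first existing path for the CMap file, or None."""
    filename = cmap_filename(name)
    for directory in directories:
        path = os.path.join(directory, filename)
        if os.path.exists(path):
            return path
    return None
```

Cleaning the name before the path is built means every directory lookup uses the sanitized file name, rather than patching each os.path call separately.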

tataganesh commented 7 years ago

A fork of the pdfminer.six repository has been created at https://github.com/strideai/pdfminer.six. This issue has been fixed in the fork, and we will be maintaining it from now on.

softboy99 commented 1 year ago

Hi @tataganesh, I tested it and it still fails. simple1.pdf