euske / pdfminer

Python PDF Parser (Not actively maintained). Check out pdfminer.six.
https://github.com/pdfminer/pdfminer.six
MIT License

AttributeError: 'PDFObjRef' object has no attribute 'decode' #249

Open swoltron opened 5 years ago

swoltron commented 5 years ago

I am using pdfminer's pdf2txt.py to extract text from various PDFs. It works very well in most cases, but on some files I get the error below, and I'm not sure what I can do to get pdfminer to work.

AttributeError: 'PDFObjRef' object has no attribute 'decode'

I have run the same command on other documents without problems; I only started seeing this error recently.

I am running this from the command line:

pdf2txt.py -t xml -F -1.0 test.pdf

This is the complete output from pdf2txt.py:

<?xml version="1.0" encoding="utf-8" ?>
<pages>
Traceback (most recent call last):
  File "/usr/local/bin/pdf2txt.py", line 116, in <module>
    if __name__ == '__main__': sys.exit(main(sys.argv))
  File "/usr/local/bin/pdf2txt.py", line 110, in main
    interpreter.process_page(page)
  File "/Library/Python/2.7/site-packages/pdfminer2-20151206-py2.7.egg/pdfminer/pdfinterp.py", line 834, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/Library/Python/2.7/site-packages/pdfminer2-20151206-py2.7.egg/pdfminer/pdfinterp.py", line 844, in render_contents
    self.init_resources(resources)
  File "/Library/Python/2.7/site-packages/pdfminer2-20151206-py2.7.egg/pdfminer/pdfinterp.py", line 350, in init_resources
    self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
  File "/Library/Python/2.7/site-packages/pdfminer2-20151206-py2.7.egg/pdfminer/pdfinterp.py", line 200, in get_font
    font = self.get_font(None, subspec)
  File "/Library/Python/2.7/site-packages/pdfminer2-20151206-py2.7.egg/pdfminer/pdfinterp.py", line 191, in get_font
    font = PDFCIDFont(self, spec)
  File "/Library/Python/2.7/site-packages/pdfminer2-20151206-py2.7.egg/pdfminer/pdffont.py", line 643, in __init__
    self.cidcoding = '%s-%s' % (self.cidsysteminfo.get('Registry', b'unknown').decode("latin1"),
AttributeError: 'PDFObjRef' object has no attribute 'decode'

Any help is appreciated!
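
The last frame of the traceback calls `.decode("latin1")` directly on the `Registry` entry, which in this file appears to be an indirect reference (`PDFObjRef`) rather than a byte string. Below is a minimal sketch of that failure and of a defensive decode that dereferences first. It assumes the reference exposes a `resolve()` method, as pdfminer's `PDFObjRef` does; `StubObjRef` and `decode_name` are hypothetical stand-ins so the snippet runs without pdfminer installed, and the `Adobe`/`Identity` values are placeholders, not data from the failing PDF.

```python
class StubObjRef:
    """Hypothetical stand-in for pdfminer's PDFObjRef (an indirect reference)."""

    def __init__(self, target):
        self._target = target

    def resolve(self):
        # The real class looks the object up in the document; the stub just returns it.
        return self._target


def decode_name(value, default=b"unknown"):
    """Follow indirect references, then decode a byte string as latin1."""
    # Keep dereferencing while the value still looks like a reference.
    while hasattr(value, "resolve"):
        value = value.resolve()
    if value is None:
        value = default
    if isinstance(value, bytes):
        return value.decode("latin1")
    return str(value)


# Placeholder CIDSystemInfo dictionary where Registry is stored indirectly.
cidsysteminfo = {"Registry": StubObjRef(b"Adobe"), "Ordering": b"Identity"}

# Calling .decode() on the reference itself reproduces the AttributeError.
try:
    cidsysteminfo["Registry"].decode("latin1")
except AttributeError as exc:
    error_message = str(exc)

# Dereferencing before decoding avoids it.
cidcoding = "%s-%s" % (
    decode_name(cidsysteminfo.get("Registry")),
    decode_name(cidsysteminfo.get("Ordering")),
)
print(cidcoding)
```

With pdfminer itself, `pdfminer.pdftypes.resolve1` performs this kind of dereferencing step.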

Yoshimasa-Kikuchi commented 3 years ago

Could you please check the output of "pip freeze"? If both "pdfminer" and "pdfminer.six" are installed, run "pip uninstall pdfminer" and "pip uninstall pdfminer.six". Once both are removed, run "pip install pdfminer.six".
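
The same check can be sketched in Python instead of reading the `pip freeze` output by eye. This assumes Python 3.8 or later (for `importlib.metadata`), which is not the Python 2.7 setup shown in the traceback; `find_pdfminer_variants` is a hypothetical helper name.

```python
from importlib import metadata


def find_pdfminer_variants(names):
    """Return the distribution names that look like a pdfminer package."""
    return sorted({n for n in names if n and n.lower().startswith("pdfminer")})


# Names of every distribution visible to the current interpreter.
installed = [dist.metadata["Name"] for dist in metadata.distributions()]
variants = find_pdfminer_variants(installed)
print(variants)

if len(variants) > 1:
    print("More than one pdfminer distribution is installed; they may shadow each other.")
```

The prefix match would also catch a distribution named like the `pdfminer2` egg that appears in the traceback paths.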

Yoshimasa-Kikuchi commented 3 years ago

Sorry, I only have a Python 3.7.7 environment, so I cannot offer a solution.

renatoromalves commented 3 years ago

I'm having the same problem, but only with files that were saved through "Microsoft Print to PDF". I'm trying to extract the text of a table that was converted to PDF. If I just save as PDF it works; if I print to PDF (through this "printer"), it doesn't. I hope this helps to solve the issue.

reema-dass26 commented 1 year ago

Hi, I am facing this error too. Unfortunately I can't modify the PDF file, so I need to handle this programmatically. Could you guide me if you have resolved it? My metadata has this as a field value: {'q': , 'Q': } and after I resolve it, it becomes {'q': <PDFStream(65): raw=3, {'Length': 3}>, 'Q': <PDFStream(64): raw=3, {'Length': 3}>}. I am not sure how to proceed from here.
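
One way to go from resolved references to usable values is to walk the structure recursively, dereferencing each entry and reading the bytes out of any stream. The sketch below is only an illustration under assumptions: `StubObjRef` and `StubStream` are hypothetical stand-ins that mimic the `resolve()` and `get_data()` methods of pdfminer's `PDFObjRef` and `PDFStream`, `to_plain` is an invented helper name, and the three-byte payloads are placeholders because the real stream contents are not shown in this thread.

```python
class StubObjRef:
    """Hypothetical stand-in for pdfminer's PDFObjRef."""

    def __init__(self, target):
        self._target = target

    def resolve(self):
        return self._target


class StubStream:
    """Hypothetical stand-in for pdfminer's PDFStream."""

    def __init__(self, data):
        self._data = data

    def get_data(self):
        return self._data


def to_plain(value):
    """Recursively replace references and streams with plain Python values."""
    # Follow indirect references until a concrete object is reached.
    while hasattr(value, "resolve"):
        value = value.resolve()
    # A stream is reduced to its bytes.
    if hasattr(value, "get_data"):
        return value.get_data()
    if isinstance(value, dict):
        return {key: to_plain(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_plain(item) for item in value]
    return value


# Placeholder data shaped like the field value described above.
field_value = {
    "q": StubObjRef(StubStream(b"abc")),
    "Q": StubObjRef(StubStream(b"xyz")),
}
plain = to_plain(field_value)
print(plain)
```

Real PDFs can contain circular references, so a sturdier version would also cap the recursion depth or track objects it has already visited.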

pamelabarylski commented 1 year ago

I am also facing this error on some PDF files. I was able to reproduce the "fix" of using a PDF created with "save as" instead of "print to PDF", but like reema-dass26, I don't always have the ability to do that. I can't believe we are the only two who have this problem...