html or xml converter fails with TypeError: write() argument must be str, not bytes

euske / pdfminer

Python PDF Parser (Not actively maintained). Check out pdfminer.six.

https://github.com/pdfminer/pdfminer.six

MIT License


html or xml converter fails with TypeError: write() argument must be str, not bytes #269

Open Prasaddiwalkar opened 4 years ago

Prasaddiwalkar commented 4 years ago

Running python pdf2txt.py -t xml -o output.xml -d %pdffilepath% fails with the following error:

Traceback (most recent call last):
  File "pdf2text.py", line 113, in <module>
    if __name__ == '__main__': sys.exit(main(sys.argv))
  File "pdf2text.py", line 92, in main
    device = XMLConverter(rsrcmgr, outfp, laparams=laparams, imagewriter=imagewriter, stripcontrol=stripcontrol)
  File "C:\apps\Python37\lib\site-packages\pdfminer\converter.py", line 442, in __init__
    self.write_header()
  File "C:\apps\Python37\lib\site-packages\pdfminer\converter.py", line 453, in write_header
    self.write('<?xml version="1.0" encoding="%s" ?>\n' % self.codec)
  File "C:\apps\Python37\lib\site-packages\pdfminer\converter.py", line 448, in write
    self.outfp.write(text)
TypeError: write() argument must be str, not bytes
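
The error message is what Python 3 raises when bytes are written to a file opened in text mode. A minimal sketch of that mismatch follows, assuming (this is an inference from the traceback, not something confirmed in this thread) that the output file is opened with mode "w" while the converter encodes text before writing. The names write_encoded and try_mode are hypothetical helpers for illustration only and are not part of pdfminer.

```python
import os
import tempfile


def write_encoded(outfp, text, codec="utf-8"):
    """Hypothetical stand-in for a converter write(): encode, then write bytes."""
    outfp.write(text.encode(codec))


def try_mode(mode):
    """Open a temporary file in the given mode and report what the write does."""
    fd, path = tempfile.mkstemp(suffix=".xml")
    os.close(fd)
    try:
        with open(path, mode) as outfp:
            write_encoded(outfp, '<?xml version="1.0" encoding="utf-8" ?>\n')
        return "ok"
    except TypeError as exc:
        # Text-mode files accept only str, so the bytes payload is rejected.
        return str(exc)
    finally:
        os.remove(path)


text_mode_result = try_mode("w")     # text mode: expects str
binary_mode_result = try_mode("wb")  # binary mode: expects bytes
print("text mode:  ", text_mode_result)
print("binary mode:", binary_mode_result)
```

If the assumption holds, opening the output file in binary mode (or not encoding before writing) would be the direction of a fix; which one applies depends on the script's actual code.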

6A61736F6E206E61646572 commented 4 years ago

Likely related - dumppdf.py also fails similarly:

Traceback (most recent call last):
  File "/home/wynand/.virtualenvs/pdf/bin/dumppdf.py", line 272, in <module>
    if __name__ == '__main__': sys.exit(main(sys.argv))
  File "/home/wynand/.virtualenvs/pdf/bin/dumppdf.py", line 269, in main
    dumpall=dumpall, mode=mode, extractdir=extractdir)
  File "/home/wynand/.virtualenvs/pdf/bin/dumppdf.py", line 222, in dumppdf
    dumptrailers(outfp, doc)
  File "/home/wynand/.virtualenvs/pdf/bin/dumppdf.py", line 95, in dumptrailers
    out.write('<trailer>\n')
TypeError: a bytes-like object is required, not 'str'
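
This is the same mismatch in the opposite direction: a str is written to a stream that accepts only bytes. A small sketch using only the standard library is shown below; it assumes the output stream is binary (inferred from the error message) and uses an in-memory io.BytesIO as a stand-in for the real output file. Wrapping the binary stream in io.TextIOWrapper is one possible workaround, not necessarily what the script's maintainers would choose.

```python
import io

# In-memory stand-in for an output file opened in binary mode.
raw = io.BytesIO()

error_message = None
try:
    raw.write('<trailer>\n')  # str into a binary stream
except TypeError as exc:
    error_message = str(exc)

# Wrapping the binary stream lets str writes be encoded on the way out.
out = io.TextIOWrapper(raw, encoding="utf-8", newline="\n")
out.write('<trailer>\n')
out.flush()
written = raw.getvalue()

print(error_message)
print(written)
```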

Prasaddiwalkar commented 4 years ago

Yes, I observed the same failure with dumppdf.py as well:

Traceback (most recent call last):
  File "dumppdf.py", line 272, in <module>
    if __name__ == '__main__': sys.exit(main(sys.argv))
  File "dumppdf.py", line 269, in main
    dumpall=dumpall, mode=mode, extractdir=extractdir)
  File "dumppdf.py", line 222, in dumppdf
    dumptrailers(outfp, doc)
  File "dumppdf.py", line 95, in dumptrailers
    out.write('<trailer>\n')
TypeError: a bytes-like object is required, not 'str'

6A61736F6E206E61646572 commented 4 years ago

It also fails with the sample PDFs provided in the repo, so it's not an issue with our files.

6A61736F6E206E61646572 commented 4 years ago

pdfminer.six works, going to use that for now.

Prasaddiwalkar commented 4 years ago

Yes, pdfminer.six works for me as well, but its XML output has one node per character.

I expected a text node for each word or each line.
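
One way to get line-level or word-level text from per-character XML is to post-process it. The sketch below assumes the XML groups character-level text elements under textline elements that carry a bbox attribute; the sample document is hand-written for illustration and is not real converter output, and lines_from_xml is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the assumed shape: one <text> node per character.
SAMPLE_XML = """<pages>
<page id="1">
<textbox id="0">
<textline bbox="10,700,60,712">
<text bbox="10,700,20,712">H</text>
<text bbox="20,700,30,712">i</text>
<text> </text>
<text bbox="35,700,45,712">y</text>
<text bbox="45,700,55,712">o</text>
<text bbox="55,700,60,712">u</text>
<text>
</text>
</textline>
</textbox>
</page>
</pages>"""


def lines_from_xml(xml_text):
    """Join the per-character <text> nodes of each <textline> into one string."""
    root = ET.fromstring(xml_text)
    lines = []
    for textline in root.iter("textline"):
        chars = "".join(node.text or "" for node in textline.iter("text"))
        lines.append({"bbox": textline.get("bbox"), "text": chars.strip()})
    return lines


lines = lines_from_xml(SAMPLE_XML)
# Splitting each joined line on whitespace gives word-level text.
words = [word for line in lines for word in line["text"].split()]
print(lines)
print(words)
```

The per-character bounding boxes are discarded here; a word-level bbox would have to be computed by merging the boxes of the characters in each word.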

Prasaddiwalkar commented 4 years ago

With pdfminer.six, neither the text output nor the XML output preserves the order of the text in the PDF file.

wvanrensburg commented 4 months ago

Any update here?