Alir3z4 / html2text

Convert HTML to Markdown-formatted text.
alir3z4.github.io/html2text/
GNU General Public License v3.0

html2text through pipe : AttributeError: 'str' object has no attribute 'decode' #287

Closed sebma closed 4 years ago

sebma commented 5 years ago

Hi,

I'm using:

$ html2text --version
2019.8.11
$ head -1 $(which html2text)
#!/usr/bin/python3.6
$ /usr/bin/python3.6 -V
Python 3.6.8

The URL I want to convert to text is http://www.peter-adam.com/jpv/JPV_Titles.php, which contains Chinese UTF-8 text:

$ curl -qs http://www.peter-adam.com/jpv/JPV_Titles.php | html2text
Traceback (most recent call last):
  File "/usr/local/bin/html2text", line 10, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/dist-packages/html2text/cli.py", line 262, in main
    data = data.decode(args.encoding, args.decode_errors)
AttributeError: 'str' object has no attribute 'decode'

So I tried it with http://www.google.fr, but that does not work either (with a different error):

$ curl -qs http://www.google.fr | html2text
Traceback (most recent call last):
  File "/usr/local/bin/html2text", line 10, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/dist-packages/html2text/cli.py", line 259, in main
    data = wrap_read()
  File "/usr/local/lib/python3.6/dist-packages/html2text/utils.py", line 203, in wrap_read
    return sys.stdin.read()
  File "/usr/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 5869: invalid continuation byte
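
The byte 0xe9 in this second traceback would be consistent with a page served as ISO-8859-1 (Latin-1), where 0xe9 is "é". That is an assumption about the response, not something the traceback states. A minimal Python sketch of the failure mode, using an illustrative sample string rather than the real page:

```python
# Assumption: the input is Latin-1 encoded; the sample text is illustrative only.
raw = "café noir".encode("latin-1")  # b'caf\xe9 noir'

error = None
try:
    # 0xe9 announces a multi-byte UTF-8 sequence, but the next byte
    # (a space) is not a continuation byte, so strict decoding fails.
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    error = str(exc)

# Decoding with the encoding the bytes were actually written in succeeds.
decoded = raw.decode("latin-1")

# Keeping UTF-8 but replacing undecodable bytes avoids the exception
# at the cost of losing the original character.
replaced = raw.decode("utf-8", errors="replace")

print(error)
print(decoded)
print(replaced)
```

The traceback above shows `cli.py` passing `args.encoding` and `args.decode_errors` to `decode()`, so the tool appears to expose both choices, but they only help once stdin is read as bytes rather than as already-decoded text.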

Can you help me?

aucampia commented 5 years ago

I get the same error. The problem is how stdin is read on Python 3: `sys.stdin.read()` already returns a `str`, and `cli.py` then calls `.decode()` on it.

$ curl --silent https://opensource.org/licenses/MIT | html2text - | head
Traceback (most recent call last):
  File "/home/iwana/.local/bin/html2text", line 10, in <module>
    sys.exit(main())
  File "/home/iwana/.local/lib/python3.7/site-packages/html2text/cli.py", line 262, in main
    data = data.decode(args.encoding, args.decode_errors)
AttributeError: 'str' object has no attribute 'decode'

Workaround for now:

$ curl --silent https://opensource.org/licenses/MIT | html2text /dev/stdin | head
Skip to main content

  * [Home](/)
  * [From the Board](/blog)
  * [Contact](/contact)
  * [Donate](/civicrm/contribute/transact?reset=1&id=2)
  * [Login](/user/login)

## Search form

Note that it works fine on Python 2.
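
To illustrate the Python 3 behaviour, here is a minimal sketch of reading stdin as bytes and decoding explicitly. `read_stdin_text` is a hypothetical helper written for this comment, not a function from html2text, and it is not necessarily what any eventual fix does. The `/dev/stdin` workaround presumably succeeds because a file path is opened in binary mode, but that is an assumption.

```python
import io
import sys


def read_stdin_text(encoding="utf-8", errors="strict", stream=None):
    """Read a whole binary stream and decode it (hypothetical helper).

    On Python 3, sys.stdin.read() returns str, which has no .decode().
    sys.stdin.buffer is the underlying binary stream and returns bytes.
    """
    if stream is None:
        stream = sys.stdin.buffer
    data = stream.read()  # bytes
    return data.decode(encoding, errors)


# Python 3 str objects have no decode(), which is what the traceback reports.
str_has_decode = hasattr("abc", "decode")

# Stand-in for piped input so the sketch runs without a real pipe.
fake_stdin = io.BytesIO("日本語 text".encode("utf-8"))
text = read_stdin_text(stream=fake_stdin)
print(str_has_decode, text)
```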

sebma commented 5 years ago

@aucampia Thanks, I've switched html2text to Python 2.

jdufresne commented 4 years ago

Thanks for reporting. This has been fixed on the master branch and will be in the next release.

iavael commented 4 years ago

@jdufresne could you tell me which commit contains the fix, please?

jdufresne commented 4 years ago

I believe it is b361467894fb277563b4547ec9d4df49f5e0c6e3

sergiomb2 commented 4 years ago

Hi, html2text-2019.9.26 fails one test on CentOS 7 (1 failed, 165 passed in 6.34 seconds):

test_command[/builddir/build/BUILD/html2text-2019.9.26/test/bodywidth_newline.html-cmdline_args10]

/usr/lib64/python3.6/subprocess.py:438: CalledProcessError
----------------------------- Captured stderr call -----------------------------
Traceback (most recent call last):
  File "/usr/lib64/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/builddir/build/BUILD/html2text-2019.9.26/html2text/__main__.py", line 3, in <module>
    main()
  File "/builddir/build/BUILD/html2text-2019.9.26/html2text/cli.py", line 306, in main
    sys.stdout.write(h.handle(data))
UnicodeEncodeError: 'ascii' codec can't encode character '\u2032' in position 224: ordinal not in range(128)
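
This last error is on the output side: stdout is using the ASCII codec, which would be expected if the build environment runs with a C/POSIX locale under Python 3.6 (an assumption, since the locale is not shown). A minimal sketch that reproduces the same class of error with an in-memory stream standing in for stdout, using an illustrative sample string:

```python
import io

# Hypothetical sample containing U+2032 (PRIME), the character from the traceback.
text = "5\u2032 clearance"

# A text stream that encodes as ASCII, standing in for stdout under a C locale.
ascii_out = io.TextIOWrapper(io.BytesIO(), encoding="ascii")
ascii_error = None
try:
    ascii_out.write(text)
    ascii_out.flush()
except UnicodeEncodeError as exc:
    ascii_error = str(exc)

# The same write succeeds when the stream encodes as UTF-8.
utf8_buffer = io.BytesIO()
utf8_out = io.TextIOWrapper(utf8_buffer, encoding="utf-8")
utf8_out.write(text)
utf8_out.flush()
encoded = utf8_buffer.getvalue()

print(ascii_error)
print(encoded)
```

If the locale is indeed the cause, running the test suite with the `PYTHONIOENCODING=utf-8` environment variable set, or with a UTF-8 locale, may avoid the failure.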