Alir3z4 / html2text

Convert HTML to Markdown-formatted text.
alir3z4.github.io/html2text/
GNU General Public License v3.0

text wrapped at 78 characters ignoring body_width setting and partially ignoring -b option #315

Open jpjoines opened 4 years ago

jpjoines commented 4 years ago
I am using html2text version 2020.1.16 with Python 3.8.1. html2text wraps the text at 78 characters regardless of the setting of BODY_WIDTH in config.py or body_width in the parser. I set BODY_WIDTH = 0 in config.py, then:

>>> html = 'This line is longer than seventy-eight characters. It seems to be getting wrapped with backslash n line breaks at seventy-eight characters regardless of the body_width setting in config.py or in the parser.'
>>> len(html)
228

>>> plainmd = html2text.html2text(html)
>>> plainmd
'This line is longer than seventy-eight characters. It seems to be getting\nwrapped with backslash n line breaks at seventy-eight characters regardless\nof the _bodywidth setting in config.py or in the parser.\n\n'
>>> len('This line is longer than seventy-eight characters. It seems to be getting')
77

>>> parser = html2text.HTML2Text()
>>> parser.body_width = 0
>>> parser.body_width
0
>>> plainmd = html2text.html2text(html)
>>> plainmd
'This line is longer than seventy-eight characters. It seems to be getting\nwrapped with backslash n line breaks at seventy-eight characters regardless\nof the _bodywidth setting in config.py or in the parser.\n\n'

>>> parser.body_width = 22
>>> parser.body_width
22
>>> plainmd = html2text.html2text(html)
>>> plainmd
'This line is longer than seventy-eight characters. It seems to be getting\nwrapped with backslash n line breaks at seventy-eight characters regardless\nof the _bodywidth setting in config.py or in the parser.\n\n'

>>> parser.body_width = 99
>>> parser.body_width
99
>>> plainmd
'This line is longer than seventy-eight characters. It seems to be getting\nwrapped with backslash n line breaks at seventy-eight characters regardless\nof the _bodywidth setting in config.py or in the parser.\n\n'

However, at the command line, the output wraps on screen at the value given with -b, but a newline character is still inserted after 78 characters:

$ cat /tmp/test.html
This line is longer than seventy-eight characters. It seems to be getting\nwrapped with backslash n line breaks at seventy-eight characters regardless\nof the _bodywidth setting in config.py or in the parser.
$
$ wc -l /tmp/test.html; wc -c /tmp/test.html
1 /tmp/test.html
217 /tmp/test.html
$
$ python3.8 -m html2text -b 0 /tmp/test.html
This line is longer than seventy-eight characters. It seems to be getting\nwrapped with backslash n line breaks at seventy-eight characters regardless\nof the _bodywidth setting in config.py or in the parser.
$
$ echo 'This line is longer than seventy-eight characters. It seems to be getting' | wc -c
78
$
$ python3.8 -m html2text -b 22 /tmp/test.html
This line is longer than seventy-eight characters. It seems to be getting\nwrapped with backslash n line breaks at seventy- eight characters regardless\nof the _bodywidth setting in config.py or in the parser.

$ python3.8 -m html2text -b 99 /tmp/test.html
This line is longer than seventy-eight characters. It seems to be getting\nwrapped with backslash n line breaks at seventy-eight characters regardless\nof the _bodywidth setting in config.py or in the parser.

$

Enola-guy commented 4 years ago

You never actually use the parser on which you set the body width. You call html2text.html2text(html) directly, so you called the function three times with the same default settings each time. Call parser.handle(html) on the configured parser instead:

In [1]: import html2text

In [2]: html2text.__version__
Out[2]: (2020, 1, 16)

In [3]: html = 'This line is longer than seventy-eight characters. It seems to be getting wrapped with backslash n line breaks at seventy-eight characters regardless of the body_width setting in config.py or in the parser.'

In [4]: parser = html2text.HTML2Text()

In [5]: parser.body_width = 0

In [6]: parser.handle(html)
Out[6]: 'This line is longer than seventy-eight characters. It seems to be getting wrapped with backslash n line breaks at seventy-eight characters regardless of the body_width setting in config.py or in the parser.\n'

In [7]: parser = html2text.HTML2Text(bodywidth=0)

In [8]: parser.handle(html)
Out[8]: 'This line is longer than seventy-eight characters. It seems to be getting wrapped with backslash n line breaks at seventy-eight characters regardless of the body_width setting in config.py or in the parser.\n'

# This is getting wrapped as expected
In [9]: html2text.html2text(html)
Out[9]: 'This line is longer than seventy-eight characters. It seems to be getting\nwrapped with backslash n line breaks at seventy-eight characters regardless of\nthe body_width setting in config.py or in the parser.\n\n'
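
For anyone who hits the same confusion, here is a minimal, standard-library-only sketch of the pitfall discussed above. It is not html2text's real code: `SketchConverter`, `convert`, and `DEFAULT_BODY_WIDTH` are hypothetical names invented for illustration, and `textwrap.fill` stands in for the library's wrapping. It only shows why a setting on one instance cannot affect a module-level helper that builds its own instance.

```python
import textwrap

# Assumption: 78 mirrors the default wrap width reported in this thread.
DEFAULT_BODY_WIDTH = 78


class SketchConverter:
    """Hypothetical stand-in for a parser object with a body_width setting."""

    def __init__(self, body_width=DEFAULT_BODY_WIDTH):
        self.body_width = body_width

    def handle(self, text):
        # A width of 0 means "do not wrap", like body_width = 0 in the thread.
        if self.body_width == 0:
            return text + "\n"
        return textwrap.fill(text, width=self.body_width) + "\n"


def convert(text):
    """Hypothetical module-level helper: always builds a fresh converter."""
    return SketchConverter().handle(text)


text = " ".join(["word"] * 30)  # 149 characters, longer than the default width

parser = SketchConverter()
parser.body_width = 0

via_helper = convert(text)        # never sees `parser`, so defaults apply
via_parser = parser.handle(text)  # uses the instance that was configured

print(repr(via_helper))
print(repr(via_parser))
```

The point of the sketch is that `convert` never receives `parser`, so changing `parser.body_width` only matters when `parser.handle` is the call that does the work.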