Support html2text conversion to text, both in API and command line

Many people have struggled to get html2text to convert HTML to plain text instead of Markdown.

Previous tickets include:

Bug #170
Bug #185
https://github.com/aaronsw/html2text/issues/71

There are also many Stack Overflow questions on this. In some cases people recommend lxml or BeautifulSoup, but those do not do a good job of separating printable text from non-printable content. html2text can also substitute an image's alt text for the image, which is critical for screen readers and accessibility.

The following code can be used with the current API to almost get this working:

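import html2text

# Hack: replace the module-level hn() helper so that no tag is treated as a
# heading, which suppresses the Markdown "#" prefixes.
html2text.hn = lambda _: 0

h = html2text.HTML2Text()
h.images_to_alt = True       # emit only the alt text for images
h.single_line_break = True   # one newline after a block instead of two
h.ignore_emphasis = True     # no * or _ markers for <em>/<strong>
h.ignore_links = True        # drop link markup, keep the link text
h.ignore_tables = True       # no Markdown table formatting

print(h.handle("<h1>Title</h1><p>Some <b>bold</b> text</p>"))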

The only part I see not working at the moment is <hr> tags, although there may be other cases I have missed. If an option were added to handle <hr> tags, and the hn hack were formalized as an option, that might be enough to fix the API. Ideally the command line would also get a single alias that turns on all of the options needed to produce text.
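Until such an option exists, one possible stopgap for the <hr> case is to post-process the converted string. The sketch below assumes html2text renders <hr> as a line containing only "* * *"; the names strip_hr_markers and _HR_LINE are made up for illustration and are not part of html2text. It would also remove a line of literal "* * *" that came from the page text itself, so treat it as a workaround rather than a fix.

```python
import re

# Assumption: html2text writes <hr> as a line consisting only of "* * *".
_HR_LINE = re.compile(r"^[ \t]*\* \* \*[ \t]*$")


def strip_hr_markers(text: str) -> str:
    """Drop lines that consist solely of the "* * *" rule marker."""
    kept = [line for line in text.splitlines() if not _HR_LINE.match(line)]
    # Collapse any run of three or more newlines into a single blank line,
    # which also closes the gap left where a marker line was removed.
    cleaned = re.sub(r"\n{3,}", "\n\n", "\n".join(kept))
    return cleaned.strip("\n") + "\n"
```

With the configuration above, this would be applied to the converter output, for example strip_hr_markers(h.handle(html)).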

Hopefully everyone who finds those earlier tickets will at least find this one now. It has most of the answer for the API, regardless of whether this ticket is ever fully addressed in this repository.

Alir3z4 / html2text, issue #359