Alir3z4 / html2text

Convert HTML to Markdown-formatted text.
alir3z4.github.io/html2text/
GNU General Public License v3.0

Support html2text conversion to text, both in API and command line #359

Open johnkw opened 3 years ago

johnkw commented 3 years ago

Many people have struggled with how to get html2text to actually convert to plain text instead of Markdown.

Previous tickets include:

There are also many Stack Overflow questions about this. Some answers recommend lxml or BeautifulSoup, but those do not do a good job of distinguishing printable text from non-printable content. html2text also extracts image alt text automatically, which is critical for screen readers and accessibility.

The following code can be used with the current API to almost get this working:

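import html2text

# Hack: replace the module-level heading helper so that no tag is treated as a heading.
html2text.hn = lambda _: 0

h = html2text.HTML2Text()
h.images_to_alt = True      # keep only the alt text of images
h.single_line_break = True  # one line break after block elements instead of two
h.ignore_emphasis = True    # no emphasis markers
h.ignore_links = True       # no link formatting
h.ignore_tables = True      # no table formatting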

The only part I see not working at the moment is <hr> tags, although there may be something else I have missed. If an option were added to handle <hr> tags, and the hn hack were formalized as an option, that might be sufficient to fix the API. Ideally there would also be a command-line alias that combines all of these options to produce plain text.
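
Until such an option exists, one possible workaround for the <hr> gap is to post-process the converted text. The sketch below is not part of html2text: strip_hr_markers is a hypothetical helper, and it assumes html2text renders <hr> as a line containing only "* * *". It would also drop a genuine text line that happens to consist of exactly that marker.

```python
import re

# Assumption: html2text renders <hr> as a line containing only "* * *".
_HR_LINE = re.compile(r"^[ \t]*\* \* \*[ \t]*$")


def strip_hr_markers(text: str) -> str:
    """Drop every line that consists solely of the assumed <hr> marker.

    Blank lines around a removed marker are left in place, and a trailing
    newline is not preserved because the text is split and re-joined.
    """
    kept = [line for line in text.splitlines() if not _HR_LINE.match(line)]
    return "\n".join(kept)
```

With the configuration above, this could be applied as strip_hr_markers(h.handle(html)), where html is the input document as a string.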

Hopefully everyone finding those earlier tickets will at least find this one now, which has most of the answer to the issue (for the API), regardless of whether this ticket is fully addressed in this repository.

jeremydouglass commented 3 years ago

See also the option of using pandoc's plain-text output instead, with an example added here: https://github.com/Alir3z4/html2text/issues/170#issuecomment-900437450
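
For anyone who wants to call pandoc from Python, a minimal sketch follows. It assumes a pandoc executable is installed and on PATH; html_to_plain_with_pandoc is a hypothetical wrapper name, and the invocation may differ from the example in the linked comment.

```python
import shutil
import subprocess


def html_to_plain_with_pandoc(html: str) -> str:
    """Convert an HTML string to plain text by piping it through pandoc."""
    # Fail with a clear message if pandoc is not installed.
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc executable not found on PATH")
    result = subprocess.run(
        ["pandoc", "--from=html", "--to=plain"],
        input=html,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pandoc exits with an error
    )
    return result.stdout
```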