Alir3z4 / html2text

Convert HTML to Markdown-formatted text.
alir3z4.github.io/html2text/
GNU General Public License v3.0

Unexpected backslash #347

Open JulienPalard opened 3 years ago

JulienPalard commented 3 years ago

Hi!

Given <img src="/bar.png"/> + direction, html2text gives an unexpected backslash:

$ cat html
<img src="/bar.png"/> + direction
$ html2text html
![](/bar.png) \+ direction

This happens because handle_data receives " + direction" and passes it to escape_md_section, which escapes the + so that it is not interpreted as an unordered list item, if I understand correctly. But here the +, not being at the beginning of the line, can't be interpreted as a list marker (or can it? ¹), so I think it should not be escaped. What do you think?

¹ In fact, there are cases where a list is interpreted as such without being at the very beginning of the line, for example in a quote:

  • a quoted
  • unordered
  • list

But - Here it's - not a - list no - need to escape
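
The idea above (escape a + only where it could actually start a list item) can be sketched as follows. This is a minimal illustration under stated assumptions, not html2text's actual code: the pattern PLUS_AT_LINE_START, the function escape_plus, and the at_line_start flag are all hypothetical names, and the caller is assumed to know whether the chunk begins a new output line.

```python
import re

# Hypothetical pattern (an assumption, not copied from html2text):
# a "+" preceded only by whitespace on its line and followed by whitespace.
PLUS_AT_LINE_START = re.compile(r"^(\s*)(\+)(?=\s)", flags=re.MULTILINE)


def escape_plus(chunk: str, at_line_start: bool) -> str:
    """Escape '+' only where it could be read as a list marker.

    at_line_start says whether this chunk of text begins a new output line.
    If it does not, the first line of the chunk is left untouched, because
    a '+' there follows other content and cannot open a list item.
    """
    if at_line_start:
        return PLUS_AT_LINE_START.sub(r"\1\\\2", chunk)
    first, sep, rest = chunk.partition("\n")
    return first + sep + PLUS_AT_LINE_START.sub(r"\1\\\2", rest)


# Text following an image on the same line: left alone.
print(escape_plus(" + direction", at_line_start=False))
# Text that really starts a line: escaped.
print(escape_plus("+ direction", at_line_start=True))
```

The quoted-list case from the footnote is not covered by this sketch; a real fix would also have to treat the position right after a "> " prefix as a line start.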

g-piffa commented 3 years ago

Hi! I'm currently dealing with a similar issue, which pointed me to the escape_md_section function and then to this thread. The cause is the same as the one @JulienPalard pointed out:

import html2text

to_convert = '<a href="[% manage %]" >Manage</a> your preferences'

text_maker = html2text.HTML2Text()
text_maker.images_to_alt = True
text_maker.ignore_emphasis = True
text_maker.bypass_tables = True
text_maker.ignore_tables = True
text_maker.links_each_paragraph = True
text_maker.wrap_links = False
text_maker.body_width = 2000

converted = text_maker.handle(to_convert)

print(converted)
# Reported result, shown as a Python string literal:
# '[Manage](\\[% manage %\\]) your preferences'

When I'd expect something like:

[Manage]([% manage %]) your preferences

In fact, this behaviour seems pretty tied to Markdown, when all I want is plain text from HTML. Is there any way to turn off the Markdown conversion? I know that's one of the goals of this library so it's a tough/dumb question :)
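
Until the escaping itself is changed, one possible workaround is to post-process the converted string and drop the backslashes. The sketch below is only an illustration: strip_markdown_escapes is a hypothetical helper, not part of html2text, and it only handles the characters seen in this thread ([, ], + and -). It would also remove escapes that were genuinely needed, so it is only suitable when the output is treated as plain text.

```python
import re

# Hypothetical helper, not part of html2text: matches a backslash that
# precedes one of the characters discussed in this thread.
_ESCAPED = re.compile(r"\\([\[\]+\-])")


def strip_markdown_escapes(text: str) -> str:
    """Remove the backslash in front of '[', ']', '+' and '-'."""
    return _ESCAPED.sub(r"\1", text)


print(strip_markdown_escapes(r"[Manage](\[% manage %\]) your preferences"))
# prints: [Manage]([% manage %]) your preferences
```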

JulienPalard commented 3 years ago

I don't think html2text needs to "get away from Markdown" to get this fixed: no backslashes are needed here for the output to be proper Markdown (even if the function has "md" in its name). The function is just too cautious and adds backslashes where they are not really needed.

ThatXliner commented 3 years ago

> Is there any way to turn off the Markdown conversion? I know that's one of the goals of this library so it's a tough/dumb question :)

There is, actually. It's beautifulsoup4's get_text() method.
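
With beautifulsoup4 installed, that would be roughly BeautifulSoup(html, "html.parser").get_text(). For comparison, here is a dependency-free sketch of the same idea using only the standard library's html.parser module. It is a swapped-in technique, not BeautifulSoup, and the names TextExtractor and html_to_plain_text are made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML document, ignoring all tags."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def html_to_plain_text(html: str) -> str:
    """Return the concatenated text content of the given HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


print(html_to_plain_text('<a href="[% manage %]" >Manage</a> your preferences'))
# prints: Manage your preferences
```

Note that either approach drops the link target entirely, so the result is "Manage your preferences" rather than the "[Manage]([% manage %]) your preferences" form asked for above.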