Open JulienPalard opened 3 years ago
Hi! I'm currently dealing with a similar issue, which pointed me to the `escape_md_section` function and then to this thread. Same specs as those pointed out by @JulienPalard:
import html2text
to_convert = '<a href="[% manage %]" >Manage</a> your preferences'
text_maker = html2text.HTML2Text()
text_maker.images_to_alt = True
text_maker.ignore_emphasis = True
text_maker.bypass_tables = True
text_maker.ignore_tables = True
text_maker.links_each_paragraph = True
text_maker.wrap_links = False
text_maker.body_width = 2000
converted = text_maker.handle(to_convert)
print(converted)
This gives:
'[Manage](\\[% manage %\\]) your preferences'
When I'd expect something like:
[Manage]([% manage %]) your preferences
In fact, this behaviour seems pretty tied to Markdown, when all I want is plain text from HTML. Is there any way to turn down the markdown conversion? I know that's one of the goals of this library so it's a tough/dumb question :)
I don't think html2text needs to "get away from markdown" to get this fixed: there's no need for backslashes in this place for this to be proper Markdown (even if the function has markdown in its name). The function is just too cautious and adds backslashes where they're not really needed.
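Until the escaping is relaxed, one possible stopgap is to strip the extra backslashes after conversion. The following is only a minimal sketch: `unescape_link_destinations` and `INLINE_LINK` are hypothetical names, not part of html2text, and the regex assumes simple inline links whose text contains no `]`.

```python
import re

# Matches a simple inline Markdown link: [text](destination).
# Assumption: the link text contains no "]" and the destination
# contains no unescaped ")".
INLINE_LINK = re.compile(r"\[([^\]]*)\]\(((?:\\.|[^)\\])*)\)")


def unescape_link_destinations(markdown: str) -> str:
    """Drop backslashes before [ and ] inside link destinations only."""

    def fix(match: re.Match) -> str:
        text, dest = match.group(1), match.group(2)
        # Remove the backslash in front of each escaped bracket.
        dest = re.sub(r"\\([\[\]])", r"\1", dest)
        return f"[{text}]({dest})"

    return INLINE_LINK.sub(fix, markdown)


converted = "[Manage](\\[% manage %\\]) your preferences"
print(unescape_link_destinations(converted))
```

Escaped brackets outside of a link destination are left alone, so ordinary escaped text in the document body is not affected.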
Is there any way to turn down the markdown conversion? I know that's one of the goals of this library so it's a tough/dumb question :)
There is, actually. It's beautifulsoup4's `get_text()` method.
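BeautifulSoup's `get_text()` is the direct route. For anyone who would rather avoid the extra dependency, here is a rough sketch of the same idea using only the standard library's `html.parser` (a swapped-in technique, not BeautifulSoup itself). `TextExtractor` and `html_to_plain_text` are hypothetical names, and the sketch simply concatenates text nodes, so link targets and image alt text are dropped.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data: str) -> None:
        # Called for every run of text between tags.
        self.chunks.append(data)


def html_to_plain_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


print(html_to_plain_text('<a href="[% manage %]" >Manage</a> your preferences'))
# Manage your preferences
```

No Markdown escaping happens here because no Markdown is produced at all.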
Hi!
html2text --version: 2020.1.16
python --version: Python 3.9.0
Given `<img src="/bar.png"/> + direction`, html2text gives an unexpected backslash before the `+`.

This is due to `+ direction` being received by `handle_data` and given to `escape_md_section`, which escapes the `+` to avoid it being interpreted as an unordered list item, if I understand correctly. But here the `+`, not being at the beginning of the line, can't be interpreted as a list (or can it? ¹), so I think it should not be escaped. What do you think?

¹ In fact there are cases where a list can be interpreted as such while not being stuck to the beginning of the line, for example in a quote:
But - Here its - not a - list no - need to escape
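To make the proposal concrete, here is a minimal sketch of position-aware escaping. It is not html2text's implementation: `escape_list_markers` and `LIST_MARKER` are hypothetical names, and the rule it assumes is that a marker only needs escaping at the start of a line, after optional indentation and blockquote `>` prefixes, when followed by whitespace.

```python
import re

# A list marker (+, - or *) is only treated as one at the start of a line,
# after optional indentation and any number of ">" blockquote prefixes,
# and only when it is followed by a space or tab.
LIST_MARKER = re.compile(
    r"^((?:[ \t]*>)*[ \t]*)([+\-*])(?=[ \t])",
    re.MULTILINE,
)


def escape_list_markers(text: str) -> str:
    """Backslash-escape markers that could start a list; leave others alone."""
    return LIST_MARKER.sub(r"\1\\\2", text)


for sample in (
    "+ direction",
    "![](/bar.png) + direction",
    "> - quoted item",
    "But - Here its - not a - list no - need to escape",
):
    print(escape_list_markers(sample))
```

Only the first and third samples gain a backslash. The catch is that this works on whole lines of output, whereas `handle_data` only sees a fragment of text and cannot tell where on the line it will end up.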