Alir3z4 / html2text

Convert HTML to Markdown-formatted text.
alir3z4.github.io/html2text/
GNU General Public License v3.0

Unexpected whitespaces caused by <br> #199

Open Dunedan opened 6 years ago

Dunedan commented 6 years ago

Using the following code, I get unexpected whitespace in front of each \n that replaces a <br>:

>>> import html2text
>>> h = html2text.HTML2Text()
>>> h.single_line_break = True
>>> h.handle("foo<br><br><br>bar")
'foo  \n  \n  \nbar\n\n'

Additionally, the single_line_break setting doesn't seem to have any effect at all.

ExplodingCabbage commented 4 years ago

These aren't entirely unexpected. In some dialects of Markdown (e.g. the one used on Stack Overflow), single newlines in the Markdown are ordinarily ignored and don't produce line breaks in the rendered HTML; to produce a <br> you need two trailing spaces before the newline.

However, in some other dialects of Markdown (e.g. the one on GitHub that I'm using to write this message) that's not the case, and single newlines in the Markdown get rendered as line breaks.

It'd therefore be nice to add an option that lets us choose between the two dialects.
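Until such an option exists, the choice between the two dialects could be approximated by post-processing the string that html2text returns. The following is a minimal sketch, not part of html2text: the helper name `convert_breaks` and its `newline_is_break` parameter are hypothetical, and it assumes that hard breaks always appear as two or more spaces directly before a newline, as in the output reported above.

```python
import re

# Hypothetical helper, not part of html2text. It works on the Markdown
# string that html2text has already produced.
# Assumption: a hard line break is two or more spaces right before "\n".
_HARD_BREAK = re.compile(r" {2,}\n")


def convert_breaks(markdown: str, newline_is_break: bool = True) -> str:
    """Adapt hard line breaks to the target Markdown dialect.

    newline_is_break=True:  the renderer treats a bare newline as a line
                            break, so the trailing spaces are dropped.
    newline_is_break=False: the renderer needs the trailing spaces, so
                            the text is returned unchanged.
    """
    if newline_is_break:
        return _HARD_BREAK.sub("\n", markdown)
    return markdown


# The string below is the output reported at the top of this issue.
print(repr(convert_breaks("foo  \n  \n  \nbar\n\n")))
# 'foo\n\n\nbar\n\n'
```

Note that this sketch does not look at context, so it would also strip trailing spaces inside code blocks, and it says nothing about how a given renderer treats the blank lines that remain.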

rahil627 commented 2 years ago

i second what @ExplodingCabbage said...

I ran into this program through a tool used for migrating WordPress to Jekyll: exitwp.py. It uses this library. Now I’m trying to find a way to preserve those newlines, but I don’t really see an option, except the single-line-break, which doesn’t help. :(

This really sucks, as the migration tool is absolutely PERFECT, save this newline problem. (UNLESS... my XML file didn’t preserve line breaks in the first place. That would really suck ;( )

rahil627 commented 2 years ago

testing github newline

Strange that this works here in the issues, but not in the repo README or on Jekyll pages.