Alir3z4 / html2text

Convert HTML to Markdown-formatted text.
alir3z4.github.io/html2text/
GNU General Public License v3.0

Extra space after a closing emphasis mark #405

Open ropery opened 9 months ago

ropery commented 9 months ago
$ echo '<em>hello</em>'{\,,\",:,\[,.,\!,\?}'<br>' | html2text
_hello_ ,  
_hello_ "  
_hello_ :  
_hello_[  
_hello_.  
_hello_!  
_hello_?  

Note that in the first three lines of the output there is an extra space after the closing _ emphasis mark. (The two trailing spaces on every line come from the <br>.)

This is a bug, because Markdown has no problem with a punctuation mark immediately following the closing emphasis mark:

$ echo _hello_{\,,\",:,\[,.,\!,\?} | markdown
<p><em>hello</em>, <em>hello</em>&ldquo; <em>hello</em>: <em>hello</em>[ <em>hello</em>. <em>hello</em>! <em>hello</em>?</p>

The same rendered by GitHub: hello, hello" hello: hello[ hello. hello! hello?
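
The expectation above can be stated as a small rule. The following is only a sketch, not html2text's actual code: `needs_space_after_close` is a hypothetical helper name, and the alphanumeric test is a simplifying assumption about when a closing `_` would stop being recognized as emphasis.

```python
def needs_space_after_close(mark: str, next_char: str) -> bool:
    """Return True if a separator is needed after a closing emphasis mark.

    Assumption (not taken from html2text): only an underscore-based mark
    that is directly followed by an alphanumeric character fails to close
    the emphasis, so punctuation never needs a separating space.
    """
    if not next_char or next_char.isspace():
        return False
    return mark.startswith("_") and next_char.isalnum()


# Reproduce the seven cases from the report, with no space before punctuation.
for ch in ',":[.!?':
    sep = " " if needs_space_after_close("_", ch) else ""
    print(f"_hello_{sep}{ch}")
```

Under this rule all seven punctuation cases come out without the extra space, which matches what the Markdown renderers above accept.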

I guess the extra space is added here:

https://github.com/Alir3z4/html2text/blob/099c4b8bfeea09d640e18324bb1d44f051371940/html2text/__init__.py#L295-L297

Or here, which explains why the bottom four results don't have the extra space:

https://github.com/Alir3z4/html2text/blob/099c4b8bfeea09d640e18324bb1d44f051371940/html2text/__init__.py#L860-L868

ropery commented 9 months ago

I would like to add that maybe we should simply not add extra spaces around stressed text at all:

$ for i in _ \* __ \*\*; do echo "${i}foo${i}bar${i}baz${i}"; done
_foo_bar_baz_
*foo*bar*baz*
__foo__bar__baz__
**foo**bar**baz**

My markdown produces:

$ for i in _ \* __ \*\*; do echo "${i}foo${i}bar${i}baz${i}" | markdown; done
<p><em>foo_bar_baz</em></p>
<p><em>foo</em>bar<em>baz</em></p>
<p><strong>foo</strong>bar<strong>baz</strong></p>
<p><strong>foo</strong>bar<strong>baz</strong></p>

But GitHub's rendering disagrees for the third __foo__bar__baz__: _foo_barbaz foobarbaz foobarbaz foobarbaz

$ for i in _ \* __ \*\*; do echo "${i}foo${i}bar${i}baz${i}" | markdown | html2text; done
_foo_bar_baz_

_foo_ bar _baz_

**foo** bar**baz**

**foo** bar**baz**

So it seems that, if we want to add extra spaces at all, they are only needed when the stress mark is _ or __. The * and ** marks don't require extra spaces for Markdown to apply the stress, e.g., ***a**b* -> ab = ok.

This leads to the question: should -e be the default, or should html2text automatically use * wherever _ would require extra spaces (spaces which irreversibly distort the text)?
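
To make the second option concrete, here is a rough sketch of falling back to asterisks only where an underscore would sit against a word character. It is not how html2text is implemented: `pick_emphasis_mark` and `emphasize` are hypothetical names, and "alphanumeric neighbour" is an assumed stand-in for the renderer's real intraword rule.

```python
def pick_emphasis_mark(prev_char: str, next_char: str, preferred: str = "_") -> str:
    """Pick a delimiter for stressed text without inserting spaces.

    Assumption: an underscore mark is unsafe when the character just
    before the opening mark or just after the closing mark is
    alphanumeric; an asterisk mark of the same length is used instead.
    """
    intraword = prev_char.isalnum() or next_char.isalnum()
    if preferred.startswith("_") and intraword:
        return "*" * len(preferred)
    return preferred


def emphasize(before: str, text: str, after: str, preferred: str = "_") -> str:
    """Wrap `text` in a mark chosen from its immediate neighbours."""
    mark = pick_emphasis_mark(before[-1:], after[:1], preferred)
    return f"{before}{mark}{text}{mark}{after}"


print(emphasize("", "hello", ","))       # underscore kept next to punctuation
print(emphasize("", "foo", "bar"))       # asterisk used next to a word character
print(emphasize("", "foo", "bar", "__"))
```

The trade-off is that the output no longer uses one mark consistently, but no whitespace that was absent from the HTML is introduced.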