Alir3z4 / html2text

Convert HTML to Markdown-formatted text.
alir3z4.github.io/html2text/
GNU General Public License v3.0

Don't add line breaks inside link names #339

Closed mborsetti closed 3 years ago

mborsetti commented 3 years ago

The following HTML code

<a href="http://example.com" title="MyTitle"> first example</a>
<br>
<a href="http://example.com" ><p> second example</p></a>

is converted by html2text version (2020, 1, 16) to the following Markdown (note the line break before the closing ] in the second example):

[ first example](http://example.com "MyTitle")  

[ second example

](http://example.com)

or with inline_links = False:

[ first example][1]  

[ second example

][2]

   [1]: http://example.com (MyTitle)

   [2]: http://example.com

The additional line break, besides looking off, does not appear to be allowed by the official specs (the blank line ends the paragraph before the closing ] is reached), and it does break converting the Markdown back to HTML.
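To make the problem concrete, here is a minimal standard-library sketch that flags link names containing a blank line. It is not part of this PR or of html2text: the function name `link_names_with_blank_lines` and the regular expression are illustrative assumptions, and the pattern only covers simple, non-nested link names.

```python
import re

# Assumption: a link name is the text between "[" and "]" when the "]" is
# directly followed by "(" (inline link) or "[" (reference-style link).
# Nested brackets inside the name are not handled by this sketch.
LINK_TEXT = re.compile(r"\[([^\[\]]*)\](?=\(|\[)")


def link_names_with_blank_lines(markdown: str) -> list[str]:
    """Return every link name that contains a blank line."""
    return [
        match.group(1)
        for match in LINK_TEXT.finditer(markdown)
        if re.search(r"\n[ \t]*\n", match.group(1))
    ]


# Output reported above for html2text (2020, 1, 16).
before = (
    '[ first example](http://example.com "MyTitle")  \n\n'
    "[ second example\n\n](http://example.com)\n"
)
# Output this PR aims to produce.
after = (
    '[ first example](http://example.com "MyTitle")  \n'
    "[ second example](http://example.com)\n"
)

print(link_names_with_blank_lines(before))
print(link_names_with_blank_lines(after))
```

Run against the two outputs quoted in this description, the check should flag only the second link of the old output and nothing in the fixed output.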

This PR fixes the code to produce the following correct Markdown:

[ first example](http://example.com "MyTitle")  
[ second example](http://example.com)

or with inline_links = False:

[ first example][1]  
[ second example][2]

   [1]: http://example.com (MyTitle)

   [2]: http://example.com

Python 3.9 code to replicate the above:

import html2text
print(f'{html2text.__version__=}')
data = ('<a href="http://example.com" title="MyTitle"> first example</a>\n'
        '<br>\n'
        '<a href="http://example.com" ><p> second example</p></a>')
parser = html2text.HTML2Text()
markdown = parser.handle(data)
print(markdown)
parser.inline_links = False
markdown = parser.handle(data)
print(markdown)

coveralls commented 3 years ago

Coverage Status

Coverage decreased (-1.07%) to 96.803% when pulling 0307527a415f43e4771d02e10494b7710cc6c8c6 on mborsetti:link_names into 296e6f24d16a36bf88b8042d56ebd69ec37aef9c on Alir3z4:master.

Alir3z4 commented 3 years ago

@mborsetti Thanks for taking care of it.

The code looks good to me; however, it'd be much better if we could add some documentation explaining the code changes, basically the "why" part of the documentation.

mborsetti commented 3 years ago

@Alir3z4 Sorry, I am new to contributing to the project, and the only documentation I see is usage.md, which is pretty minimal. What do you have in mind? Feel free to edit my PR at will if that's easier.

mborsetti commented 3 years ago

@Alir3z4 is there anything else you need from me before accepting this PR?

Markdown does not allow such breaks inside link names (between [ and ]), and they break Markdown-to-HTML parsers such as markdown2, which my project webchanges uses to reconstruct diffed data generated by html2text back into HTML.
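Until a fix is released, a downstream project could post-process the Markdown before handing it to a Markdown-to-HTML parser. The sketch below is only an illustration under stated assumptions: `collapse_link_names` is a hypothetical helper, not something webchanges or html2text is known to provide, and it handles only simple, non-nested link names. It also trims leading and trailing whitespace inside the name.

```python
import re

# Assumption: a link name is the text between "[" and "]" when the "]" is
# directly followed by "(" (inline link) or "[" (reference-style link).
LINK_TEXT = re.compile(r"\[([^\[\]]*)\](?=\(|\[)")


def collapse_link_names(markdown: str) -> str:
    """Collapse every run of whitespace inside a link name to one space."""

    def _fix(match: re.Match) -> str:
        # str.split() with no arguments splits on any whitespace, including
        # newlines, and drops leading/trailing whitespace.
        name = " ".join(match.group(1).split())
        return f"[{name}]"

    return LINK_TEXT.sub(_fix, markdown)


broken = "[ second example\n\n](http://example.com)\n"
print(collapse_link_names(broken))
```

This is a workaround on the consumer side only; fixing the output at the source, as this PR does, avoids the need for it.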

Alir3z4 commented 3 years ago

Sorry for the delayed response.

It's all good, thanks. Could you please resolve the conflict? I'll merge right away.

mborsetti commented 3 years ago

Thx, merge done