Alir3z4 / html2text

Convert HTML to Markdown-formatted text.
alir3z4.github.io/html2text/
GNU General Public License v3.0
1.84k stars 277 forks

How to ignore the "#" that <h1>-<h6> generate #185

Open feiglex opened 7 years ago

feiglex commented 7 years ago

    import html2text

    html = "<p>hello, this is <em>html2text</em></p><strong>it is strong label</strong>"
    h = html2text.HTML2Text()
    print(h.handle(html))

    h.ignore_emphasis = True
    print(h.handle(html))

    html = "<h6>This is title</h6>"
    print(h.handle(html))  # value: ###### This is title

lroolle commented 7 years ago

This is really an html2markdown :XD because we're running into the same problem...


And the related code is right here...


        if hn(tag):
            self.p()
            if start:
                self.inheader = True
                self.o(hn(tag) * "#" + ' ')
            else:
                self.inheader = False
                return  # prevent redundant emphasis marks on headers

Hmm,,,,

https://github.com/Alir3z4/html2text/blob/aa67e1c3b78a2827cc396289139d85a33518d82c/html2text/__init__.py#L328
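
Since the snippet above writes `hn(tag) * "#" + ' '` at the start of every header, one possible workaround is to post-process the returned string and drop that prefix. This is only a rough sketch, not an html2text feature: `strip_heading_marks` is a hypothetical helper name, and it assumes headers are emitted as 1 to 6 `#` characters followed by a space at the start of a line.

```python
import re

# Assumption: a header line starts with 1-6 "#" characters and one space.
_HEADING_PREFIX = re.compile(r"^#{1,6} ", flags=re.MULTILINE)


def strip_heading_marks(markdown_text: str) -> str:
    """Remove the leading "#" markers from header lines, keeping the title text."""
    return _HEADING_PREFIX.sub("", markdown_text)


if __name__ == "__main__":
    print(strip_heading_marks("###### This is title\n\nbody text"))
```

Note that this would also strip the prefix from any other line that happens to start with `# `, for example a comment inside a code block, so it is only suitable when the input is simple.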

Alir3z4 commented 7 years ago

So you want to keep everything the same and only remove the # from the output?

feiglex commented 7 years ago

Yeah, I just want to keep the text I see on the website. Is there any configuration option to remove the #?

Alir3z4 commented 7 years ago

No, there's no option to disable the formatting. We aim to generate text that can at least be converted back to its original format, so removing all the formatting wouldn't be helpful.

However, if the goal is to remove all the HTML tags from the text, you can use lxml or BeautifulSoup to strip the tags, which will give you the plain text.
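
For illustration, the tag-stripping idea can be sketched with Python's standard-library `html.parser` instead of lxml or BeautifulSoup, so it runs without extra dependencies. The names `TextExtractor` and `strip_tags` are made up for this example, and the sketch simply concatenates text nodes without adding any separators.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment, ignoring all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.parts.append(data)


def strip_tags(html: str) -> str:
    """Return the text content of `html` with every tag removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


if __name__ == "__main__":
    print(strip_tags("<h6>This is title</h6>"))
```

Because nothing is inserted between elements, adjacent blocks run together, and the contents of `<script>` or `<style>` elements would be kept as text, so real pages likely need extra handling.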

There's a similar feature request: https://github.com/Alir3z4/html2text/issues/170

mj-dd commented 5 years ago

this should be called html2markdown

bjd-pfq commented 3 years ago

@mj-dd

this should be called html2markdown

Well said. This is a con: it cannot perform what its name implies.