Alir3z4 / html2text

Convert HTML to Markdown-formatted text.
alir3z4.github.io/html2text/
GNU General Public License v3.0

--ignore-links flag creates new composite words in output #389

Open strizhechenko opened 2 years ago

strizhechenko commented 2 years ago

Hi! I'm running some natural language processing experiments and using html2text to turn web pages into text sources. My problem is that the text of adjacent links gets stuck together when I use the --ignore-links flag:

html2text --ignore-links <<< '<a href="/1">1</a><a href="/2">2</a><a href="/3">3 4</a><a href="/5">5</a>'
123 45

The example is deliberately simplified, but it shows how the flag "creates" new composite words. I've patched my local copy to emit a space after each ignored link, as a workaround with minimal changes:

if tag == "a" and self.ignore_links and not start:
    self.o(" ")
if tag == "a" and not self.ignore_links:

and it produces what I need:

html2text --ignore-links <<< '<a href="/1">1</a><a href="/2">2</a><a href="/3">3 4</a><a href="/5">5</a>'
1 2 3 4 5

Should I open a pull request for this? The code above is only a workaround, but if it would be useful, I'd be happy to make it cleaner and add tests, a changelog entry, etc.
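
For anyone who wants to reproduce the idea without patching html2text, here is a minimal standalone sketch. It does not use html2text's internals: `LinkTextExtractor`, `extract`, and the `pad_links` parameter are hypothetical names, built on the standard library's `html.parser` purely to illustrate the effect of emitting a space after each closing `</a>`. It also collapses whitespace at the end, which the two-line patch above does not do.

```python
from html.parser import HTMLParser


class LinkTextExtractor(HTMLParser):
    """Hypothetical stand-in for html2text: keeps text, drops link markup."""

    def __init__(self, pad_links=False):
        super().__init__()
        self.pad_links = pad_links  # assumption: illustrative switch, not an html2text option
        self.parts = []

    def handle_data(self, data):
        # Collect every text node exactly as it appears in the input.
        self.parts.append(data)

    def handle_endtag(self, tag):
        # Same idea as the patch: emit a space after each ignored link.
        if tag == "a" and self.pad_links:
            self.parts.append(" ")

    def text(self):
        # Collapse whitespace runs so the padding never yields double or trailing spaces.
        return " ".join("".join(self.parts).split())


def extract(html, pad_links=False):
    parser = LinkTextExtractor(pad_links=pad_links)
    parser.feed(html)
    parser.close()
    return parser.text()


SAMPLE = '<a href="/1">1</a><a href="/2">2</a><a href="/3">3 4</a><a href="/5">5</a>'
print(extract(SAMPLE))                  # 123 45
print(extract(SAMPLE, pad_links=True))  # 1 2 3 4 5
```

Without padding, the text of adjacent links is concatenated and reproduces the reported output; with padding, the words stay separate.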