buriy / python-readability

fast python port of arc90's readability tool, updated to match latest readability.js!
https://github.com/buriy/python-readability
Apache License 2.0
2.65k stars 348 forks

Break Inline tags #143

Open Amecom opened 4 years ago

Amecom commented 4 years ago

I noticed that sequences like

<div> <span> A <em> B </em> C </span> </div>

are transformed into

<p> A </p> <em> B </em> <p> C </p>

instead of something like:

<p> A <em> B </em> C </p>

This causes unwanted line breaks in the text, because the inline element ends up outside the paragraphs on either side of it.

You can see an example by parsing this article from an Italian newspaper:

https://www.repubblica.it/tecnologia/sicurezza/2020/04/01/news/zoom_dalla_privacy_ai_troll_tutti_i_guai_dell_app_per_videochat-252877928/
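
As a possible stopgap, the split fragments could be stitched back together after extraction. The following is only a minimal sketch using the standard library's `xml.etree.ElementTree`, not part of python-readability and not how the library itself works. The helper name `merge_split_inline` and the `INLINE_TAGS` set are hypothetical, and the heuristic assumes that any inline element sitting directly after a `<p>` belongs to that paragraph, which may not hold for every page.

```python
import xml.etree.ElementTree as ET

# Hypothetical set of tags treated as inline; adjust as needed.
INLINE_TAGS = {"a", "b", "i", "em", "strong", "span", "code"}


def _append_text(parent, text):
    """Append text after the last piece of content inside parent."""
    if not text:
        return
    if len(parent):
        last = parent[-1]
        last.tail = (last.tail or "") + text
    else:
        parent.text = (parent.text or "") + text


def merge_split_inline(container):
    """Fold <p>A</p><em>B</em><p>C</p> siblings back into one <p> (heuristic)."""
    merged = []        # children that remain direct children of container
    absorbing = False  # True right after an inline element was pulled into a <p>
    for child in list(container):
        prev = merged[-1] if merged else None
        if child.tag in INLINE_TAGS and prev is not None and prev.tag == "p":
            # Stray inline element directly after a paragraph: move it inside.
            prev.append(child)
            absorbing = True
        elif child.tag == "p" and absorbing:
            # Paragraph directly after an absorbed inline: merge its content.
            _append_text(prev, child.text)
            for sub in list(child):
                prev.append(sub)
            prev.tail = child.tail
            absorbing = False
        else:
            merged.append(child)
            absorbing = False
    container[:] = merged


# Usage: the broken shape from this issue, written without extra whitespace.
root = ET.fromstring("<div><p>A </p><em>B</em><p> C</p></div>")
merge_split_inline(root)
print(ET.tostring(root, encoding="unicode"))
```

This only treats the symptom on already-extracted, well-formed markup. A proper fix would presumably need to avoid splitting the inline content in the first place.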