Open lopuhin opened 5 years ago
Another example:
>>> html_text.extract_text("<p><span>N</span>o one is responsible</p>", True)
N o one is responsible
I did a quick test in Chrome and it does not add spaces between inline elements. Is there any case in which it is clear that spaces should be added?
@ivanprado please see https://github.com/TeamHG-Memex/html-text/pull/2 and https://github.com/TeamHG-Memex/html-text/issues/1, there are some examples from the wild where adding spaces makes sense.
If we had information about the actual CSS properties of the elements, we could do a better job, but that would probably be out of scope for html-text.
I see. My experience with article bodies is that it is a better policy not to add any spacing when removing span tags. It seems to give much better results. But I can understand that in other parts of a page, or on different pages, the situation could be different (as in the examples in https://github.com/TeamHG-Memex/html-text/issues/1#issue-231630496).
Example input: <div><strong>fo</strong>o</div><div>bar</div>. Current output: 'fo o\nbar', while desired output is 'foo\nbar'.
At the same time, in the changelog I find this:
So it's not entirely clear if we should always avoid adding spaces around all inline tags. Maybe we could start with not adding them around tags such as strong, em, etc.
Also note that more common usage such as <div><strong>foo</span>, next</div> is handled correctly regardless: foo, next.
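To make the proposal above concrete, here is a minimal sketch of the "no spaces around strong, em, etc." idea. It is not html-text's implementation: it uses only the standard library's html.parser, and the names NO_SPACE_TAGS, BLOCK_TAGS and extract_text_sketch, as well as the exact tag sets, are assumptions chosen for illustration.

```python
import re
from html.parser import HTMLParser

# Assumption: inline tags around which no whitespace is inserted.
NO_SPACE_TAGS = {"strong", "em", "b", "i", "span"}
# Assumption: block-level tags that start a new line.
BLOCK_TAGS = {"div", "p"}


class _TextCollector(HTMLParser):
    """Collects text, emitting separators only at selected tag boundaries."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def _boundary(self, tag):
        if tag in BLOCK_TAGS:
            self.chunks.append("\n")
        elif tag not in NO_SPACE_TAGS:
            # Any other tag still gets a space, as a conservative default.
            self.chunks.append(" ")

    def handle_starttag(self, tag, attrs):
        self._boundary(tag)

    def handle_endtag(self, tag):
        self._boundary(tag)

    def handle_data(self, data):
        # Collapse whitespace inside text nodes, roughly as a browser would.
        self.chunks.append(re.sub(r"\s+", " ", data))


def extract_text_sketch(html):
    """Hypothetical helper: extract text without spaces around inline tags."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks)
    # Trim each line, collapse runs of spaces, and drop empty lines.
    lines = (" ".join(line.split()) for line in text.split("\n"))
    return "\n".join(line for line in lines if line)


print(repr(extract_text_sketch("<div><strong>fo</strong>o</div><div>bar</div>")))
```

With these tag sets the inputs from this thread come out without the extra space, while a space that is already present in the markup is kept. Whether span belongs in the no-space set is exactly the open question here, given the examples from the wild in issue #1.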