TeamHG-Memex / html-text

Extract text from HTML
MIT License

Don't always insert spaces around inline tags? #16

Open lopuhin opened 5 years ago

lopuhin commented 5 years ago

Example input: <div><strong>fo</strong>o</div><div>bar</div>. Current output: 'fo o\nbar', while desired output is 'foo\nbar'.

At the same time, in the changelog I find this:

Fix unwanted joins of words with inline tags: spaces are added for inline tags too, but a heuristic is used to preserve punctuation without extra spaces.

So it's not entirely clear if we should always avoid adding spaces around all inline tags. Maybe we could start with not adding them around tags such as strong, em, etc.

Also note that the more common case, such as <div><strong>foo</strong>, next</div>, is already handled correctly: foo, next.
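To make the suggestion about strong, em, etc. concrete, here is a minimal sketch built on the standard library's html.parser rather than on html-text itself. The tag sets and the names SketchExtractor and extract_text_sketch are hypothetical, chosen for illustration only; this is not how html-text is implemented, and it ignores the punctuation heuristic mentioned in the changelog.

```python
from html.parser import HTMLParser

# Hypothetical tag sets for illustration; not html-text's real configuration.
NO_SPACE_INLINE = {"strong", "em", "b", "i", "span"}
BLOCK_TAGS = {"div", "p", "li", "br", "h1", "h2", "h3"}


class SketchExtractor(HTMLParser):
    """Collects text, deciding per tag what separator (if any) to emit."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def _boundary(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")   # block tags start a new line
        elif tag not in NO_SPACE_INLINE:
            self.parts.append(" ")    # other inline tags still get a space
        # Tags in NO_SPACE_INLINE emit nothing, so adjacent text is joined.

    def handle_starttag(self, tag, attrs):
        self._boundary(tag)

    def handle_endtag(self, tag):
        self._boundary(tag)

    def handle_data(self, data):
        self.parts.append(data)


def extract_text_sketch(html):
    parser = SketchExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace inside each line and drop empty lines.
    lines = (" ".join(line.split()) for line in "".join(parser.parts).split("\n"))
    return "\n".join(line for line in lines if line)


print(extract_text_sketch("<div><strong>fo</strong>o</div><div>bar</div>"))
```

With this rule the first example yields foo and bar on separate lines, while tags outside the no-space set (a, for instance) still separate words. The trade-off is the one the changelog describes: pages that rely on styling to separate adjacent strong or span elements would have their words joined.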

ivanprado commented 5 years ago

Another example:

>>> html_text.extract_text("<p><span>N</span>o one is responsible</p>", True)
N o one is responsible

I did a quick test in Chrome, and it does not add spaces between inline elements. Is there any case in which it is clear that spaces should be added?

lopuhin commented 5 years ago

@ivanprado please see https://github.com/TeamHG-Memex/html-text/pull/2 and https://github.com/TeamHG-Memex/html-text/issues/1; they contain some examples from the wild where adding spaces makes sense.

lopuhin commented 5 years ago

If we had information about the actual CSS properties of the elements, we could do a better job, but that would probably be out of scope for html-text.

ivanprado commented 5 years ago

I see. My experience with article bodies is that it is a better policy not to add any spacing when removing span tags; it seems to give much better results. But I can understand that for other parts of a page, or for different pages, the case could be different (as in the examples in https://github.com/TeamHG-Memex/html-text/issues/1#issue-231630496).