TeamHG-Memex / html-text

Extract text from HTML
MIT License

Fix unwanted joins for inline tags #2

Closed lopuhin closed 7 years ago

lopuhin commented 7 years ago

Fixes #1 - see the examples by @codinguncut there. Inline tags are commonly used as block tags, and the current normalize-space() approach results in unwanted joining of words across tag boundaries; this branch fixes that by always adding whitespace between tags.
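
Roughly, the idea is something like this (a minimal sketch of the approach, not the actual patch; the function name and regexp are illustrative):

```python
import re

import lxml.html

_whitespace = re.compile(r'\s+')

def extract_text(html):
    # Join every text node with a space so that inline tags can never glue
    # words together, then collapse runs of whitespace into single spaces.
    tree = lxml.html.fromstring(html)
    return _whitespace.sub(' ', ' '.join(tree.xpath('//text()'))).strip()

# extract_text('<p>field<span>value</span></p>') -> 'field value'
# (plain normalize-space() on the same document gives 'fieldvalue')
```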

I checked the old vs. the new way on about 1000 HTML pages: on average the text is longer by 0.2% of characters, with most pages showing some difference. In all cases I inspected (about 10 pages) the new way is better, splitting apart words that had been joined without spaces, and I didn't find any unwanted splits.

It is almost 2x slower though: 7 s for 1000 HTML pages before, 11.5 s without the regexp, 12.5 s with the regexp (and caching). But I guess it's not that bad.

@codinguncut @kmike I would appreciate your review :) I have a vague memory that in some cases //text() is not what we want, but I can't recall which cases, and I didn't see anything bad in the tests I did.

codecov-io commented 7 years ago

Codecov Report

Merging #2 into master will not change coverage. The diff coverage is 100%.


@@          Coverage Diff          @@
##           master     #2   +/-   ##
=====================================
  Coverage     100%   100%           
=====================================
  Files           2      2           
  Lines          26     42   +16     
  Branches        1      6    +5     
=====================================
+ Hits           26     42   +16
Impacted Files Coverage Δ
html_text/html_text.py 100% <100%> (ø) :arrow_up:


codinguncut commented 7 years ago

In my fork github/fluquid/html-text I tried the following: return ' '.join(x for x in sel.xpath("//text()[normalize-space(.)]").extract() if x)

But it doesn't yet seem to work properly with runs of whitespace inside a string, or with newlines/tabs for that matter.
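
For example (a quick check with parsel; the markup is made up):

```python
from parsel import Selector

sel = Selector(text='<p>foo\n   bar<span>baz</span></p>')
print(repr(' '.join(x for x in sel.xpath('//text()[normalize-space(.)]').extract() if x)))
# -> 'foo\n   bar baz': the newline and extra spaces inside the first text
# node survive, because normalize-space() is only used as a predicate here
# and is never applied to the extracted strings themselves.
```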

Your code looks fine, but it is understandable that re.sub would add overhead.

Johannes

lopuhin commented 7 years ago

@codinguncut it seems that the main overhead comes not from re.sub, but from iterating over //text() results (although re.sub also has some overhead).

return ' '.join(x for x in sel.xpath("//text()[normalize-space(.)]").extract() if x)

That's an interesting approach, I'll check it out, thanks!

codinguncut commented 7 years ago

I'll try xpath('//*[normalize-space()]').

Also, both solutions will add spaces before commas, periods, etc. that directly follow tags: <a>mail</a>, and more. Can't be helped, I think.

kmike commented 7 years ago

To handle punctuation there is https://github.com/scrapinghub/webstruct/blob/5a3f39e2ec78a04ca021a12dff58f66686d86251/webstruct/utils.py#L61, but it may add even more overhead. It may be fine to provide it as an option though.

kmike commented 7 years ago

Ah, and it also removes all spaces before punctuation, not only those caused by joining, so maybe it is not the way to go.

lopuhin commented 7 years ago

> I'll try xpath('//*[normalize-space()]').

@codinguncut to be honest, I didn't understand this xpath - it just returns all elements with some text, right?

> To handle punctuation there is https://github.com/scrapinghub/webstruct/blob/5a3f39e2ec78a04ca021a12dff58f66686d86251/webstruct/utils.py#L61, but it may add even more overhead. It may be fine to provide it as an option though.

Nice suggestion, thanks @kmike! I implemented this as an option in e833357 (edit: f020f4b); it works only on tag boundaries, so only spaces caused by joining are affected. The overhead is not that big: for 1k pages the total time is 11.7 s vs 12.5 s (and 8.17 s vs 9.21 s when working on already parsed trees). Maybe it's OK to make it the default?
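
For reference, the cleanup is roughly of this kind (an over-simplified sketch, not the actual code: the real option only removes spaces that were added at tag boundaries, so pre-existing spaces before punctuation are left alone):

```python
import re

# Hypothetical simplification: drop whitespace immediately before punctuation.
_space_before_punct = re.compile(r'\s+([,:;.!?)\]])')

def fix_punctuation(text):
    return _space_before_punct.sub(r'\1', text)

print(fix_punctuation('mail , and more'))  # -> 'mail, and more'
```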

kmike commented 7 years ago

Nice! +1 to enable punctuation handling by default. There is also a simple micro-optimization trick: instead of writing

_trailing_whitespace = re.compile(r'\s$')
# ...
if _trailing_whitespace.search(...):

One can write this, to save an attribute lookup in a tight loop:

_has_trailing_whitespace = re.compile(r'\s$').search
# ...
if _has_trailing_whitespace(...):

lopuhin commented 7 years ago

Thanks @codinguncut and @kmike , merged with punctuation handling enabled by default.

codinguncut commented 7 years ago

Yes, I still don't 100% understand XPath syntax. I was hoping to find an equivalent of //text() for normalize-space(), but maybe the two are simply different kinds of functions.

kmike commented 7 years ago

Maybe @redapple can share his experience. I think this issue is very much related to https://github.com/scrapy/parsel/pull/34.

redapple commented 7 years ago

@codinguncut , [normalize-space()](https://www.w3.org/TR/xpath/#function-normalize-space) only applies some trimming...:

> whitespace normalized by stripping leading and trailing whitespace and replacing sequences of whitespace characters by a single space.

...on top of string(), which is itself a concatenation of descendant text nodes of the context node:

> The string-value of an element node is the concatenation of the string-values of all text node descendants of the element node in document order.

And indeed, normalize-space() is a string function, whereas //text() means /descendant-or-self::node()/text(), so it selects text nodes, and does not produce a string. Two different operations. lxml/parsel produces smart-strings out of text nodes, so they can be concatenated.
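
A quick way to see the difference (using parsel, as above; the markup is made up):

```python
from parsel import Selector

sel = Selector(text='<div><p>one  \n two</p><p>three</p></div>')

# normalize-space() is a string function: it takes the string-value of the
# context node (all descendant text nodes concatenated) and collapses whitespace.
print(sel.xpath('normalize-space()').extract_first())  # 'one twothree'

# //text() selects text nodes; no string conversion or normalization happens.
print(sel.xpath('//text()').extract())  # ['one  \n two', 'three']
```

Note that the first result also shows the joining problem this PR fixes: "two" and "three" are glued together because there is no whitespace between the two <p> elements.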

@lopuhin , @kmike , this may not be the right issue for my comment here as it may be more about doing html2text-ish transformation than plaintext, but what I found handy in the past was:

You can find some (ugly, pre-knowing-of-normalize-space) code in parslepy.

redapple commented 7 years ago

@codinguncut @lopuhin , you may also be interested in this answer I wrote some time ago on whitespace and XPath's normalize-space() vs. Python's strip(): https://stackoverflow.com/a/33829869/

lopuhin commented 7 years ago

@redapple that's really interesting and useful, thanks! I think we should also try to strip non-breaking spaces - 85% of the sample HTML pages have at least one non-breaking space in the extracted text.
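
To make the non-breaking space issue concrete (an illustrative snippet with made-up markup): XPath treats only space, tab, CR and LF as whitespace, so U+00A0 passes through normalize-space() untouched, while Python 3's str.strip() and the re module's \s do treat it as whitespace.

```python
from parsel import Selector

sel = Selector(text='<p>\xa0 price:\xa0100\xa0</p>')

text = sel.xpath('normalize-space()').extract_first()
print(repr(text))          # '\xa0 price:\xa0100\xa0' - the NBSPs survive normalize-space()
print(repr(text.strip()))  # 'price:\xa0100' - Python's strip() removes the leading/trailing ones
```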