Closed: lopuhin closed this 7 years ago
Merging #2 into master will not change coverage. The diff coverage is 100%.

```
@@           Coverage Diff            @@
##           master      #2    +/-   ##
=======================================
  Coverage     100%    100%
=======================================
  Files           2       2
  Lines          26      42    +16
  Branches        1       6     +5
=======================================
+ Hits           26      42    +16
```

| Impacted Files | Coverage Δ | |
|---|---|---|
| html_text/html_text.py | 100% <100%> (ø) | :arrow_up: |

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Powered by Codecov. Last update c7ebb57...1fb2ec4. Read the comment docs.
In my fork github/fluquid/html-text I tried the following:

```python
return ' '.join(x for x in sel.xpath("//text()[normalize-space(.)]").extract() if x)
```

But it doesn't yet seem to work properly with within-string multi-whitespace, or newlines/tabs for that matter. Your code looks fine, but it's understandable that `re.sub` would add overhead.

Johannes
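For illustration, a minimal sketch of the problem, using stdlib `ElementTree` as a stand-in for parsel (an assumption; the project itself uses lxml/parsel). Joining raw text nodes keeps internal newlines and tabs, and the `re.sub` pass is the overhead being discussed:

```python
import re
import xml.etree.ElementTree as ET

html = "<div><p>Hello\n\t  multi   space</p><b>world</b></div>"
tree = ET.fromstring(html)

# Joining the raw text nodes keeps the internal newlines/tabs,
# which is why within-string whitespace survives the join
raw = ' '.join(t for t in tree.itertext() if t.strip())

# A re.sub pass collapses every whitespace run to a single space
clean = re.sub(r'\s+', ' ', raw).strip()
```
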
@codinguncut it seems that the main overhead comes not from `re.sub`, but from iterating over `//text()` results (although `re.sub` also has some overhead).
> `return ' '.join(x for x in sel.xpath("//text()[normalize-space(.)]").extract() if x)`
That's an interesting approach, I'll check it out, thanks!
I'll try `xpath('//*[normalize-space()]')`.
Also, both solutions will add spaces before commas, periods, etc. in the context of tags:

```html
<a>mail</a>, and more
```

Can't be helped, I think.
To handle punctuation there is https://github.com/scrapinghub/webstruct/blob/5a3f39e2ec78a04ca021a12dff58f66686d86251/webstruct/utils.py#L61, but it may add even more overhead. It may be fine to provide it as an option though.
Ah, and it also removes all spaces before punctuation, not only those caused by joining, so maybe it is not the way to go.
> I'll try `xpath('//*[normalize-space()]')`.

@codinguncut to be honest, I didn't understand this XPath - it just returns all elements with some text, right?
> To handle punctuation there is https://github.com/scrapinghub/webstruct/blob/5a3f39e2ec78a04ca021a12dff58f66686d86251/webstruct/utils.py#L61, but it may add even more overhead. It may be fine to provide it as an option though.
Nice suggestion, thanks @kmike! I implemented this as an option in e833357 (edit: f020f4b). It works only on tag boundaries, so only spaces caused by joining are affected. The overhead is not that huge: for 1k pages total time is 11.7 s vs 12.5 s (and 8.17 s vs 9.21 s when working on already parsed trees). Maybe it's ok to make it the default?
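Not the actual html_text code, but a sketch of this kind of punctuation-aware join (the `smart_join` name and the exact punctuation set are assumptions):

```python
import re

# Skip the joining space when the next chunk starts with
# punctuation, so '<a>mail</a>, and more' yields 'mail, and more'
# instead of 'mail , and more'
_starts_with_punct = re.compile(r'^[,.;:!?)\]]').search

def smart_join(chunks):
    out = []
    for chunk in chunks:
        if out and not _starts_with_punct(chunk):
            out.append(' ')
        out.append(chunk)
    return ''.join(out)
```

Because it only decides whether to emit the joining space, it acts on tag boundaries only and leaves whitespace inside text nodes alone.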
Nice! +1 to enable punctuation handling by default. There is also a simple micro-optimization trick: instead of writing

```python
_trailing_whitespace = re.compile(r'\s$')
# ...
if _trailing_whitespace.search(...):
```

one can write this, to save an attribute lookup in a tight loop:

```python
_has_trailing_whitespace = re.compile(r'\s$').search
# ...
if _has_trailing_whitespace(...):
```
Thanks @codinguncut and @kmike, merged with punctuation handling enabled by default.
Yes, I still don't 100% understand XPath syntax. I was hoping to find an equivalent of `//text()` for `normalize-space()`, but maybe the two are simply different types of functions.
Maybe @redapple can share his experience. I think this issue is very much related to https://github.com/scrapy/parsel/pull/34.
@codinguncut, [`normalize-space()`](https://www.w3.org/TR/xpath/#function-normalize-space) only applies some trimming...:

> whitespace normalized by stripping leading and trailing whitespace and replacing sequences of whitespace characters by a single space.

...on top of `string()`, which is itself a concatenation of descendant text nodes of the context node:

> The string-value of an element node is the concatenation of the string-values of all text node descendants of the element node in document order.

And indeed, `normalize-space()` is a string function, whereas `//text()` means `/descendant-or-self::node()/text()`, so it selects text nodes and does not produce a string. Two different operations.

lxml/parsel produces smart-strings out of text nodes, so they can be concatenated.
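The distinction can be mimicked with stdlib `ElementTree` (analogues only; ElementTree has no real XPath `normalize-space()`):

```python
import re
import xml.etree.ElementTree as ET

tree = ET.fromstring("<p>one\n  <b>two</b>\tthree</p>")

# //text() analogue: a node-set of separate text nodes
text_nodes = list(tree.itertext())

# string() analogue: descendant text nodes concatenated in document order
string_value = ''.join(text_nodes)

# normalize-space() analogue: strip the ends, collapse whitespace runs
normalized = re.sub(r'\s+', ' ', string_value).strip()
```
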
@lopuhin, @kmike, this may not be the right issue for my comment here, as it may be more about doing an html2text-ish transformation than plaintext, but what I found handy in the past was: `normalize-space()` gobbles up ... You can find some (ugly, pre-knowing-of-normalize-space) code in parslepy.
@codecov-io @lopuhin, you may also be interested in this answer I wrote some time ago on whitespace and XPath's `normalize-space()` vs. Python's `strip()`: https://stackoverflow.com/a/33829869/
@redapple that's really interesting and useful, thanks! I think we should also try to strip them - 85% of the sample html pages have at least one non-breaking space extracted.
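A quick check of the difference: XPath 1.0's `normalize-space()` only treats space, tab, CR and LF as whitespace, while Python 3's `\s` and `str.strip()` also cover the non-breaking space (U+00A0):

```python
import re

text = '\xa0price:\xa0100\xa0EUR\xa0'

# Python's \s matches the non-breaking space, which XPath's
# normalize-space() would leave untouched
cleaned = re.sub(r'\s+', ' ', text).strip()

# str.strip() with no arguments also removes \xa0
stripped = '\xa0hello\xa0'.strip()
```
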
Fixes #1 - see examples by @codinguncut there. Inline tags are commonly used as block tags, and the current `normalize-space()` results in unwanted joining of words - this branch fixes it by always adding whitespace between tags.

Checked old vs. new way on about 1000 html pages: on average the text is longer by 0.2% characters, with most pages having some difference. In all cases I checked (about 10 pages) the new way is better, correctly splitting words that were previously joined without spaces, and I didn't find any unwanted splits.
It is almost 2x slower though: 7 s for 1000 html pages before, 11.5 s without the regexp, 12.5 s with the regexp (and caching). But I guess it's not that bad.
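A sketch of the idea this branch tests (stdlib `ElementTree` standing in for lxml; `extract_text` is a hypothetical name): joining text nodes with a space inserts whitespace at every tag boundary, where plain `normalize-space(string())` would glue `<span>Hello</span><span>world</span>` into `Helloworld`:

```python
import re
import xml.etree.ElementTree as ET

def extract_text(html):
    tree = ET.fromstring(html)
    # ' '.join puts whitespace at every tag boundary; the re.sub
    # pass then collapses the duplicate whitespace this introduces
    return re.sub(r'\s+', ' ', ' '.join(tree.itertext())).strip()
```
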
@codinguncut @kmike I would appreciate your review :) I have some vague memory that in some cases `//text()` is not what we want, but I can't recall which, and I didn't see anything bad in the tests I did.