Closed Kebniss closed 6 years ago
The current implementation is ~20% faster than the previous one. I tested extraction on 5 different webpages with a lot of text (articles or forums): the average time is 0.012s for the old approach using selectors and 0.010s for the new one.
Hey @Kebniss! Discussion about punctuation handling is here: https://github.com/TeamHG-Memex/html-text/pull/2.
1) It shouldn't depend on the guess_punct_space = False option, because this is how HTML works: if there are several whitespace characters or newlines inside an element, they're collapsed to one. There are some exceptions (<pre> elements, or overridden CSS styles), but the general rule is to collapse. See https://developer.mozilla.org/en-US/docs/Web/CSS/white-space.
2) Check https://github.com/TeamHG-Memex/html-text/blob/483686543aaf7d13fb6ce69100d7d6e9b4ba226d/html_text/html_text.py#L88 for the motivation. This heuristic produces better text in practice. Special handling of punctuation is added to make it work more reliably.
3) We're keeping the whitespace because browsers keep it.
So for me it looks like all failures are real.
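To make points 1) and 2) concrete, here is a minimal sketch of the two behaviours: whitespace is always collapsed, and the punctuation heuristic only decides whether a space is placed between adjacent chunks. The names collapse_whitespace and join_chunks and the exact punctuation set are assumptions for illustration, not the code in html_text.py.

```python
import re

_WHITESPACE = re.compile(r"\s+")
# Assumed punctuation set: no space is inserted before these characters.
_PUNCT_AFTER = re.compile(r'^[,:;.!?")]')
# No space is inserted right after an opening bracket.
_OPEN_BRACKET_BEFORE = re.compile(r"\($")


def collapse_whitespace(text):
    """Collapse runs of spaces, tabs and newlines into one space."""
    return _WHITESPACE.sub(" ", text).strip()


def join_chunks(chunks, guess_punct_space=True):
    """Join text chunks with spaces, optionally skipping spaces near punctuation."""
    result = ""
    for chunk in chunks:
        # Collapsing happens regardless of guess_punct_space.
        chunk = collapse_whitespace(chunk)
        if not chunk:
            continue
        needs_space = bool(result)
        if guess_punct_space and needs_space:
            if _PUNCT_AFTER.search(chunk) or _OPEN_BRACKET_BEFORE.search(result):
                needs_space = False
        result += (" " if needs_space else "") + chunk
    return result
```

With guess_punct_space=False the chunks are still collapsed, only the punctuation-aware spacing is switched off.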
Merging #9 into master will decrease coverage by 1.94%. The diff coverage is 98.73%.
@@ Coverage Diff @@
## master #9 +/- ##
==========================================
- Coverage 100% 98.05% -1.95%
==========================================
Files 2 2
Lines 42 103 +61
Branches 6 28 +22
==========================================
+ Hits 42 101 +59
- Misses 0 2 +2
Impacted Files | Coverage Δ | |
---|---|---|
html_text/__init__.py | 100% <100%> (ø) | :arrow_up: |
html_text/html_text.py | 98.03% <98.71%> (-1.97%) | :arrow_down: |
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4836865...05b979a.
This might be unrelated, but I wonder if we add a space in place of a <wbr> tag: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/wbr - ideally we shouldn't (currently html-text does add a space).
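A rough sketch of the behaviour I have in mind, using only the standard library's HTMLParser as a stand-in: text chunks are joined with a space, except that text following a <wbr> is glued to the previous chunk. TextExtractor and extract are hypothetical names, and this is not how html-text is implemented.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text chunks, gluing the text after <wbr> to the previous chunk."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._glue_next = False  # True right after a <wbr> tag

    def handle_starttag(self, tag, attrs):
        if tag == "wbr":
            self._glue_next = True

    def handle_data(self, data):
        if self._glue_next and self.parts:
            # <wbr> is only a line-break opportunity, so no space is added.
            self.parts[-1] += data
        else:
            self.parts.append(data)
        self._glue_next = False


def extract(html):
    """Return the text of an HTML fragment with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())
```

Other tag boundaries still produce a space, so only the <wbr> case changes.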
I haven't checked everything, but a question about benchmarks: why is selector_to_text slower than extract_text? I expected it to be the other way around, as selector_to_text shouldn't parse html.
In the benchmark for the selector I included the time to create the selector. Without it, the average execution time is 0.00372 for the new implementation and 0.00821 for the old one. In this case this implementation is ~55% faster than master.
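For reference, a minimal sketch of how setup time can be kept out of the measurement, and how the ~55% figure follows from the two averages. The parse and to_text functions are stand-ins built on the standard library (assumptions for illustration), not html-text or its selectors.

```python
import time
from html.parser import HTMLParser


class _Collector(HTMLParser):
    """Stand-in parser that just records text nodes."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def parse(html):
    """Stand-in for the setup step (e.g. creating a selector)."""
    collector = _Collector()
    collector.feed(html)
    collector.close()
    return collector.chunks


def to_text(chunks):
    """Stand-in for the extraction step that is being benchmarked."""
    return " ".join(" ".join(chunks).split())


def average_time(func, inputs, repeats=5):
    """Average wall-clock seconds per call of func over all inputs."""
    start = time.perf_counter()
    for _ in range(repeats):
        for item in inputs:
            func(item)
    return (time.perf_counter() - start) / (repeats * len(inputs))


def percent_faster(old, new):
    """Reduction in time per call, as a percentage of the old time."""
    return (old - new) / old * 100


pages = ["<p>Hello,   <b>world</b>!</p>" * 200] * 5
# Setup included in the timed region:
with_setup = average_time(lambda page: to_text(parse(page)), pages)
# Setup done once up front, only extraction is timed:
parsed = [parse(page) for page in pages]
without_setup = average_time(to_text, parsed)

# The averages quoted above: 0.00821 (old) vs 0.00372 (new).
speedup = percent_faster(0.00821, 0.00372)
```

Here speedup comes out just under 55, which is where the rounded figure comes from.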
> This might be unrelated, but I wonder if we add a space in place of a <wbr> tag: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/wbr - ideally we shouldn't (currently html-text does add a space).
I can add this in the following pr
> I can add this in the following pr
Thanks, that would be great if this does not blow up the scope :)
I tested the new extraction algorithm on 1200 HTML pages and timed these calls:
extract_text(html, guess_page_layout=True, guess_punct_space=True)
html_text.selector_to_text(sel, guess_punct_space=True, guess_page_layout=True)
I did the same with the old extraction algorithm:
extract_text(html, guess_punct_space=True)
text = html_text.selector_to_text(sel, guess_punct_space=True)
HTML extraction speed has improved by ~20%, while selector extraction has improved by ~6%.
TODO: