Kebniss commented 6 years ago

I tested the new extraction algorithm on 1200 HTML documents and timed extraction performance:

extract_text(html, guess_page_layout=True, guess_punct_space=True):
- avg extraction time per html: 0.00769ms
- fastest: 0.00006ms
- slowest: 0.07818ms
html_text.selector_to_text(sel, guess_punct_space=True, guess_page_layout=True):
- avg extraction time per html: 0.01368ms
- fastest: 0.00012ms
- slowest: 0.12133ms

Did the same on the old extraction algorithm:

extract_text(html, guess_punct_space=True):
- avg extraction time per html: 0.00941ms
- fastest: 0.00011ms
- slowest: 0.15546ms
html_text.selector_to_text(sel, guess_punct_space=True):
- avg extraction time per html: 0.01445ms
- fastest: 0.00016ms
- slowest: 0.12790ms

HTML extraction is ~20% faster, while selector extraction is ~6% faster.

TODO:

[x] add multiple newlines
[x] add backward compatibility with selectors
[x] handle more tags if guess_page_layout=True
[x] add tests on real webpages with guess_page_layout=True
[x] handle nested tags without text when guess_page_layout=True
[x] update readme
[x] add performance evaluation
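
For reference, per-document timings like the ones above can be collected with a small harness along these lines. This is only a sketch: `strip_tags` is a hypothetical stand-in extractor so the snippet runs with the standard library alone, and `benchmark` is an illustrative name, not part of html-text. To measure html-text itself, pass the real extraction call instead.

```python
import re
import time


def strip_tags(html):
    # Stand-in extractor (hypothetical): drops tags and collapses whitespace.
    # Swap in the real extraction call when benchmarking html-text itself.
    return re.sub(r"\s+", " ", re.sub(r"<[^>]+>", " ", html)).strip()


def benchmark(extract, documents):
    """Time one extract(doc) call per document; return a summary in seconds."""
    timings = []
    for doc in documents:
        start = time.perf_counter()
        extract(doc)
        timings.append(time.perf_counter() - start)
    return {
        "avg": sum(timings) / len(timings),
        "fastest": min(timings),
        "slowest": max(timings),
    }


docs = ["<p>Hello, <b>world</b>!</p>"] * 100
stats = benchmark(strip_tags, docs)
print("avg {avg:.5f}s, fastest {fastest:.5f}s, slowest {slowest:.5f}s".format(**stats))
```

Running both implementations over the same document list with the same harness keeps the comparison fair.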

Kebniss commented 6 years ago

The current implementation is ~20% faster than the previous one. I tested extraction on 5 different webpages with a lot of text (articles or forums); the average time using selectors is 0.012s for the old approach and 0.010s for the new one.

kmike commented 6 years ago

Hey @Kebniss! Discussion about punctuation handling is here: https://github.com/TeamHG-Memex/html-text/pull/2.

1) It shouldn't depend on the guess_punct_space=False option, because this is how HTML works: if there are several whitespace characters or newlines inside an element, they're collapsed to one. There are some exceptions (<pre> elements, or overridden CSS styles), but the general rule is to collapse. See https://developer.mozilla.org/en-US/docs/Web/CSS/white-space.

2) Check https://github.com/TeamHG-Memex/html-text/blob/483686543aaf7d13fb6ce69100d7d6e9b4ba226d/html_text/html_text.py#L88 for the motivation. This heuristic produces better text in practice. Special handling of punctuation is added to make it work more reliably.

3) We keep the whitespace because browsers keep it.

So for me it looks like all failures are real.
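
The collapsing rule from point 1 can be sketched as follows. This is a simplified illustration, not html-text's actual code: `collapse_whitespace` is a hypothetical helper, and it ignores the exceptions mentioned above (<pre> elements and overridden CSS).

```python
import re

# ASCII whitespace only: non-breaking spaces are deliberately left alone,
# since browsers do not collapse them.
_WHITESPACE = re.compile(r"[ \t\n\r\f]+")


def collapse_whitespace(text):
    # Any run of spaces, tabs or newlines becomes a single space,
    # mirroring the default CSS behaviour (white-space: normal).
    return _WHITESPACE.sub(" ", text).strip(" \t\n\r\f")


print(collapse_whitespace("Hello,\n\n   world  !"))  # Hello, world !
```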

codecov-io commented 6 years ago

Codecov Report

Merging #9 into master will decrease coverage by 1.94%. The diff coverage is 98.73%.

@@            Coverage Diff             @@
##           master       #9      +/-   ##
==========================================
- Coverage     100%   98.05%   -1.95%     
==========================================
  Files           2        2              
  Lines          42      103      +61     
  Branches        6       28      +22     
==========================================
+ Hits           42      101      +59     
- Misses          0        2       +2

Impacted Files	Coverage Δ
html_text/__init__.py	`100% <100%> (ø)`	:arrow_up:
html_text/html_text.py	`98.03% <98.71%> (-1.97%)`	:arrow_down:


lopuhin commented 6 years ago

This might be unrelated, but I wonder whether we add a space in place of a <wbr> tag: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/wbr - ideally we shouldn't (currently html-text does add a space).
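
To illustrate the desired behaviour, here is a toy extractor built on the standard library's `html.parser`. It is a sketch only, not how html-text is implemented: `TextExtractor`, `NO_SPACE_TAGS` and `to_text` are hypothetical names. Every tag boundary adds a space except <wbr>, which marks a line-break opportunity and no whitespace.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    # Tags that should not introduce whitespace in the output (assumed set).
    NO_SPACE_TAGS = {"wbr"}

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.NO_SPACE_TAGS:
            self.chunks.append(" ")

    def handle_endtag(self, tag):
        if tag not in self.NO_SPACE_TAGS:
            self.chunks.append(" ")

    def handle_data(self, data):
        self.chunks.append(data)


def to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the pieces, then collapse runs of whitespace into single spaces.
    return " ".join("".join(parser.chunks).split())


print(to_text("<p>super<wbr>cali<wbr>fragilistic</p><p>next</p>"))
# supercalifragilistic next
```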

kmike commented 6 years ago

I haven't checked everything, but a question about benchmarks: why is selector_to_text slower than extract_text? I expected it to be the other way around, as selector_to_text shouldn't parse html.

Kebniss commented 6 years ago

In the selector benchmark I included the time to create the selector. Without it, the average execution time is 0.00372 for the new implementation versus 0.00821 for the old one, so in this case the new implementation is ~55% faster than master.
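
The ~55% figure matches the fraction of the old running time that is saved. A quick check, assuming that definition of "faster" (`time_reduction` is an illustrative helper name):

```python
def time_reduction(old, new):
    # Fraction of the old running time that the new implementation saves.
    return (old - new) / old


# Averages quoted in this comment (units as reported).
print(round(100 * time_reduction(0.00821, 0.00372), 1))  # 54.7
```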

Kebniss commented 6 years ago

> This might be unrelated, but I wonder whether we add a space in place of a <wbr> tag: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/wbr - ideally we shouldn't (currently html-text does add a space).

I can add this in a follow-up PR.

lopuhin commented 6 years ago

> I can add this in a follow-up PR.

Thanks, that would be great if this does not blow up the scope :)

TeamHG-Memex / html-text

Add guess page layout #9
