TeamHG-Memex / html-text

Extract text from HTML
MIT License

Add guess page layout #9

Closed Kebniss closed 6 years ago

Kebniss commented 6 years ago

I tested the new extraction algorithm on 1200 htmls and timed extraction performance:

Did the same on the old extraction algorithm:

HTML extraction speed improved by ~20%, while selector extraction improved by ~6%.

TODO:

Kebniss commented 6 years ago

The current implementation is ~20% faster than the previous one. I tested extraction on 5 different webpages with a lot of text (articles or forums): the average time for the old approach using selectors is 0.012s, while for the new one it is 0.010s.
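As a side note on the arithmetic, here is a minimal sketch of the two common ways to turn such averages into a percentage. The helper names are hypothetical and not part of html-text; only the 0.012s and 0.010s averages come from the comment above.

```python
def time_reduction_percent(old_seconds, new_seconds):
    """Percentage of the old running time that was saved."""
    return (old_seconds - new_seconds) / old_seconds * 100


def throughput_gain_percent(old_seconds, new_seconds):
    """Percentage increase in pages processed per unit of time."""
    return (old_seconds / new_seconds - 1) * 100


# Averages quoted in the comment above, in seconds.
old_avg, new_avg = 0.012, 0.010
print(round(time_reduction_percent(old_avg, new_avg), 1))
print(round(throughput_gain_percent(old_avg, new_avg), 1))
```

With these inputs the time saved comes to roughly 17% and the throughput gain to roughly 20%, so the ~20% figure appears to correspond to the second definition.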

kmike commented 6 years ago

Hey @Kebniss! Discussion about punctuation handling is here: https://github.com/TeamHG-Memex/html-text/pull/2.

1) It shouldn't depend on the guess_punct_space = False option, because this is how HTML works: if there are several whitespace characters or newlines inside an element, they're collapsed into one. There are some exceptions (<pre> elements, or overridden CSS styles), but the general rule is to collapse. See https://developer.mozilla.org/en-US/docs/Web/CSS/white-space.

2) Check https://github.com/TeamHG-Memex/html-text/blob/483686543aaf7d13fb6ce69100d7d6e9b4ba226d/html_text/html_text.py#L88 for the motivation. This heuristic produces better text in practice. Special handling of punctuation is added to make it work more reliably.

3) We're keeping the whitespace because browsers keep it.

So to me it looks like all the failures are real.
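To make points 1 and 2 concrete, here is a rough sketch of whitespace collapsing and of a punctuation-aware join. It is not the code at the linked line: the function names and the exact punctuation rules are assumptions chosen for illustration, and only the guess_punct_space name is taken from the discussion.

```python
import re

# ASCII whitespace that HTML collapses; non-breaking spaces (\xa0) are kept.
_HTML_WHITESPACE = re.compile(r"[ \t\n\r\f]+")
# Illustrative punctuation rules (assumed, not copied from html-text).
_STARTS_WITH_PUNCT = re.compile(r'^[,:;.!?")]')
_ENDS_WITH_OPEN_BRACKET = re.compile(r"\($")


def collapse_whitespace(text):
    """Collapse each run of HTML whitespace into a single space."""
    return _HTML_WHITESPACE.sub(" ", text).strip(" \t\n\r\f")


def join_chunks(chunks, guess_punct_space=True):
    """Join text chunks with spaces, optionally skipping the space
    before closing punctuation and after an opening bracket."""
    result = []
    for chunk in chunks:
        chunk = collapse_whitespace(chunk)
        if not chunk:
            continue  # whitespace-only chunks contribute nothing
        if result:
            previous = result[-1]
            skip_space = guess_punct_space and (
                _STARTS_WITH_PUNCT.search(chunk)
                or _ENDS_WITH_OPEN_BRACKET.search(previous)
            )
            result.append("" if skip_space else " ")
        result.append(chunk)
    return "".join(result)


chunks = ["Hello", ", world", "(", "see  \n below", ")"]
print(join_chunks(chunks))
print(join_chunks(chunks, guess_punct_space=False))
```

In this sketch the collapsing step runs regardless of the flag, which mirrors the argument in point 1; the flag only controls whether a space is placed next to punctuation.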

codecov-io commented 6 years ago

Codecov Report

Merging #9 into master will decrease coverage by 1.94%. The diff coverage is 98.73%.


@@            Coverage Diff             @@
##           master       #9      +/-   ##
==========================================
- Coverage     100%   98.05%   -1.95%     
==========================================
  Files           2        2              
  Lines          42      103      +61     
  Branches        6       28      +22     
==========================================
+ Hits           42      101      +59     
- Misses          0        2       +2

| Impacted Files | Coverage Δ | |
|---|---|---|
| html_text/__init__.py | 100% <100%> (ø) | :arrow_up: |
| html_text/html_text.py | 98.03% <98.71%> (-1.97%) | :arrow_down: |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 4836865...05b979a.

lopuhin commented 6 years ago

This might be unrelated, but I wonder whether we add a space in place of a <wbr> tag: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/wbr. Ideally we shouldn't (currently html-text does add a space).
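A small standard-library sketch of the behaviour being asked for, assuming a deliberately naive extractor that puts a space around every tag except <wbr>. The class and function names are hypothetical, and this is not how html-text is implemented.

```python
import re
from html.parser import HTMLParser

_HTML_WHITESPACE = re.compile(r"[ \t\n\r\f]+")


class NaiveTextExtractor(HTMLParser):
    """Collect text, adding a space at every tag except those listed."""

    # Tags that must not introduce a space (assumed set, for illustration).
    NO_SPACE_TAGS = {"wbr"}

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.NO_SPACE_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag not in self.NO_SPACE_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def naive_extract_text(html):
    parser = NaiveTextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    text = "".join(parser.parts)
    return _HTML_WHITESPACE.sub(" ", text).strip(" \t\n\r\f")


print(naive_extract_text("<p>super<wbr>califragilistic</p><p>next</p>"))
```

The word split by <wbr> stays in one piece, while the two paragraphs are still separated by a space.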

kmike commented 6 years ago

I haven't checked everything, but a question about benchmarks: why is selector_to_text slower than extract_text? I expected it to be the other way around, as selector_to_text shouldn't parse html.

Kebniss commented 6 years ago

In the selector benchmark I included the time to create the selector. Without it, the average execution time is 0.00372 for the new implementation and 0.00821 for the old one. In this case, this implementation is ~55% faster than master.
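A minimal sketch of how the two measurements can be separated, using only the standard library. The benchmark script itself is not shown in this thread, so `build` and `extract` are hypothetical stand-ins for creating a selector and for calling selector_to_text on it.

```python
import time


def average_times(documents, build, extract):
    """Return (average seconds including build, average seconds for extract only)."""
    if not documents:
        raise ValueError("need at least one document")
    with_build = 0.0
    extract_only = 0.0
    for doc in documents:
        start = time.perf_counter()
        parsed = build(doc)            # e.g. creating the selector
        built = time.perf_counter()
        extract(parsed)                # e.g. extracting text from it
        done = time.perf_counter()
        with_build += done - start
        extract_only += done - built
    count = len(documents)
    return with_build / count, extract_only / count


# Stand-in callables so the sketch runs without third-party packages.
docs = ["<p>first  page</p>", "<p>second  page</p>"]
total_avg, extract_avg = average_times(docs, str.split, " ".join)
print(f"with build: {total_avg:.6f}s, extract only: {extract_avg:.6f}s")
```

Reporting both numbers side by side makes it clear whether a difference comes from parsing or from the extraction step itself.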

Kebniss commented 6 years ago

> This might be unrelated, but I wonder whether we add a space in place of a <wbr> tag: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/wbr. Ideally we shouldn't (currently html-text does add a space).

I can add this in a follow-up PR.

lopuhin commented 6 years ago

> I can add this in a follow-up PR.

Thanks, that would be great, as long as it doesn't blow up the scope :)