adbar / trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
https://trafilatura.readthedocs.io
Apache License 2.0

In parallel trafilatura is marginally slower than goose #262

Closed: getorca closed this issue 1 year ago

getorca commented 2 years ago

I'm not quite sure where to begin with this; it's a strange one. In a real-world scenario I tried switching from Goose3 to Trafilatura. I'm processing HTML extractions in parallel with Dask. After switching to Trafilatura, I noticed a 30% slowdown, so I ended up writing my own evaluation library to verify the results.

Results from running in parallel:

| Library     | Accuracy | Precision | Recall | FScore | Mean Similarity | Items/sec |
|-------------|----------|-----------|--------|--------|-----------------|-----------|
| goose3      | 0.9678   | 0.8561    | 0.9547 | 0.9027 | 0.8343          | 383.4737  |
| trafilatura | 0.9124   | 0.9485    | 0.908  | 0.9278 | 0.8567          | 361.3232  |

Results from running sequentially:

| Library     | Accuracy | Precision | Recall | FScore | Mean Similarity | Items/sec |
|-------------|----------|-----------|--------|--------|-----------------|-----------|
| goose3      | 0.9678   | 0.8561    | 0.9547 | 0.9027 | 0.8343          | 9.7953    |
| trafilatura | 0.9124   | 0.9485    | 0.908  | 0.9278 | 0.8567          | 23.0045   |

Note: the dataset evaluated is from the scrapinghub/article-extraction-benchmark tool. The only portion of the code that runs in parallel for the benchmarks is the extraction, and only the extraction is timed when calculating items/sec.

In summary: trafilatura is marginally slower than Goose3 in parallel; however, sequentially it is more than twice as fast as Goose3.

I'm not sure where to begin with this, since parallel processing can be difficult to profile. It may be related to some of the memory-leak issues reported with trafilatura, although it appears those have been resolved. Or it could be the caching; I haven't looked into how that functions.

I will work on publishing my benchmarking tool this afternoon.

adbar commented 1 year ago

Hi @getorca, thanks for the evaluation!

First, you get different results than the Scrapinghub team; I assume it's because of more recent package versions? The Trafilatura results appear to be slightly degraded, and I wonder if it's a regression or just a different experimental setting on your side.

Second, I am not familiar with the way Dask parallelizes tasks. Your results are indeed odd, but I cannot explain the difference.

getorca commented 1 year ago


Hi @adbar, there was a mistake in the timings for parallel. Even so, trafilatura doesn't see anywhere near as much of an improvement from parallelism as goose3 does, and resiliparse is significantly slower. It's very odd; I'm trying to dig into why. For resiliparse, it might be related to its heavy use of Cython/C++.

I'm using Dask bags, which use multiprocessing. Dask is a high-level library on top of Python's multiprocessing/threading (I believe) that also builds an optimised DAG. The multiprocessing scheduler does add about 200 µs of overhead per task, and it has some issues with shared memory, which is what made me think of possible memory leaks. It's also fastest on pure Python objects, so I wonder if some of the libraries trafilatura imports are Cython/C++.
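For reference, a minimal sketch of the kind of Dask-bag pipeline described above (the HTML strings and partition count are illustrative placeholders, not the benchmark's actual inputs):

```python
import dask.bag as db
from trafilatura import extract

# Illustrative inputs; the real benchmark loads HTML documents from disk.
html_docs = [
    "<html><body><p>first example document</p></body></html>",
    "<html><body><p>second example document</p></body></html>",
]

# Build a bag, map the extraction over its partitions, and run it
# on the multiprocessing scheduler.
bag = db.from_sequence(html_docs, npartitions=2)
texts = bag.map(extract).compute(scheduler="processes")
print(texts)
```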

> First, you get different results than the Scrapinghub team; I assume it's because of more recent package versions? The Trafilatura results appear to be slightly degraded, and I wonder if it's a regression or just a different experimental setting on your side.

I use a method similar to Scrapinghub's "shingles" (n-grams) to compute accuracy, precision, and fscore, but with vectors from spaCy, closer to how the original Moz evaluations worked.
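As an illustration, document-level similarity with spaCy vectors can be computed along these lines (a rough sketch, not the benchmark's actual scoring code; the model name is an assumption):

```python
import spacy

# Assumes `python -m spacy download en_core_web_md` has been run;
# the _md model ships with word vectors, the _sm one does not.
nlp = spacy.load("en_core_web_md")

ground_truth = nlp("The quick brown fox jumps over the lazy dog.")
prediction = nlp("A quick brown fox jumped over a lazy dog.")

# Doc.similarity() returns the cosine similarity of the averaged
# token vectors of the two documents.
print(ground_truth.similarity(prediction))
```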

My benchmarking tool is available here: https://github.com/Nootka-io/wee-benchmarking-tool. I will work on some minimal samples to try to sort out the parallel "oddities".

adbar commented 1 year ago

@getorca Thanks for sharing!

I don't understand why newspaper3k is performing that well; it's not the case in the Scrapinghub benchmark, nor is it the case in any multilingual benchmark I've seen. My experience is that it is good for English but less so in other settings.

You may also want to use another Trafilatura function; I wrote a PR (https://github.com/Nootka-io/wee-benchmarking-tool/pull/1).

getorca commented 1 year ago

Interesting, I haven't looked at too many other benchmarks.

It appears that shingles as defined in Scrapinghub's library can produce either 1 or 4 false positives for a single incorrect token, depending on its location in the body of text, and the number of shingles is also always less than the number of tokens. Take a look at the minimal example below:

from wee_cli.evaluate import do_complex_scoring, scores_from_cm

"""
A demonstration of where the occurrence of an incorrect token when using shingles to calculate Precision and Recall,
 can lead to different score.
 There are also always 3 less shingles than tokens, as well as up 4x more false positives and 4x more false negatives.
"""

def get_shingles(tokens):
    # Build overlapping 4-token shingles; max(1, ...) guards against
    # token lists shorter than the shingle length.
    return [tuple(tokens[i:i+4]) for i in range(0, max(1, len(tokens) - 4 + 1))]

gt_tokens = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k']
pred_tokens_a = ['x', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k']  # incorrect token at the start
pred_tokens_b = ['a', 'b', 'c', 'x', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k']  # incorrect token mid-sequence

gt_shingles = get_shingles(gt_tokens)
pred_shingles_a = get_shingles(pred_tokens_a)
pred_shingles_b = get_shingles(pred_tokens_b)

# token confusion matrix A
t_a_cm = []
t_a_cm.append(do_complex_scoring(gt_tokens, pred_tokens_a))
print('** Score built with tokens on A**')
print(scores_from_cm(t_a_cm))

# ** Score built with tokens on A**
# {'accuracy': 1.0, 'precision': 0.9166666666666666, 'recall': 1.0, 'fscore': 0.9565217391304348}

# token confusion matrix B
t_b_cm = []
t_b_cm.append(do_complex_scoring(gt_tokens, pred_tokens_b))
print('** Score built with tokens on B **')
print(scores_from_cm(t_b_cm))

# ** Score built with tokens on B **
# {'accuracy': 1.0, 'precision': 0.9166666666666666, 'recall': 1.0, 'fscore': 0.9565217391304348}

# shingles confusion matrix A
a_cm = []
a_cm.append(do_complex_scoring(gt_shingles, pred_shingles_a))
print('** Score built with shingles on A **')
print(scores_from_cm(a_cm))

# ** Score built with shingles on A **
# {'accuracy': 1.0, 'precision': 0.8888888888888888, 'recall': 1.0, 'fscore': 0.9411764705882353}

# shingles confusion matrix B
b_cm = []
b_cm.append(do_complex_scoring(gt_shingles, pred_shingles_b))
print('** Score built with shingles on B **')
print(scores_from_cm(b_cm))

# ** Score built with shingles on B **
# {'accuracy': 0.625, 'precision': 0.5555555555555556, 'recall': 0.625, 'fscore': 0.5882352941176471}

For this reason, I'm not convinced shingles are better. The way you do it is interesting, but a bit hard to annotate. I think the best solution might be pulling all the text from the HTML and using it to calculate true negatives; that should provide more accuracy and better normalization when combined with the tokens.
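A rough sketch of that idea (illustrative names only, not code from the benchmark tool): treat every token in the raw HTML as the universe, so tokens that appear neither in the ground truth nor in the prediction count as true negatives.

```python
from lxml import html as lxml_html

def token_universe(raw_html: str) -> set:
    # All whitespace-separated tokens visible anywhere in the page.
    return set(lxml_html.fromstring(raw_html).text_content().split())

raw = "<html><body><nav>menu</nav><article>body text</article></body></html>"
universe = token_universe(raw)
gt_tokens = {"body", "text"}    # annotated article content
pred_tokens = {"body", "text"}  # extractor output

# Tokens the extractor correctly left out of the article body.
true_negatives = universe - gt_tokens - pred_tokens
print(true_negatives)  # {'menu'}
```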

And thanks, I'll take a look at the pull request shortly.

adbar commented 1 year ago

Thanks again for sharing! Yes, my annotation method takes time and cannot be extrapolated easily; besides, there are other ways to evaluate. But as you demonstrate, results of the shingles method can vary a lot.

getorca commented 1 year ago

No problem. I switched from a ratio of averages to an average of ratios, which appears to give better metrics and should mean outliers have less influence on them. Thanks for pointing out the issue there.
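To illustrate the difference with made-up numbers: a ratio of averages weights large documents more heavily, while an average of ratios gives every document equal weight, so a single large outlier moves it less.

```python
# Two documents: one large, one small (illustrative numbers only).
correct = [90, 5]   # correctly extracted tokens per document
total = [100, 10]   # total tokens per document

ratio_of_averages = sum(correct) / sum(total)
# (90 + 5) / (100 + 10) = 0.8636..., dominated by the large document

average_of_ratios = sum(c / t for c, t in zip(correct, total)) / len(total)
# (0.9 + 0.5) / 2 = 0.7, each document counts equally

print(ratio_of_averages, average_of_ratios)
```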

getorca commented 1 year ago

@adbar Good news: it seems this was related to the overhead from the way Dask serialises Python objects. https://github.com/chatnoir-eu/chatnoir-resiliparse/issues/23

I've switched to a Python multiprocessing pool, and trafilatura is a little over 2x as fast in parallel as in sequence, and it stays massively faster than goose3.
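For reference, the switch amounts to something like this minimal sketch (the inputs and worker count are illustrative placeholders, not the benchmark's actual setup):

```python
from multiprocessing import Pool
from trafilatura import extract

# Illustrative inputs; the real benchmark loads HTML documents from disk.
html_docs = [
    "<html><body><p>first example document</p></body></html>",
    "<html><body><p>second example document</p></body></html>",
]

if __name__ == "__main__":
    # Each worker process calls trafilatura.extract on one document.
    with Pool(processes=4) as pool:
        texts = pool.map(extract, html_docs)
    print(texts)
```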

I'll push an update to my benchmark tool this afternoon.

adbar commented 1 year ago

@getorca Very nice, does that mean we can close this issue now?

getorca commented 1 year ago

Yes, absolutely, go ahead