adbar / trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
https://trafilatura.readthedocs.io
Apache License 2.0

Re-running article extraction benchmark #156

Closed vprelovac closed 2 years ago

vprelovac commented 2 years ago

In the last benchmark (https://github.com/scrapinghub/article-extraction-benchmark#open-source-libraries), Trafilatura scored well, but not best. That was with version 0.5.1. I am wondering if you could re-run the benchmark with the latest version.

vprelovac commented 2 years ago

I re-ran them myself! trafilatura, goose3, dragnet and news-please were updated to their latest versions.

Looking at the F1 score, trafilatura ranks third overall, but it was the fastest of the four I tried. Good job!

AutoExtract      precision=0.984 ± 0.003  recall=0.956 ± 0.010  F1=0.970 ± 0.005  accuracy=0.470 ± 0.037
Diffbot          precision=0.958 ± 0.009  recall=0.944 ± 0.013  F1=0.951 ± 0.010  accuracy=0.348 ± 0.035
beautifulsoup    precision=0.499 ± 0.017  recall=0.994 ± 0.001  F1=0.665 ± 0.015  accuracy=0.000 ± 0.000
boilerpipe       precision=0.850 ± 0.016  recall=0.870 ± 0.020  F1=0.860 ± 0.016  accuracy=0.006 ± 0.005
dragnet          precision=0.925 ± 0.012  recall=0.889 ± 0.018  F1=0.907 ± 0.013  accuracy=0.221 ± 0.030
go_domdistiller  precision=0.901 ± 0.010  recall=0.956 ± 0.010  F1=0.927 ± 0.007  accuracy=0.066 ± 0.018
go_readability   precision=0.912 ± 0.009  recall=0.975 ± 0.006  F1=0.943 ± 0.007  accuracy=0.210 ± 0.030
goose3           precision=0.934 ± 0.014  recall=0.847 ± 0.020  F1=0.889 ± 0.016  accuracy=0.227 ± 0.032
html-text        precision=0.500 ± 0.018  recall=0.994 ± 0.001  F1=0.665 ± 0.016  accuracy=0.000 ± 0.000
html2text        precision=0.499 ± 0.017  recall=0.983 ± 0.002  F1=0.662 ± 0.015  accuracy=0.000 ± 0.000
inscriptis       precision=0.517 ± 0.018  recall=0.993 ± 0.001  F1=0.679 ± 0.016  accuracy=0.000 ± 0.000
justext          precision=0.858 ± 0.016  recall=0.754 ± 0.028  F1=0.802 ± 0.018  accuracy=0.088 ± 0.021
news_please      precision=0.917 ± 0.013  recall=0.906 ± 0.018  F1=0.911 ± 0.014  accuracy=0.249 ± 0.032
newspaper        precision=0.917 ± 0.013  recall=0.906 ± 0.017  F1=0.912 ± 0.014  accuracy=0.260 ± 0.032
readability      precision=0.913 ± 0.014  recall=0.931 ± 0.015  F1=0.922 ± 0.013  accuracy=0.315 ± 0.035
readability_js   precision=0.853 ± 0.013  recall=0.924 ± 0.012  F1=0.887 ± 0.012  accuracy=0.149 ± 0.027
trafilatura      precision=0.924 ± 0.012  recall=0.968 ± 0.008  F1=0.946 ± 0.009  accuracy=0.271 ± 0.033
xpath-text       precision=0.246 ± 0.015  recall=0.992 ± 0.001  F1=0.394 ± 0.020  accuracy=0.000 ± 0.000
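To make the "third overall" reading easy to check, here is a minimal sketch that ranks the five highest-scoring entries by their reported F1 and compares each with the harmonic mean of its reported precision and recall. The numbers are copied from the results above; the `f1` helper and the `reported` dictionary are illustrative names, not part of the benchmark code. The harmonic mean of the aggregate values is only an approximation: the benchmark may compute F1 differently (for example by averaging per document), which would explain small differences in the third decimal.

```python
# (precision, recall, reported F1) for the five best F1 scores in the results above.
reported = {
    "AutoExtract": (0.984, 0.956, 0.970),
    "Diffbot": (0.958, 0.944, 0.951),
    "trafilatura": (0.924, 0.968, 0.946),
    "go_readability": (0.912, 0.975, 0.943),
    "go_domdistiller": (0.901, 0.956, 0.927),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall (illustrative helper)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rank libraries by the F1 value reported in the benchmark output.
ranking = sorted(reported, key=lambda name: reported[name][2], reverse=True)

for rank, name in enumerate(ranking, start=1):
    p, r, f = reported[name]
    print(f"{rank}. {name}: reported F1={f:.3f}, harmonic mean={f1(p, r):.3f}")
```

Of the entries ranked above trafilatura, both are the commercial services AutoExtract and Diffbot, so trafilatura has the highest reported F1 among the open-source libraries in this run.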

adbar commented 2 years ago

Hi @vprelovac, thanks for your interest. Nice results indeed IMO, especially considering that the first two (AutoExtract and Diffbot) are not freely available.

I have notified the makers of the benchmark and suggested testing it with newer versions.