adbar / htmldate

Fast and robust date extraction from web pages, with Python or on the command-line
https://htmldate.readthedocs.io
Apache License 2.0

Add new test cases including more global stories #29

Closed: rahulbot closed 3 years ago

rahulbot commented 3 years ago

This adds a new set of test cases based on a global random sample of 200 articles from the Media Cloud dataset (related to #8). We currently use our own date_guesser library and are evaluating switching to htmldate.

This new corpus includes 200 articles discovered via a search of stories from 2020 in the Media Cloud corpus. The set of countries of origin, and languages, is representative of the ~60k sources we ingest from every day.

The htmldate code still performs well against this new test corpus:

Name                    Precision    Recall    Accuracy    F-score
--------------------  -----------  --------  ----------  ---------
htmldate extensive       0.755102  0.973684       0.74    0.850575
htmldate fast            0.769663  0.861635       0.685   0.813056
newspaper                0.671141  0.662252       0.5     0.666667
newsplease               0.736527  0.788462       0.615   0.76161
articledateextractor     0.72973   0.675          0.54    0.701299
date_guesser             0.686567  0.582278       0.46    0.630137
goose                    0.75      0.508772       0.435   0.606272

A few notes:

We hope this contribution helps document the performance of the various libraries against a more global dataset.

codecov-commenter commented 3 years ago

Codecov Report

Merging #29 (d85f893) into master (d6d34d3) will not change coverage. The diff coverage is n/a.

:exclamation: Current head d85f893 differs from pull request most recent head 628d623. Consider uploading reports for the commit 628d623 to get more accurate results.

@@           Coverage Diff           @@
##           master      #29   +/-   ##
=======================================
  Coverage   92.36%   92.36%           
=======================================
  Files           7        7           
  Lines         943      943           
=======================================
  Hits          871      871           
  Misses         72       72           

Continue to review full report at Codecov.


adbar commented 3 years ago

Hi @rahulbot, thank you for your work! I'll go through the code first to see if the pull request can be accepted as is.

I'm glad you tried date extraction on a more diverse dataset and still consider switching to htmldate! Is there a reason why you didn't measure execution speed?

adbar commented 3 years ago

Hi @rahulbot @coreydockser,

Just to be clear: I'd like to accept your pull request, but it would be better to keep a rough comparison of execution speed. You deleted it in the new version of comparison.py.

Could you please time the execution and add a ratio as originally implemented, or could you keep the old comparison as a legacy file?
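The timing ratio requested here can be sketched with the standard library; `time_extractor` and the stand-in lambdas below are illustrative placeholders, not the actual names or extractors in comparison.py:

```python
import time

def time_extractor(extract, pages):
    # run one extraction function over all test pages, return elapsed seconds
    start = time.perf_counter()
    for html in pages:
        extract(html)
    return time.perf_counter() - start

# illustrative stand-ins for the real extraction functions
pages = ["<html><body>sample page</body></html>"] * 1000
baseline = time_extractor(lambda h: h.lower(), pages)
candidate = time_extractor(lambda h: h.lower() * 2, pages)
print(f"{candidate / baseline:.2f}x relative to baseline")
```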

rahulbot commented 3 years ago

Yes - @coreydockser is working on adding the execution speed measurement back in. We should be able to update this PR once he has fixed that.

rahulbot commented 3 years ago

New results, with timing against our global story test set:

Name                    Precision    Recall    Accuracy    F-score  Time (Relative to htmldate extensive)    Time (Relative to htmldate fast)
--------------------  -----------  --------  ----------  ---------  ---------------------------------------  ----------------------------------
htmldate extensive          0.753     0.935       0.715   0.833819  1.00x                                    2.02x
htmldate fast               0.763     0.830       0.660   0.795181  0.50x                                    1.00x
newspaper                   0.671     0.662       0.500   0.666667  4.61x                                    9.30x
newsplease                  0.737     0.788       0.615   0.76161   8.40x                                    16.93x
articledateextractor        0.730     0.675       0.540   0.701299  1.29x                                    2.61x
date_guesser                0.687     0.582       0.460   0.630137  5.29x                                    10.67x
goose                       0.746     0.497       0.425   0.596491  3.20x                                    6.45x

adbar commented 3 years ago

Hi @rahulbot @coreydockser,

Thanks for the changes! There is only one question left: should the two datasets be merged into one?

I believe so; what do you think? We could do the merge in this PR as well, so feel free to implement it :)

rahulbot commented 3 years ago

Sure - we wanted to get your opinion on that, so we left it open as a possible path forward. @coreydockser, perhaps you could have an array of files, then load and merge all of them before running the tests? That would easily allow for multiple test sets while still acknowledging that they came from different sources, i.e. something like this pseudo-code:

import json

eval_files = [
    "eval_mediacloud_2020.json",  # 200 random stories from Media Cloud 2020
    "eval_default",               # original, mostly German test set
]

EVAL_PAGES = []
for f in eval_files:
    # load the file's stories and merge them into EVAL_PAGES
    with open(f, "r", encoding="utf-8") as fh:
        EVAL_PAGES.extend(json.load(fh))
# EVAL_PAGES is now a single list of all the test story data
# from the files in eval_files combined

adbar commented 3 years ago

@coreydockser @rahulbot Alright, thanks!