rahulbot closed this pull request 3 years ago.
Merging #29 (d85f893) into master (d6d34d3) will not change coverage. The diff coverage is n/a.
:exclamation: Current head d85f893 differs from pull request most recent head 628d623. Consider uploading reports for the commit 628d623 to get more accurate results.
@@ Coverage Diff @@
## master #29 +/- ##
=======================================
Coverage 92.36% 92.36%
=======================================
Files 7 7
Lines 943 943
=======================================
Hits 871 871
Misses 72 72
Powered by Codecov. Last update d6d34d3...628d623.
Hi @rahulbot, thank you for your work! I'll go through the code first to see if the pull request can be accepted as is.
I'm glad you tried date extraction on a more diverse dataset and still consider switching to htmldate! Is there a reason why you didn't measure execution speed?
Hi @rahulbot @coreydockser,
Just to be clear: I'd like to accept your pull request, but it would be better to keep a rough comparison of execution speed. You deleted it in the new version of comparison.py.
Could you please time the execution and add a ratio as originally implemented, or keep the old comparison as a legacy file?
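For reference, a rough harness along these lines could time each extractor and report a ratio against a baseline. This is only a sketch; the function name, sample data, and structure here are illustrative, not the actual comparison.py implementation:

```python
import time

def time_extractor(extract, htmls):
    """Return total seconds spent running `extract` over all cached pages."""
    start = time.time()
    for html in htmls:
        try:
            extract(html)
        except Exception:
            pass  # a failed extraction still counts toward elapsed time
    return time.time() - start

# Illustrative usage: time two extractors on the same pages, then report
# one as a ratio of the other (e.g. relative to htmldate extensive).
htmls = ["<html><body>Posted 2020-01-01</body></html>"] * 100
baseline = time_extractor(lambda h: h.find("2020"), htmls)
other = time_extractor(lambda h: sorted(h), htmls)
ratio = other / baseline if baseline > 0 else float("inf")
```

The try/except keeps a crash in one library from aborting the whole benchmark, which matters when comparing several third-party extractors on messy real-world HTML.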
Yes - @coreydockser is working on adding the execution speed measurement back in. We should be able to update this PR once he has fixed that.
New results, with timing against our global story test set:
Name Precision Recall Accuracy F-score Time (Relative to htmldate extensive) Time (Relative to htmldate fast)
-------------------- ----------- -------- ---------- --------- --------------------------------------- ----------------------------------
htmldate extensive 0.753 0.935 0.715 0.833819 1.00x 2.02x
htmldate fast 0.763 0.830 0.660 0.795181 0.50x 1.00x
newspaper 0.671 0.662 0.500 0.666667 4.61x 9.30x
newsplease 0.737 0.788 0.615 0.76161 8.40x 16.93x
articledateextractor 0.730 0.675 0.540 0.701299 1.29x 2.61x
date_guesser 0.687 0.582 0.460 0.630137 5.29x 10.67x
goose 0.746 0.497 0.425 0.596491 3.20x 6.45x
Hi @rahulbot @coreydockser,
Thanks for the changes! There is only one question left: should the two datasets be merged into one?
I believe so, what do you think? We could do the merge in this PR as well, feel free to implement it :)
Sure - we wanted to get your opinion on that, so we left it open as a possible path forward. @coreydockser perhaps you could have an array of files, then load and merge all of them before running tests? That would be a solution that easily allows for multiple test sets but still acknowledges that they came from different sources, i.e. something like this pseudo-code:
import json

eval_files = [
    "eval_mediacloud_2020.json",  # 200 random stories from Media Cloud 2020
    "eval_default"                # original, mostly German test set
]

EVAL_PAGES = []
for f in eval_files:
    # load the file's stories and merge them into EVAL_PAGES
    with open(f) as infile:
        EVAL_PAGES.extend(json.load(infile))
# now EVAL_PAGES is a single list of all the test story data from `eval_files` combined
@coreydockser @rahulbot Alright, thanks!
This adds a new set of test cases based on a global random sample of 200 articles from the Media Cloud dataset (related to #8). We currently use our own date_guesser library and are evaluating switching to htmldate.

This new corpus includes 200 articles discovered via a search of stories from 2020 in the Media Cloud corpus. The set of countries of origin, and languages, is representative of the ~60k sources we ingest from every day.

The htmldate code still performs well against this new test corpus. A few notes:
- We refactored comparison.py to load test data from .json files, so the test data is isolated from the code itself.
- The new test data is in test/eval_mediacloud_2020.json, with HTML cached in tests/eval.
- Results are formatted with the tabulate module and saved to the file system.

We hope this contribution helps document the performance of the various libraries against a more global dataset.