clamsproject / aapb-evaluations

Collection of evaluation codebases
Apache License 2.0

NEL report precision, recall, and F1 scores are not reproducible #68

Open BenLambright opened 3 months ago

BenLambright commented 3 months ago

Bug Description

When I run python evaluate.py preds@dbpedia-spotlight-wrapper@aapb-collaboration-21 golds, I get the same counts for gold and system entities as the report, but not the same precision, recall, and F1 scores. The scores I get are all either 0 or very close to 0.
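
For context, precision, recall, and F1 here depend entirely on how many gold and system links are judged identical: if the comparison key no longer lines up with how the links are stored, the true-positive count collapses to zero while the raw gold and system counts stay unchanged, which matches what I am seeing. Below is a minimal sketch of that relationship, assuming each link is reduced to a hashable key such as (guid, span, grounding URI); the names are illustrative and not taken from evaluate.py.

    def prf1(gold_keys, pred_keys):
        # Precision/recall/F1 over sets of comparable entity-link keys.
        tp = len(gold_keys & pred_keys)  # links judged identical in both sets
        precision = tp / len(pred_keys) if pred_keys else 0.0
        recall = tp / len(gold_keys) if gold_keys else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1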

Reproduction steps

  1. cd into nel_eval.
  2. Remove GUID cpb-aacip-507-nk3610wp6s from both the preds and golds, since its gold data is defunct and evaluate.py errors out otherwise (a helper sketch follows this list).
  3. Run python evaluate.py preds@dbpedia-spotlight-wrapper@aapb-collaboration-21 golds.
  4. View the results.
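
For step 2, a helper along these lines can set the defunct GUID's files aside before running the evaluation; the directory names and the assumption that each file name contains the GUID are mine and not verified against the repo layout.

    from pathlib import Path

    DEFUNCT_GUID = "cpb-aacip-507-nk3610wp6s"

    def exclude_defunct(*dirs):
        # Rename any file whose name contains the defunct GUID so evaluate.py skips it.
        for d in dirs:
            for f in Path(d).glob(f"*{DEFUNCT_GUID}*"):
                f.rename(f.with_name(f.name + ".excluded"))

    exclude_defunct("preds@dbpedia-spotlight-wrapper@aapb-collaboration-21", "golds")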

Expected behavior

See the report for the expected behavior.

Log output

No response

Screenshots

No response

Additional context

I have tried different methods of comparing the golds and preds (hashing, string comparison, manual checking), and as far as I can tell, the criteria for matching the pred and gold NamedEntityLink objects must have changed in the current iteration of evaluate.py since the report was written.
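
One way to test that hypothesis is to count gold/pred overlaps under progressively looser comparison criteria and see which one recovers the numbers in the report. The attribute names below (guid, start, end, uri, text) are placeholders for whatever NamedEntityLink actually exposes, not its real API.

    def overlap(golds, preds, key):
        # Size of the intersection of gold and pred links under a given key function.
        return len({key(g) for g in golds} & {key(p) for p in preds})

    CRITERIA = {
        "exact (guid, span, uri)": lambda e: (e.guid, e.start, e.end, e.uri),
        "span only": lambda e: (e.guid, e.start, e.end),
        "surface text + uri": lambda e: (e.guid, e.text.lower(), e.uri),
    }

    def diagnose(golds, preds):
        for name, key in CRITERIA.items():
            print(f"{name}: {overlap(golds, preds, key)} matching links")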