allenai / scholarphi

An interactive PDF reader.
Apache License 2.0
416 stars 52 forks

Egork/scholar 32314 citation matching #356

Closed egork520 closed 2 years ago

egork520 commented 2 years ago

Modifies the similarity function used for citation matching.

egork520 commented 2 years ago

> I think you said an updated threshold value will be coming from evaluations. This seems important as without it, if I'm reading stuff right, it seems like having just one token in common would be sufficient to make a match.

Yes, that is correct.
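To make the single-token concern concrete, here is a minimal sketch. It assumes a Jaccard-style token-overlap score; the actual similarity function in this PR is not shown in the thread, so `tokenize`, `jaccard`, and `best_match` are hypothetical names used only for illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Token-set Jaccard similarity (an assumed stand-in for the PR's score)."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def best_match(bib_entry, candidates, score_threshold=0.0):
    """Return (candidate, score) for the best candidate scoring strictly
    above score_threshold, or None if no candidate clears it."""
    scored = [(c, jaccard(bib_entry, c)) for c in candidates]
    scored = [(c, s) for c, s in scored if s > score_threshold]
    return max(scored, key=lambda cs: cs[1]) if scored else None

bib = "Smith J. Deep learning for parsing. 2019."
weak = "A survey of deep reinforcement methods"   # shares only "deep"
strong = "Deep learning for parsing"

# With no effective threshold, one shared token is enough to match.
print(best_match(bib, [weak]))
# With a threshold, the weak candidate is rejected.
print(best_match(bib, [weak], score_threshold=0.2))
```

Whether the real comparison is strict (`>`) or inclusive (`>=`) is also an assumption here; the point is only that a zero threshold accepts any non-empty overlap.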

> My understanding is that once we have the threshold, we will be ready to put this into production. Is that correct?

Yes, that is right.

I have one more improvement, which is described in the doc: I use the full bib entry to measure similarity, without any post-processing of the entry. On the annotation dataset, the two methods agreed on most items (155) and disagreed on 9. Of those 9, one method matched 4 correctly versus 1 for the other. So we can compare the performance of both methods, the currently proposed one and the currently proposed one plus this enhancement, and pick whichever performs better. I believe it is going to be the second one.

egork520 commented 2 years ago

To close the loop on the threshold: I added a new section to the doc, and we decided to use score_threshold = 0.2. Here is the plot of score vs. match rate and score vs. coverage of bib entries. Solid lines correspond to match rate vs. score; dashed lines correspond to coverage vs. score. Three sets of results are presented:

image
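For readers without access to the plot, the kind of sweep behind it could be sketched roughly as follows. This assumes "coverage" means the fraction of bib entries whose best score clears the threshold and "match rate" means the fraction of those retained matches that are correct; both definitions, the function name `sweep`, and the toy data are assumptions for illustration, not the evaluation results from the doc.

```python
def sweep(scored, thresholds):
    """scored: list of (best_score, is_correct) pairs, one per bib entry.
    Returns a list of (threshold, coverage, match_rate) rows."""
    total = len(scored)
    rows = []
    for t in thresholds:
        # Keep only entries whose best score is strictly above the threshold.
        kept = [ok for score, ok in scored if score > t]
        coverage = len(kept) / total if total else 0.0
        # Match rate is undefined when nothing is kept.
        match_rate = sum(kept) / len(kept) if kept else None
        rows.append((t, coverage, match_rate))
    return rows

# Toy data only: (score, whether the proposed match was judged correct).
toy = [(0.05, False), (0.15, False), (0.25, True), (0.6, True), (0.9, True)]

for t, coverage, match_rate in sweep(toy, [0.0, 0.2, 0.5]):
    print(f"threshold={t:.1f} coverage={coverage:.2f} match_rate={match_rate}")
```

In this toy example, raising the threshold from 0.0 to 0.2 drops the low-scoring wrong matches, so match rate rises while coverage falls, which is the trade-off the two sets of curves describe.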

kelseym-ai2 commented 2 years ago

Much improved AUC! Given that we have a mechanism to fall back to unmatched reference strings in raw form, I think this combination of very high precision and lower recall is a better user experience. We're only showing users the data that we're confident in, instead of having 12% erroneous cards.

ca16 commented 2 years ago

Thanks for the changes!