Egork/scholar 32314 citation matching

egork520 commented 2 years ago

Modifying similarity function for the citation matching

egork520 commented 2 years ago

I think you said an updated threshold value will be coming from evaluations. This seems important as without it, if I'm reading stuff right, it seems like having just one token in common would be sufficient to make a match.

Yes, that is correct.
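To illustrate the concern, here is a minimal sketch of token-count cosine similarity with and without a score threshold. It is a standard-library stand-in rather than the code in this PR: the helper names (`token_counts`, `cosine_similarity`, `best_match`), the tokenizer, and the example strings are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def token_counts(text):
    # Hypothetical tokenizer: lowercase alphanumeric runs, counted per token.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    # Cosine similarity between the token-count vectors of two strings.
    ca, cb = token_counts(a), token_counts(b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_match(bib_entry, candidates, score_threshold=0.0):
    # candidates maps a (made-up) s2_id to a paper title.
    # Returns the best-scoring s2_id, or None if the score does not exceed the threshold.
    scored = [(cosine_similarity(bib_entry, title), s2_id)
              for s2_id, title in candidates.items()]
    score, s2_id = max(scored, default=(0.0, None))
    return s2_id if score > score_threshold else None


bib_entry = "Smith J. Deep learning for parsing. 2019"
candidates = {"p1": "A survey of learning theory"}  # shares only the token "learning"

print(best_match(bib_entry, candidates))       # matched on a single shared token
print(best_match(bib_entry, candidates, 0.2))  # rejected once a threshold applies
```

In this toy case the only shared token is "learning", which gives a score of about 0.17: enough to match when the threshold is 0, but not when it is 0.2.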

My understanding is that once we have the threshold, we will be ready to put this into production. Is that correct?

Yes, that is right.

I have one more improvement, which is described in the doc: I use the full bib entry to measure similarity, without any post-processing of the bib entry. On the annotation dataset the two methods agreed on most items (155) and disagreed on 9. Of those 9 items, 4 were matched correctly by one method vs. 1 by the other. So we can look at the performance of both methods (the currently proposed one, and the currently proposed one plus this enhancement) and pick whichever performs better. I believe it is going to be the second one.
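The comparison above can be sketched as follows. This is only an illustration under stated assumptions: `compare_methods` is a hypothetical helper, and the ids and predictions are made up rather than taken from the annotation dataset.

```python
def compare_methods(pred_a, pred_b, gold):
    # pred_a, pred_b, gold map a bib entry key to an s2_id (or None for "no match").
    # Counts agreements, disagreements, and which method is right when they disagree.
    agree = disagree = a_correct = b_correct = 0
    for key, truth in gold.items():
        a, b = pred_a.get(key), pred_b.get(key)
        if a == b:
            agree += 1
            continue
        disagree += 1
        a_correct += int(a == truth)
        b_correct += int(b == truth)
    return {"agree": agree, "disagree": disagree,
            "a_correct": a_correct, "b_correct": b_correct}


# Made-up toy data, for illustration only.
gold = {"b1": "s1", "b2": "s2", "b3": "s3", "b4": None}
pred_a = {"b1": "s1", "b2": "sX", "b3": None, "b4": "s9"}  # e.g. post-processed text
pred_b = {"b1": "s1", "b2": "s2", "b3": "s3", "b4": "s8"}  # e.g. full bib entry

print(compare_methods(pred_a, pred_b, gold))
```

Note that on a disagreement both methods can be wrong (as for `b4` here), so the two "correct" counts need not add up to the number of disagreements.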

egork520 commented 2 years ago

To close the loop on the threshold: I have added a new section to the doc, and we decided to use score_threshold = 0.2. Here is the plot of score vs. match rate and score vs. coverage of bib entries. Solid lines correspond to match rate vs. score; dashed lines correspond to coverage vs. score. Three sets of results are presented:

3-gram approach (the current approach used in scholarPhi), plotted in magenta: at score threshold = 0.5, the match rate is ~88% and 7% of items are not assigned an s2_id.
CountVectorizer approach using text extracted from the bib entry, plotted in green: at score = 0.2, the match rate is 98% and 25% of items are not assigned an s2_id.
CountVectorizer approach using the full bib entry, plotted in yellow: at score = 0.2, the match rate is 98.4% and 19% of items are not assigned an s2_id.
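For reference, the two quantities on the plot can be computed along these lines. This sketch assumes that match rate means the fraction of assigned items whose match is correct and that coverage means the fraction of items that get an s2_id at all. The thread does not define them, and `evaluate` and the scores below are made up for illustration.

```python
def evaluate(scored_items, score_threshold):
    # scored_items: list of (best_score, is_correct) pairs, one per bib entry.
    # An entry is assigned an s2_id only if its best score exceeds the threshold.
    assigned = [ok for score, ok in scored_items if score > score_threshold]
    coverage = len(assigned) / len(scored_items) if scored_items else 0.0
    match_rate = sum(assigned) / len(assigned) if assigned else 0.0
    return match_rate, coverage


# Made-up scores, for illustration only.
items = [(0.9, True), (0.6, True), (0.3, True), (0.15, False), (0.05, False)]

for threshold in (0.0, 0.2, 0.5):
    match_rate, coverage = evaluate(items, threshold)
    print(f"threshold={threshold}: match_rate={match_rate:.2f}, coverage={coverage:.2f}")
```

Raising the threshold trades coverage for match rate, which is the trade-off the solid and dashed lines show.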

kelseym-ai2 commented 2 years ago

Much improved AUC! Given that we have a mechanism to fall back to unmatched reference strings in raw form, I think this very high precision with lower recall is a better user experience. We're only showing users the data that we're confident in, instead of having 12% erroneous cards.

ca16 commented 2 years ago

Thanks for the changes!

allenai / scholarphi

Egork/scholar 32314 citation matching #356