PathwayCommons / grounding-search

A biological entity grounding search service
MIT License

Provide multiple coreferent mentions to improve grounding #115

Open JohnGiorgi opened 3 years ago

JohnGiorgi commented 3 years ago

There are some situations where entity mentions have different surface forms, but should ultimately be grounded to the same ID. This includes acronyms as well as coreferent mentions, e.g.:

CCNU (lomustine) toxicity in dogs. To describe the incidence of hematological, renal, hepatic and gastrointestinal toxicities in tumour-bearing dogs receiving 1-(2-chloroethyl)-3-cyclohexyl-1-nitrosourea (CCNU).

I believe it is standard practice to ground each mention independently (call it mention-level grounding, for lack of a better term). But I can imagine that grounding all three mentions together (call it entity-level grounding) should reduce ambiguity and therefore increase grounding accuracy.

I am not sure exactly how it would work in grounding-search. I can keep thinking about it, but I would be interested to hear whether you guys think this is feasible. I already have a machine learning-based method that identifies mentions and then groups them into coreferent clusters, so it would be able to take advantage of something like this.

maxkfranz commented 3 years ago

It should be possible in one query if you can generate a single main mention (the "best" one or an aggregate) from the cluster. I wonder how well the naive approach of selecting the first mention would work.

An alternative would be to allow you to send a cluster (array) of mentions as input to the service. Even the naive approach for that would at least save network overhead (i.e. do separate queries internally).
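
A minimal sketch of the naive cluster input described above, assuming a hypothetical `search_one` callable that grounds a single mention and returns ranked IDs. Neither `ground_batch` nor `search_one` is part of grounding-search; they only illustrate running separate queries internally behind one call.

```python
from typing import Callable, Dict, List

# Hypothetical type: grounds one mention and returns candidate IDs, best first.
SearchFn = Callable[[str], List[str]]


def ground_batch(mentions: List[str], search_one: SearchFn) -> Dict[str, List[str]]:
    """Run one internal query per distinct mention and key the hits by mention."""
    results: Dict[str, List[str]] = {}
    for mention in mentions:
        if mention not in results:  # skip repeated surface forms
            results[mention] = search_one(mention)
    return results
```

The caller makes a single request with the whole cluster and still gets the per-mention hits back, so any smarter merging can be layered on later.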

JohnGiorgi commented 3 years ago

> It should be possible in one query if you can generate a single main mention (the "best" one or an aggregate) from the cluster. I wonder how well the naive approach of selecting the first mention would work.

Yes, I should have clarified that I can make it work just by choosing one of the mentions. For now, I choose the longest, the intuition being that it's likely to be the least ambiguous. This seems to work pretty well.
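
The longest-mention heuristic can be sketched in a few lines. The function name `pick_representative` is made up for illustration, and breaking ties by taking the earliest mention is an assumption, not something settled in this thread.

```python
from typing import List


def pick_representative(mentions: List[str]) -> str:
    """Return the longest mention in a cluster; ties go to the earliest one."""
    if not mentions:
        raise ValueError("cluster must contain at least one mention")
    # max() returns the first maximal element, which gives the tie-break above.
    return max(mentions, key=len)


cluster = [
    "CCNU",
    "lomustine",
    "1-(2-chloroethyl)-3-cyclohexyl-1-nitrosourea (CCNU)",
]
representative = pick_representative(cluster)
```

For the example cluster, this selects the full chemical name, which would then be sent as the single query.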

> An alternative would be to allow you to send a cluster (array) of mentions as input to the service. Even the naive approach for that would at least save network overhead (i.e. do separate queries internally).

Yeah, exactly what I was thinking!

Using the above example you could imagine a situation where "CCNU" is ambiguous and brings up multiple hits, but querying with ["CCNU", "lomustine"] or even ["CCNU", "lomustine", "1-(2-chloroethyl)-3-cyclohexyl-1-nitrosourea (CCNU)"] reduces the ambiguity and leads to one clear hit. As you said, the naive approach might be enough (separate queries internally).
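
One way this could play out, as a rough sketch: query each mention separately, then sum the scores per candidate ID so that an ID supported by several mentions rises to the top. Everything here is hypothetical, including `ground_cluster`, the `(id, score)` hit shape, and the toy index with placeholder IDs; it is not how grounding-search ranks results.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Hit = Tuple[str, float]  # (candidate ID, score); hypothetical shape
SearchFn = Callable[[str], List[Hit]]


def ground_cluster(mentions: List[str], search_one: SearchFn) -> List[Hit]:
    """Query each distinct mention and rank candidate IDs by summed score."""
    totals: Dict[str, float] = defaultdict(float)
    for mention in dict.fromkeys(mentions):  # de-duplicate, keep order
        for candidate_id, score in search_one(mention):
            totals[candidate_id] += score
    # Highest total first; ties broken by ID so the order is deterministic.
    return sorted(totals.items(), key=lambda hit: (-hit[1], hit[0]))


# Toy stand-in for the service: "CCNU" alone is ambiguous, "lomustine" is not.
# The IDs and scores are invented placeholders.
TOY_INDEX: Dict[str, List[Hit]] = {
    "CCNU": [("ID:A", 0.5), ("ID:B", 0.5)],
    "lomustine": [("ID:A", 1.0)],
}


def toy_search(mention: str) -> List[Hit]:
    return TOY_INDEX.get(mention, [])


ranked = ground_cluster(["CCNU", "lomustine"], toy_search)
```

With the toy data, "CCNU" on its own ties between two candidates, while the two-mention cluster ranks `ID:A` clearly first.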