TranslatorSRI / NameResolution

A service for finding CURIEs from lexical strings.

New synonym format leads to much worse querying #50

Closed gaurav closed 9 months ago

gaurav commented 1 year ago

I've set up a NameRes instance on Sterling (accessible in the RENCI VPN only) at http://name-resolution-sri-dev.apps.renci.org/docs using the new synonym format we've built for NameRes (https://github.com/TranslatorSRI/NameResolution/pull/46, https://github.com/helxplatform/translator-devops/pull/634, https://github.com/TranslatorSRI/Babel/pull/113).

You can also directly access the underlying Solr database by running:

$ kubectl port-forward -n translator-exp name-lookup-solr-dep-0 8983:8983

and then accessing http://localhost:8983/ on your computer.
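Once the port-forward is running, a query can also be sent to Solr from a script. This is only a rough sketch: the core name `name_lookup` and the helpers `build_select_url` and `search` are assumptions for illustration, so substitute whatever core the Solr admin page at http://localhost:8983/ actually lists.

```python
import json
import urllib.parse
import urllib.request

SOLR_BASE = "http://localhost:8983/solr"  # reachable only while the port-forward is running
CORE = "name_lookup"  # ASSUMPTION: check the Solr admin UI for the real core name


def build_select_url(query, rows=10, base=SOLR_BASE, core=CORE):
    """Build a URL for Solr's standard /select handler (hypothetical helper)."""
    params = urllib.parse.urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base}/{core}/select?{params}"


def search(query, rows=10):
    """Run the query and return the matching documents as a list of dicts."""
    with urllib.request.urlopen(build_select_url(query, rows)) as resp:
        return json.load(resp)["response"]["docs"]
```

Calling `search("names:blood")` should then return the same documents that the Solr query page shows for that query.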

The bad news is that querying Solr directly and querying it through the NameRes frontend both give significantly worse results than we get with the old system. For example, querying https://name-resolution-sri.renci.org/docs for blood gives us UBERON:0000178, NCIT:C12434 and UMLS:C0851353 (all meaning "blood") followed by UMLS:C0851353 ("bloody"). But running the same query on http://name-resolution-sri-dev.apps.renci.org/docs gives us UMLS:C5169928 ("JWH-073 3-hydroxybutyl (synthetic cannabinoid metabolite) | Blood | Drug toxicology"), UMLS:C5171063 ("Lindane | Blood | Drug toxicology"), UMLS:C0312901 ("Blood group antigen IBH") and a bunch of others.

Searching with Solr gives slightly more relevant results, but not the really good results that https://name-resolution-sri.renci.org/docs gives.

One possible reason for this is that I've indexed the names field as a multiValued field (since it contains multiple values). Changing it to a non-multiValued field definitely helps with the results in Solr, but it causes NameRes to no longer work. I'll try fixing that and see if that solves this bug. If not, I'll probably need some help with the Solr querying and indexing aspect of all this.

gaurav commented 1 year ago

This seems to be caused by the query being names:{fragment}*. Removing the asterisk fixes this problem, and the query (preferred_name:{fragment}^10 OR names:{fragment} OR names:{fragment}*) works pretty well:

https://github.com/TranslatorSRI/NameResolution/blob/61fb6d2c5601563981e05da3fcfc2bebb7723e9f/api/server.py#L104-L111

(The query (preferred_name:{fragment}^10 OR names:{fragment}*) still prioritizes odd results over anything that isn't a preferred-name match, and (preferred_name:{fragment}^10 OR names:{fragment}) fails to match when the fragment is incomplete, e.g. Alzheimer disease matches but Alzheimer's disease fails.)
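To make the three variants compared above easier to try side by side, here is a minimal sketch that builds each query string from a fragment. The function name `build_queries` and the dictionary keys are hypothetical and are not taken from api/server.py. The fragment is inserted verbatim, so a fragment containing spaces or Solr special characters would probably need quoting or escaping first.

```python
def build_queries(fragment):
    """Return the three query variants discussed above for one search fragment.

    Hypothetical helper for comparison only; the fragment is NOT escaped.
    """
    preferred = f"preferred_name:{fragment}^10"  # boost preferred-name matches
    exact = f"names:{fragment}"                  # match the fragment as given
    prefix = f"names:{fragment}*"                # wildcard match on the fragment
    return {
        "prefix_only": f"({preferred} OR {prefix})",
        "exact_only": f"({preferred} OR {exact})",
        "combined": f"({preferred} OR {exact} OR {prefix})",
    }
```

Each of the returned strings can be pasted into the Solr query page to compare the rankings for the same fragment.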

gaurav commented 9 months ago

This has now been significantly improved, and it's working well enough that it is what the Translator UI is using. Closing.