TranslatorSRI / NameResolution

A service for finding CURIEs from lexical strings.

"Bone" works well when autocomplete=true but breaks when autocomplete=false #142

Open gaurav opened 5 months ago


When autocomplete is true, we set the query parameter, query, to e.g. (water) OR (water*), so that incomplete search strings still match (e.g. (bloo) OR (bloo*) lets us find blood).

When autocomplete is false, we set query to just (water). In most cases this works fine, but sometimes the results are very different, as with bone. Compare:

We can restore the previous results by including the search term twice, as before, i.e. (bone) OR (bone). I tried that out in a branch and confirmed that it works:

https://github.com/TranslatorSRI/NameResolution/blob/8006348313b9eb619673c6c6362dec19829f342c/api/server.py#L301
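As a rough illustration of the three query shapes discussed so far, here is a minimal sketch. The function name `build_query` and the `duplicate_exact` flag are hypothetical and are not taken from api/server.py; the sketch also assumes the search term has already been escaped for the query parser.

```python
def build_query(term: str, autocomplete: bool, duplicate_exact: bool = False) -> str:
    """Build the value of the `query` parameter (hypothetical helper).

    Assumes `term` has already been escaped for the query parser.
    """
    exact = f"({term})"
    if autocomplete:
        # Exact term OR prefix wildcard, so that "bloo" can still match "blood".
        return f"{exact} OR ({term}*)"
    if duplicate_exact:
        # Workaround described in this issue: repeat the exact term.
        return f"{exact} OR {exact}"
    # Current autocomplete=false behaviour: the exact term only.
    return exact
```

With this sketch, `build_query("bone", False)` gives `(bone)`, while `build_query("bone", False, duplicate_exact=True)` gives the `(bone) OR (bone)` form tested in the branch.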

Looking at the explain output, it looks like both queries set off a search for "bone bone", which might be pulling the correct UBERON term higher up:

    "rawquerystring":"(bone) OR (bone*)",
    "querystring":"(bone) OR (bone*)",
    "parsedquery":"+(DisjunctionMaxQuery((names:bone | (preferred_name_exactish:bone)^10.0 | preferred_name:bone)) DisjunctionMaxQuery((names:bone* | preferred_name:bone* | (preferred_name_exactish:bone*)^10.0))) DisjunctionMaxQuery(((names:\"bone bone\")^2.0 | (preferred_name:\"bone bone\")^3.0))",
    "parsedquery_toString":"+((names:bone | (preferred_name_exactish:bone)^10.0 | preferred_name:bone) (names:bone* | preferred_name:bone* | (preferred_name_exactish:bone*)^10.0)) ((names:\"bone bone\")^2.0 | (preferred_name:\"bone bone\")^3.0)",
    "rawquerystring":"(bone) OR (bone)",
    "querystring":"(bone) OR (bone)",
    "parsedquery":"+(DisjunctionMaxQuery((names:bone | (preferred_name_exactish:bone)^10.0 | preferred_name:bone)) DisjunctionMaxQuery((names:bone | (preferred_name_exactish:bone)^10.0 | preferred_name:bone))) DisjunctionMaxQuery(((names:\"bone bone\")^2.0 | (preferred_name:\"bone bone\")^3.0))",
    "parsedquery_toString":"+((names:bone | (preferred_name_exactish:bone)^10.0 | preferred_name:bone) (names:bone | (preferred_name_exactish:bone)^10.0 | preferred_name:bone)) ((names:\"bone bone\")^2.0 | (preferred_name:\"bone bone\")^3.0)",

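To check the observation about "bone bone" mechanically, the sketch below pulls the boosted phrase clauses out of the two parsedquery_toString values shown above and compares them. The helper name `phrase_clauses` and the regular expression are my own assumptions for illustration; they only cover the clause format that appears in this output.

```python
import re
from typing import List, Tuple

# parsedquery_toString values copied from the explain output above.
AUTOCOMPLETE_PARSED = (
    '+((names:bone | (preferred_name_exactish:bone)^10.0 | preferred_name:bone) '
    '(names:bone* | preferred_name:bone* | (preferred_name_exactish:bone*)^10.0)) '
    '((names:"bone bone")^2.0 | (preferred_name:"bone bone")^3.0)'
)
DUPLICATED_PARSED = (
    '+((names:bone | (preferred_name_exactish:bone)^10.0 | preferred_name:bone) '
    '(names:bone | (preferred_name_exactish:bone)^10.0 | preferred_name:bone)) '
    '((names:"bone bone")^2.0 | (preferred_name:"bone bone")^3.0)'
)

# Matches boosted phrase clauses such as (names:"bone bone")^2.0
PHRASE_CLAUSE = re.compile(r'\((\w+):"([^"]+)"\)\^([\d.]+)')


def phrase_clauses(parsed_query: str) -> List[Tuple[str, str, float]]:
    """Return (field, phrase, boost) for each boosted phrase clause (hypothetical helper)."""
    return [
        (field, phrase, float(boost))
        for field, phrase, boost in PHRASE_CLAUSE.findall(parsed_query)
    ]
```

Both parsed queries yield the same two clauses, a "bone bone" phrase on names (boost 2.0) and on preferred_name (boost 3.0). That is consistent with the guess above, though it does not by itself show that the phrase clause is what moves the UBERON term up.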
So the real solution here would be to improve the search so that we don't need to duplicate terms. The clique count I'm currently testing might help with that, but the scores may differ by enough that it makes no difference. If that's the case, we'll need to be smarter about this.