geneontology / amigo

AmiGO is the public interface for the Gene Ontology.
http://amigo.geneontology.org
BSD 3-Clause "New" or "Revised" License

IDs tokenized during search give strange results (bad tokenizing on colons) #157

Open kltm opened 10 years ago

kltm commented 10 years ago

This is the trackable issue for:

http://jira.geneontology.org/browse/GO-624

The current statement of the issue is:

This is considered an issue because it seems unlikely that this gene would have the observed associations with GO:0007072.

Possibilities to consider: something is wrong with the search, such that the "go" bits in the synonyms are matching (although why so few, then?), or there is a hiccup in the ontology and these really are in the closure.

kltm commented 10 years ago

Ugh. It is looking like the "go" bits now. For example, take:

http://amigo2.berkeleybop.org/amigo/gene_product/RGD:1308769

Any search starting with "go:" removes all annotations. Looking at:

http://amigo.geneontology.org/amigo/gene_product/FB:FBgn0000535

all attempts at filtering with a "go:" string fail: nothing is filtered.

Direct response with debug at:

http://golr.berkeleybop.org/select?defType=edismax&qt=standard&indent=on&wt=json&rows=10&start=0&fl=*,score&facet=true&facet.mincount=1&facet.sort=count&json.nl=arrarr&facet.limit=25&hl=true&hl.simple.pre=%3Cem%20class=%22hilite%22%3E&fq=document_category:%22annotation%22&fq=bioentity:%22FB:FBgn0000535%22&facet.field=source&facet.field=assigned_by&facet.field=aspect&facet.field=evidence_type_closure&facet.field=panther_family_label&facet.field=qualifier&facet.field=taxon_closure_label&facet.field=annotation_class_label&facet.field=regulates_closure_label&facet.field=annotation_extension_class_closure_label&q=go:0022008&qf=annotation_class^2&qf=annotation_class_label_searchable^1&qf=bioentity^2&qf=bioentity_label_searchable^1&qf=bioentity_name_searchable^1&qf=annotation_extension_class^2&qf=annotation_extension_class_label_searchable^1&qf=reference^1&qf=panther_family_searchable^1&qf=panther_family_label_searchable^1&qf=bioentity_isoform^1&qf=regulates_closure^1&qf=regulates_closure_label_searchable^1&debugQuery=true

Quoting the string prevents this. You can also see in the results that the query is being tokenized on the colon.
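To make the behavior above concrete, here is a toy model of a tokenizer that splits on whitespace and colons but keeps double-quoted phrases whole. It is only an illustration of the effect described in this comment; `toy_tokenize` is a hypothetical name and the logic is an assumption, not Solr's actual analyzer chain.

```python
import re
from typing import List


def toy_tokenize(query: str) -> List[str]:
    """Split on whitespace and colons, except inside double quotes.

    Toy model for illustration only; this is NOT Solr's real analyzer.
    """
    tokens = []
    # Match either a double-quoted phrase or a run of characters that are
    # neither whitespace nor a colon.
    for match in re.finditer(r'"([^"]*)"|([^\s:]+)', query):
        phrase, word = match.groups()
        tokens.append(phrase if phrase is not None else word)
    return tokens


print(toy_tokenize('go:0022008'))    # the ID falls apart into two tokens
print(toy_tokenize('"go:0022008"'))  # the quoted ID survives as one token
```

Under this model the unquoted ID becomes the separate tokens `go` and `0022008`, which would be consistent with the stray matches and empty results reported here.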

kltm commented 10 years ago

This seems to be the same issue as #93. However, since the explanation is clearer here, I'm going to mark the earlier one as a dupe (although it should be read to get more background).

The current takeaway is that this is a real issue, that there are a fair number of colon-related issues in Solr, and that it is probably not worth ripping up the plumbing right before we switch to Solr 4.x (which may have fixed this case or may have slightly different issues; see kltm/bbop-js#16).

The current workaround for this is that in the case of ID search in free text (which was considered a marginal case initially, but not now), one can use quotes to force the correct behaviour.

kltm/bbop-js#16

cmungall commented 10 years ago

Can we not have the API intercept these queries and auto-quote them?

kltm commented 10 years ago

So specifically, to propose a possible fix, you might add something to the consumer search function (https://kltm.github.io/bbop-js/docs/files/golr/manager-js.html#bbop.golr.manager.set_comfy_query): any "token" with a colon in it would be automatically quoted at this stage, so that it is not split further downstream. I'm not wild about this approach, mainly because 1) there seem to be actual problems with what our version of Solr is doing with the colons and 2) I believe the tokenizer we're using for searchables is eliminating them anyway. I'm not immediately sure how to work around these except by revisiting from the backend up. For example, take:

http://a2-proxy1.stanford.edu/solr/select?defType=edismax&qt=standard&indent=on&wt=json&rows=10&start=0&fl=annotation_class,description,source,synonym,alternate_id,annotation_class_label,score,id&facet=true&facet.mincount=1&facet.sort=count&json.nl=arrarr&facet.limit=25&hl=true&hl.simple.pre=%3Cem%20class=%22hilite%22%3E&fq=document_category:%22ontology_class%22&facet.field=source&facet.field=subset&facet.field=regulates_closure_label&facet.field=is_obsolete&qf=annotation_class^3&qf=annotation_class_label_searchable^5.5&qf=description_searchable^1&qf=comment_searchable^0.5&qf=synonym_searchable^1&qf=alternate_id^1&qf=regulates_closure^1&qf=regulates_closure_label_searchable^1&debugQuery=true&q=%22GO:0008750%22

you can see that it /mostly/ removed the colon from existence in the parsed query, meaning that there is certainly no match (this would likely be due to the search tokenizer we're using on the Solr end for "_searchable"s). Trying a couple of ways to URL-encode the query ahead of time doesn't help, and makes the parsed query even weirder; moreover, even if you could, I don't believe anything would match anyway.

I think the easiest approach would be to switch to the better-fixed Solr 4.6 and take out a lot of these super annoying search issues in the process.
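For reference, the auto-quoting idea discussed in this comment could be sketched roughly as follows. This is a minimal sketch under stated assumptions: `auto_quote_ids` and the `ID_LIKE` pattern are hypothetical and are not part of bbop-js or `set_comfy_query`. It also does nothing about the second concern raised above, that the tokenizer on the "_searchable" fields may drop the colon regardless.

```python
import re

# A whitespace-free token with a colon between two non-empty parts,
# e.g. GO:0008750. This pattern is a guess at what "looks like an ID";
# it is an assumption, not something taken from bbop-js.
ID_LIKE = re.compile(r'^[^\s":]+:[^\s":]+$')


def auto_quote_ids(query: str) -> str:
    """Wrap colon-containing tokens in double quotes; leave the rest alone."""
    out = []
    # Keep already-quoted phrases intact; otherwise split on whitespace.
    for match in re.finditer(r'"[^"]*"|\S+', query):
        token = match.group(0)
        if ID_LIKE.match(token):
            token = f'"{token}"'
        out.append(token)
    # Note: runs of whitespace are normalized to single spaces.
    return " ".join(out)
```

With this sketch, `auto_quote_ids('go:0022008 neurogenesis')` yields `"go:0022008" neurogenesis`, and an already-quoted ID passes through unchanged. One caveat: a deliberate field-qualified query such as `bioentity:foo` would be quoted too, so a real implementation would need a way to tell the two cases apart.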

cmungall commented 7 years ago

Also from http://jira.geneontology.org/browse/GO-1428

Seems odd that this search returns no result: http://amigo.geneontology.org/amigo/medial_search?q=S000000031. I would have expected it to return this entry: http://amigo.geneontology.org/amigo/gene_product/SGD:S000000031

kltm commented 7 years ago

This is the expected behavior given the tokenizing issue. Now that the work has been done for the new tokenizing with GOlr in the monarch stack, we just need to port it over to AmiGO by updating bbop-manager-golr.

cmungall commented 6 years ago

We're running into this again, see https://github.com/geneontology/helpdesk/issues/99

This is really key: people really expect to be able to search with the non-prefixed part of the ID. Do we still need to change bbop-manager-golr? Isn't this just a matter of adding the unprefixed form as something Solr searches on?
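As a rough illustration of what "adding the unprefixed form" might look like at load time, the sketch below derives both forms of an ID and stores them in an extra multi-valued field on the document. The target field name `bioentity_id_forms` and both helper functions are hypothetical; the actual GOlr schema and loader may handle this quite differently.

```python
from typing import Dict, List


def id_search_forms(curie: str) -> List[str]:
    """Return the full ID plus its unprefixed local part, if it has one."""
    forms = [curie]
    # Split on the first colon only: "SGD:S000000031" -> "SGD", "S000000031".
    prefix, sep, local = curie.partition(":")
    if sep and prefix and local:
        forms.append(local)
    return forms


def add_id_forms(doc: Dict[str, object],
                 id_field: str = "bioentity",
                 target_field: str = "bioentity_id_forms") -> Dict[str, object]:
    """Copy the document and add a field holding every searchable ID form.

    `target_field` is a made-up name used only for this example.
    """
    enriched = dict(doc)
    enriched[target_field] = id_search_forms(str(doc[id_field]))
    return enriched
```

For a document whose `bioentity` is `SGD:S000000031`, the extra field would hold both `SGD:S000000031` and `S000000031`, so a search on the bare accession has something to match, provided the field is indexed without splitting on the colon.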

Antonialock commented 6 years ago

Hi, any news on this? We would really like to do some analysis for a paper that we would like to submit ASAP @ValWood

ValWood commented 6 years ago

@Antonialock This isn't our primary issue; this is a side issue.