geneontology / amigo

AmiGO is the public interface for the Gene Ontology.
http://amigo.geneontology.org
BSD 3-Clause "New" or "Revised" License

General search results rankings (p53 as example) #102

Open cmungall opened 10 years ago

cmungall commented 10 years ago

Typing "p53" (no space) in the search box has this gene showing up first:

http://amigo2.berkeleybop.org/cgi-bin/amigo2/amigo/gene_product/MGI:MGI:2146005

This is due to the fact that its full name reflects its function, p53 binding.

Adding the space gives better results (we have to fix this, I'm afraid), but the mouse p53 is nowhere to be found.

Even with the MGI filter turned on with this search http://amigo2.berkeleybop.org/cgi-bin/amigo2/amigo/search/bioentity?q=p53

It's hard to find - I had to go via the panther family, eventually I got it: http://amigo2.berkeleybop.org/cgi-bin/amigo2/amigo/gene_product/MGI:MGI:98834

It does have p53 as the synonym...

Let's work with others on fixing this.

kltm commented 10 years ago

For the middle (bioentity live search) case, considering that synonyms are apparently not in there at all, it's doing a pretty good job. I'll add synonyms to the boost config at 1.0 for starters.

Current bio-config.yaml:

boost_weights: bioentity^2.0 bioentity_label^2.0 bioentity_name^1.0 bioentity_internal_id^1.0 isa_partof_closure_label^1.0 regulates_closure^1.0 regulates_closure_label^1.0 panther_family^1.0 panther_family_label^1.0 taxon_closure_label^1.0
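To make the proposed change concrete, here is a minimal Python sketch of parsing a `boost_weights` string like the one above and appending a synonym boost at 1.0. The field name `synonym` and both helper functions are assumptions for illustration; they are not taken from the AmiGO configuration or loader.

```python
def parse_boost_weights(spec):
    """Parse 'field^weight field^weight ...' into a dict, preserving order."""
    weights = {}
    for item in spec.split():
        field, _, weight = item.partition("^")
        # A field listed without an explicit boost is treated as 1.0.
        weights[field] = float(weight) if weight else 1.0
    return weights


def format_boost_weights(weights):
    """Serialize the dict back into the 'field^weight' string form."""
    return " ".join(f"{field}^{weight:.1f}" for field, weight in weights.items())


# The boost_weights value quoted in the comment above.
current = (
    "bioentity^2.0 bioentity_label^2.0 bioentity_name^1.0 "
    "bioentity_internal_id^1.0 isa_partof_closure_label^1.0 "
    "regulates_closure^1.0 regulates_closure_label^1.0 "
    "panther_family^1.0 panther_family_label^1.0 taxon_closure_label^1.0"
)

weights = parse_boost_weights(current)
# Hypothetical field name: add synonyms at 1.0 "for starters".
weights.setdefault("synonym", 1.0)
updated = format_boost_weights(weights)
```

Keeping the existing entries untouched and only appending the new field means the relative ranking of current matches stays the same unless a synonym also matches.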

For the other two, it's a matter of structuring the general search better. Currently, there are three categories: entity (id), entity_label, and a big ball of "stuff"--synonyms are in "stuff", along with everything else. This is done to allow a unioned search across all of the various doc types. We can boost the synonym results either by making another top-level like "important_stuff" that gets weighed higher, or by making synonyms more prevalent in the stuff, by repeating them or something.
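The effect of the first option can be sketched with a toy scorer in Python. Everything here is an assumption made for illustration: the two documents are invented, the weights are arbitrary, and the scoring is a simple "add the field weight on an exact token match" rule rather than Solr's real relevance formula.

```python
def score(doc, query, weights):
    """Add a field's weight whenever the query equals one of its tokens."""
    q = query.lower()
    total = 0.0
    for field, weight in weights.items():
        if q in doc.get(field, "").lower().split():
            total += weight
    return total


def rank(docs, query, weights):
    """Return document names ordered from best to worst score."""
    return sorted(docs, key=lambda name: score(docs[name], query, weights), reverse=True)


# Current layout: synonyms are buried in the low-weight "stuff" field.
current_weights = {"entity": 2.0, "entity_label": 2.0, "stuff": 1.0}
current_docs = {
    "binding_protein": {"entity_label": "p53 binding protein", "stuff": "mouse protein"},
    "p53_gene": {"entity_label": "Trp53", "stuff": "p53 mouse protein"},
}

# Proposed layout: synonyms move to a hypothetical higher-weight field.
proposed_weights = dict(current_weights, important_stuff=3.0)
proposed_docs = {
    "binding_protein": {"entity_label": "p53 binding protein", "stuff": "mouse protein"},
    "p53_gene": {"entity_label": "Trp53", "important_stuff": "p53", "stuff": "mouse protein"},
}

before = rank(current_docs, "p53", current_weights)
after = rank(proposed_docs, "p53", proposed_weights)
```

In this toy setup the gene whose synonym is "p53" ranks below the binding protein under the current layout and above it once synonyms carry their own weight. Repeating synonyms inside "stuff" would push in the same direction, but only indirectly through term frequency.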

kltm commented 10 years ago

Either way, will need to actually change fields or field types in the loader, so we can explore it when we climb into the Java again for 2.2.

monicacecilia commented 10 years ago

Oh wow, yes that looks like a mess of search results. -- There is also similarly "funny" behavior when annotators are using the "GO ID" tool on the "Information Editor" in Web Apollo. As I understand, that tool connects to AmiGO -- but I don't know the details of that connection. "This is another story and shall be told another time" (M. Ende). What is relevant to say is that fixing this search will also improve experiences outside AmiGO.

From Seth: "We can boost the synonym results either by making another top-level like "important_stuff" that gets weighed higher, or by making synonyms more prevalent in the stuff, by repeating them or something."

Likely better to create the bag of "important stuff", than repeating the synonyms.

kltm commented 10 years ago

Hm. A new "important stuff" bag gets one to consider how important the stuff is; maybe we need an "importanter stuff" bag too? That could get very silly pretty fast. OTOH, gaming the schema at too low a level can get fiddly.

Also to consider are issues like #24 and how they would relate to a general schema. We're going to need to extend it a little no matter what it seems. Perhaps I'll change the item to something like: re-engineer the general schema, with a list of things we want out of it.

kltm commented 9 years ago

We've had a similar discussion with @rbalakri about the results with "proximal" and the GO--currently when searching for "proximal", many non-GO terms take priority, which may confuse some users' expectations. (E.g. "proximal rib", etc.)

After a little discussion with @cmungall about things that might be done to improve that, one possibility that we might look at is adding a field to the general search schema (maybe document_relevance_category) that would be strings like "core ontology", "peripheral bioentity"; we could then tweak the search to give greater preference to "core" entities or add a collapsible radio button set under the box that allowed you to goose the search for ontology terms, etc.

Essentially, the search would give preference based on relevance tagging done during the load stage. While this would require some playing with the loader, I feel that this happens in enough of a transparent way that it might be the way forward.
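As a rough Python sketch of that idea, assume each hit carries the proposed `document_relevance_category` string set at load time and the search multiplies the base score by a per-category factor. The multipliers, the base scores, the "peripheral ontology" category, and the second label are all invented for illustration; only the field name and the "core ontology" / "peripheral bioentity" values come from the comment above.

```python
# Hypothetical multipliers; unlisted categories fall back to 1.0.
CATEGORY_BOOST = {"core ontology": 2.0, "peripheral bioentity": 1.0}


def rerank(hits, category_boost=CATEGORY_BOOST, default=1.0):
    """Order hits by base score times the boost for their relevance category."""
    def boosted(hit):
        category = hit.get("document_relevance_category")
        return hit["score"] * category_boost.get(category, default)
    return sorted(hits, key=boosted, reverse=True)


# Invented hits for the query "proximal"; scores are placeholders.
hits = [
    {"label": "proximal rib", "score": 1.2,
     "document_relevance_category": "peripheral ontology"},
    {"label": "proximal tubule development", "score": 1.0,
     "document_relevance_category": "core ontology"},
]

ordered = [hit["label"] for hit in rerank(hits)]
```

A UI control that "gooses" the search for ontology terms could be modeled as passing a different `category_boost` mapping per request, leaving the loaded documents unchanged.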

rbalakri commented 9 years ago

I like this idea. Can we talk about this at Barcelona?

Rama

kltm commented 9 years ago

We can, but this is already scheduled for 2.3, so we'll likely be getting to it post-meeting at some point anyways.

kltm commented 9 years ago

This is related to kltm/bbop-js#16.

kltm commented 9 years ago

Answering @cmungall on #239.

> Ideally human would be first followed by MODs. This could be a configuration, or alternatively scoring each gene by number of experimental annotations would be a nice generic way to do it. This would be an easy field for @hdietze to add when loading.

What this would boil down to would be two new fields, say: search_bin_priority_one and search_bin_priority_two. Human genes would populate the first one, MODs get the second, everybody else gets none.

The search would then be boosted on those two fields, say: search_bin_priority_one^4.0 search_bin_priority_two^2.0.
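A minimal Python sketch of the load-time binning could look like the following. Only the two `search_bin_priority_*` field names and the 4.0/2.0 boosts come from the comment above. The `taxon` and `bioentity_label` keys, the choice to copy the label into the bin field, and the short MOD taxon set are assumptions made for illustration.

```python
HUMAN_TAXON = "NCBITaxon:9606"
# Illustrative subset only (mouse, zebrafish); not the real list of MOD taxa.
MOD_TAXA = {"NCBITaxon:10090", "NCBITaxon:7955"}

# Boost string as proposed in the comment above.
PRIORITY_BOOSTS = "search_bin_priority_one^4.0 search_bin_priority_two^2.0"


def add_priority_bins(doc):
    """Return a copy of doc with at most one priority bin field populated."""
    out = dict(doc)
    label = doc.get("bioentity_label", "")
    taxon = doc.get("taxon")
    if taxon == HUMAN_TAXON:
        out["search_bin_priority_one"] = label
    elif taxon in MOD_TAXA:
        out["search_bin_priority_two"] = label
    # Everybody else gets neither field, so neither boost applies.
    return out


human = add_priority_bins({"bioentity_label": "TP53", "taxon": "NCBITaxon:9606"})
mouse = add_priority_bins({"bioentity_label": "Trp53", "taxon": "NCBITaxon:10090"})
other = add_priority_bins({"bioentity_label": "p53", "taxon": "NCBITaxon:0000"})
```

Because the bins are plain fields, changing the preference order later only means editing the boost string, not reloading the data.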

kltm commented 8 years ago

As another case, from http://jira.geneontology.org/browse/GO-1007, it would be nice to have tokenizing more sensitive to common use cases like let-23, where a user might be surprised by the fact that the tokenizer defaults to breaking on the hyphen.
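The surprise can be reproduced with two small regex tokenizers in Python. These only imitate the behavior described; they are not the analyzers AmiGO actually uses, and both function names are made up for this sketch.

```python
import re


def tokenize_breaking_on_hyphen(text):
    """Split on anything that is not a letter or digit, including '-'."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tokenize_keeping_hyphen(text):
    """Keep hyphen-joined runs such as gene symbols together as one token."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


broken = tokenize_breaking_on_hyphen("let-23")
kept = tokenize_keeping_hyphen("let-23")
```

With the first tokenizer a query for let-23 becomes a search for the separate tokens "let" and "23", which can match many unrelated documents; with the second it stays a single token.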

doughowe commented 8 years ago

From the Noctua session at the Geneva GO meeting. Seth suggested I post this here:

At ZFIN, for autocompletion in term entry boxes, we use a model that allows "starts with" searching for multiple words. This saves many keystrokes. Example: Entering "trans fac pol" would find all the terms that include words matching all three: "trans*" "fac*" "pol*"

like "transcription bla bla factor bla bla bla polymerase bla bla bla"

We really like that mechanism for term searching in ZFIN...food for thought.
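A minimal Python sketch of that "multi-word starts with" matching, assuming every typed fragment must be a prefix of some word in the term and that word order does not matter (the ordering rule is an assumption; it is not stated above). The candidate term strings are illustrative.

```python
def matches_all_prefixes(query, term):
    """True if every query fragment is a prefix of at least one word in term."""
    words = term.lower().split()
    return all(
        any(word.startswith(prefix) for word in words)
        for prefix in query.lower().split()
    )


def autocomplete(query, terms):
    """Filter candidate terms down to those matching all typed fragments."""
    return [term for term in terms if matches_all_prefixes(query, term)]


candidates = [
    "RNA polymerase II transcription factor binding",
    "transcription factor binding",
    "translation elongation factor activity",
]
suggestions = autocomplete("trans fac pol", candidates)
```

Each extra fragment can only shrink the candidate list, which is why a few keystrokes are enough to isolate a long term.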

cmungall commented 8 years ago

ooh, I like this. @doughowe Is this on user-facing autocompletes as well as curation? I can see this as being massively useful for biocurators (although with lego you tend to go for the subset of classes with fewer words, but not always). I don't have a strong sense of whether the average non-power user would do this much.

kltm commented 6 years ago

See @ValWood transport example on https://github.com/geneontology/amigo/issues/447

ValWood commented 6 years ago

If you are using lucene we have fine tuned our search over many iterations. We always find what we type, pretty much. @kimrutherford can point you to our weighting.

It might do what Doug describes above too. I'm not sure but it seems to work well for us. I think it even handles typos....

kltm commented 6 years ago

Thank you--more input is always appreciated. That said, we already understand why we have this problem and have implemented an experimental tokenizing/parser fix that solves it (https://github.com/berkeleybop/bbop-manager-golr/issues/4). The issue that we currently have is to roll out the solution and update the software to make use of it.

kimrutherford commented 6 years ago

> have implemented an experimental tokenizing/parser fix that solves it (berkeleybop/bbop-manager-golr#4).

That issue mentions EdgeNGramTokenizer, which is what we're using at PomBase.

> It might do what Doug describes above too.

We're doing more or less as Doug describes as well as allowing minor typos. We currently index only the names and synonyms. The synonyms get a lower weighting when we query.
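For readers unfamiliar with edge n-grams, here is a small Python sketch of the general technique: index every leading prefix of each word from names and synonyms, and give synonym matches a lower weight. The weights, the entries, and the function names are invented for illustration and are not PomBase's actual settings; typo tolerance is left out.

```python
def edge_ngrams(token, min_gram=1, max_gram=20):
    """Return the leading prefixes of token, e.g. 'p53' -> ['p', 'p5', 'p53']."""
    return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]


def build_index(entries, name_weight=2.0, synonym_weight=1.0):
    """Map each prefix to {entry_id: best weight among fields that produced it}."""
    index = {}
    for entry_id, entry in entries.items():
        fields = [(entry["name"], name_weight)]
        fields += [(syn, synonym_weight) for syn in entry.get("synonyms", [])]
        for text, weight in fields:
            for token in text.lower().split():
                for gram in edge_ngrams(token):
                    bucket = index.setdefault(gram, {})
                    bucket[entry_id] = max(bucket.get(entry_id, 0.0), weight)
    return index


def search(index, query):
    """Sum the weights of all matched fragments and return ids, best first."""
    scores = {}
    for prefix in query.lower().split():
        for entry_id, weight in index.get(prefix, {}).items():
            scores[entry_id] = scores.get(entry_id, 0.0) + weight
    return sorted(scores, key=scores.get, reverse=True)


# Invented entries: "beta" is a name for entry_b but only a synonym for entry_a.
entries = {
    "entry_a": {"name": "alpha kinase", "synonyms": ["beta"]},
    "entry_b": {"name": "beta ligase"},
}
index = build_index(entries)
results = search(index, "bet")
```

Because the prefixes are computed at index time, a partial word typed by the user is looked up directly instead of requiring a wildcard query.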

doughowe commented 6 years ago

Loooooooong ago @cmungall asked if we use our "multi-word begins with" search mechanism for curators only, or if it is also public facing. I believe it is only for curators. I'm not sure how intuitive or natural it would be for general database users. If you know about it, it whittles down long autosuggest lists quickly, particularly for those pesky long terms you know the name of...sort of.

Actually, I just tried it in our single box search at ZFIN.org and it seems to work there, so that is public facing. It's not hurting anything, and is helpful if you know about it.