cognoma / frontend

Frontend for Project Cognoma
http://cognoma.org/

mygene.info queries are not returning the expected results #169

Closed ramenhog closed 6 years ago

ramenhog commented 6 years ago

Example 1: https://mygene.info/v3/query?q=type_of_gene:protein-coding%20AND%20BRCA&entrezonly=true&size=100&species=human&suggest_from=symbol%5E2,alias

Expected to return BRCA2

Example 2: https://mygene.info/v3/query?q=type_of_gene:protein-coding%20AND%20TP53&entrezonly=true&size=100&species=human&suggest_from=symbol%5E2,alias

Expected: TP53

Example 3: https://mygene.info/v3/query?q=type_of_gene:protein-coding%20AND%20KRAS&entrezonly=true&size=100&species=human&suggest_from=symbol%5E2,alias

Expected KRAS to have highest score

dhimmel commented 6 years ago

For reference, the BRCA2 query returns:

{
  "max_score": null,
  "took": 3,
  "total": 0,
  "hits": []
}

The TP53 query returns TP53TG5 as the top result.

The KRAS query returns (note KRAS has lowest score):

{
  "max_score": 5.8078055,
  "took": 12,
  "total": 3,
  "hits": [
    {
      "_id": "8082",
      "_score": 5.8078055,
      "entrezgene": 8082,
      "name": "sarcospan",
      "symbol": "SSPN",
      "taxid": 9606
    },
    {
      "_id": "6744",
      "_score": 2.9015386,
      "entrezgene": 6744,
      "name": "sperm specific antigen 2",
      "symbol": "SSFA2",
      "taxid": 9606
    },
    {
      "_id": "3845",
      "_score": 1.6174822,
      "entrezgene": 3845,
      "name": "KRAS proto-oncogene, GTPase",
      "symbol": "KRAS",
      "taxid": 9606
    }
  ]
}
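To make the ranking problem concrete, here is a minimal Python sketch that reports where an expected symbol lands in a response. The `symbol_rank` helper is hypothetical (it is not part of mygene.info or the Cognoma frontend), and the data is a trimmed copy of the KRAS response above.

```python
from typing import Optional

# Trimmed copy of the KRAS response shown above (only the fields used here).
response = {
    "total": 3,
    "hits": [
        {"_score": 5.8078055, "symbol": "SSPN"},
        {"_score": 2.9015386, "symbol": "SSFA2"},
        {"_score": 1.6174822, "symbol": "KRAS"},
    ],
}


def symbol_rank(response: dict, symbol: str) -> Optional[int]:
    """Return the 1-based position of `symbol` among the hits, or None if absent.

    Hypothetical helper for illustration only.
    """
    for position, hit in enumerate(response.get("hits", []), start=1):
        if hit.get("symbol") == symbol:
            return position
    return None


print(symbol_rank(response, "KRAS"))   # 3: the exact symbol match is ranked last
print(symbol_rank(response, "BRCA2"))  # None: not among these hits
```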

@newgene, do you still work on mygene.info / have any advice? If not, who is the current contact? Thanks.

newgene commented 6 years ago

@dhimmel Yes, we are looking into this issue now.

andrewsu commented 6 years ago

We will definitely look at fine-tuning our scoring scheme. In the meantime, a few suggestions.

Most simply, dropping the type_of_gene:protein-coding search parameter results in better tuned sorting:

https://mygene.info/v3/query?q=BRCA2&entrezonly=true&size=100&species=human&suggest_from=symbol%5E2,alias
https://mygene.info/v3/query?q=TP53&entrezonly=true&size=100&species=human&suggest_from=symbol%5E2,alias
https://mygene.info/v3/query?q=KRAS&entrezonly=true&size=100&species=human&suggest_from=symbol%5E2,alias

(Note 1: you can also append &fields=entrezgene,name,symbol,ensembl.type_of_gene,taxid which will enable you to do easy filtering on the results.)

(Note 2: the first query in the issue searches for "BRCA", which correctly returns no results. You could do a wildcard query https://mygene.info/v3/query?q=BRCA*&entrezonly=true&size=100&species=human&suggest_from=symbol%5E2,alias but again we need to tune the scoring.)

If you know you are usually searching for a gene symbol, you can also automatically bump the weight in the scoring scheme:

https://mygene.info/v3/query?q=type_of_gene:protein-coding%20AND%20(BRCA2%20OR%20symbol:BRCA2^2)&entrezonly=true&size=100&species=human&suggest_from=symbol%5E2,alias
https://mygene.info/v3/query?q=type_of_gene:protein-coding%20AND%20(TP53%20OR%20symbol:TP53^2)&entrezonly=true&size=100&species=human&suggest_from=symbol%5E2,alias
https://mygene.info/v3/query?q=type_of_gene:protein-coding%20AND%20(KRAS%20OR%20symbol:KRAS^2)&entrezonly=true&size=100&species=human&suggest_from=symbol%5E2,alias
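URLs like these can be assembled programmatically instead of by hand. The following is a rough sketch using only the Python standard library; the function name `boosted_symbol_url` is made up for illustration, and the parameter values are copied from the URLs above. Note that `urlencode` percent-encodes more characters than the hand-written URLs do (for example `:` and the parentheses), which decodes to the same query string.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://mygene.info/v3/query"


def boosted_symbol_url(term: str) -> str:
    """Build a query URL that boosts exact symbol matches (hypothetical helper)."""
    q = f"type_of_gene:protein-coding AND ({term} OR symbol:{term}^2)"
    params = {
        "q": q,
        "entrezonly": "true",
        "size": 100,
        "species": "human",
        "suggest_from": "symbol^2,alias",
    }
    # quote_via=quote encodes spaces as %20 (not "+") and "^" as %5E.
    return f"{BASE_URL}?{urlencode(params, quote_via=quote)}"


print(boosted_symbol_url("KRAS"))
```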

But again, we'll be creating a ticket in the morning to improve scoring without jumping through all these hoops.

cgreene commented 6 years ago

@andrewsu : is there a way to get a good results list for something like a search? For instance, if there's an exact match, we'd love that, but if there are no exact matches then it'd be nice to use the wildcard search.

Also, our system can only deal with protein-coding genes, so any results returned by mygene.info that aren't protein-coding would result in something that we can add to the user's query but that will have no effect. We wanted to avoid that, since it could be confusing to our users.

dhimmel commented 6 years ago

Thanks @andrewsu for the details.

We do want a wildcard search with the following caveat: if there's an exact symbol match it should receive the highest score.

We would be able to filter for type_of_gene:protein-coding client-side, although it would obviously be simpler if we could specify this at the API level.
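For reference, client-side filtering could look roughly like the sketch below. It assumes each hit carries a top-level `type_of_gene` key, which would have to be requested through the `fields` parameter; the exact field path in the response is an assumption here, and the sample data is made up.

```python
from typing import Dict, List


def keep_protein_coding(hits: List[Dict]) -> List[Dict]:
    """Keep only hits whose type_of_gene is exactly "protein-coding".

    Assumes a top-level "type_of_gene" key on each hit (not confirmed here).
    """
    return [hit for hit in hits if hit.get("type_of_gene") == "protein-coding"]


# Made-up sample for illustration; the last two entries are not real gene records.
sample_hits = [
    {"symbol": "KRAS", "type_of_gene": "protein-coding"},
    {"symbol": "HYPOTHETICAL1", "type_of_gene": "pseudo"},
    {"symbol": "HYPOTHETICAL2"},  # field missing, so the hit is dropped
]

print([hit["symbol"] for hit in keep_protein_coding(sample_hits)])  # ['KRAS']
```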

Here's the current spec we're using:

https://github.com/cognoma/frontend/blob/3dc9cf9bab5be4c197bfe67446e8043e6d61319f/app/js/constants.js#L19-L27

Looks like we're only searching symbol and alias. We probably also want to search name, but at a much lower weight. Given my understanding of genes, I think we want the following prioritization of match scores:

  1. exact symbol match
  2. exact alias match
  3. wildcard symbol match
  4. words-in-name match

Is it possible to specify this scoring scheme with the current API?
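Independent of what the API supports, the four-level prioritization above can be written down as a client-side sort, which makes the intended ordering explicit. This is only a sketch: `match_tier` and `rank_hits` are hypothetical names, "wildcard" is interpreted as a prefix match, `alias` is assumed to be either a string or a list of strings, and name matching is a simple whitespace split.

```python
from typing import Dict, List


def match_tier(hit: Dict, term: str) -> int:
    """Return 1-4 for the tiers listed above, or 5 if nothing matches."""
    term = term.lower()
    symbol = hit.get("symbol", "").lower()
    aliases = hit.get("alias", [])
    if isinstance(aliases, str):  # assumed: a single alias may arrive as a string
        aliases = [aliases]
    aliases = [alias.lower() for alias in aliases]
    name_words = hit.get("name", "").lower().split()

    if symbol == term:
        return 1  # exact symbol match
    if term in aliases:
        return 2  # exact alias match
    if symbol.startswith(term):
        return 3  # wildcard (prefix) symbol match
    if any(word.startswith(term) for word in name_words):
        return 4  # words-in-name match
    return 5


def rank_hits(hits: List[Dict], term: str) -> List[Dict]:
    """Sort hits by tier; sorted() is stable, so ties keep their API order."""
    return sorted(hits, key=lambda hit: match_tier(hit, term))


# Symbols and names taken from the KRAS response earlier in this thread.
hits = [
    {"symbol": "SSPN", "name": "sarcospan"},
    {"symbol": "SSFA2", "name": "sperm specific antigen 2"},
    {"symbol": "KRAS", "name": "KRAS proto-oncogene, GTPase"},
]

print([hit["symbol"] for hit in rank_hits(hits, "KRAS")])  # ['KRAS', 'SSPN', 'SSFA2']
```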

andrewsu commented 6 years ago

It's not entirely clear to me why, but if you change your q parameter from type_of_gene:protein-coding AND BRCA to type_of_gene:protein-coding AND (BRCA* OR symbol:BRCA OR symbol:BRCA*), I think you get the desired behavior. Seems to work if you have a full gene symbol specified, e.g.:

https://mygene.info/v3/query?q=type_of_gene:protein-coding%20AND%20(BRCA2*%20OR%20symbol:BRCA2*%20OR%20symbol:BRCA2)&entrezonly=true&size=100&species=human&suggest_from=symbol%5E2,alias
https://mygene.info/v3/query?q=type_of_gene:protein-coding%20AND%20(TP53*%20OR%20symbol:TP53*%20OR%20symbol:TP53)&entrezonly=true&size=100&species=human&suggest_from=symbol%5E2,alias
https://mygene.info/v3/query?q=type_of_gene:protein-coding%20AND%20(KRAS*%20OR%20symbol:KRAS*%20OR%20symbol:KRAS)&entrezonly=true&size=100&species=human&suggest_from=symbol%5E2,alias

and give reasonable answers if you have a partial match:

https://mygene.info/v3/query?q=type_of_gene:protein-coding%20AND%20(BRCA*%20OR%20symbol:BRCA*%20OR%20symbol:BRCA)&entrezonly=true&size=100&species=human&suggest_from=symbol%5E2,alias
https://mygene.info/v3/query?q=type_of_gene:protein-coding%20AND%20(TP5*%20OR%20symbol:TP5*%20OR%20symbol:TP5)&entrezonly=true&size=100&species=human&suggest_from=symbol%5E2,alias
https://mygene.info/v3/query?q=type_of_gene:protein-coding%20AND%20(KRA*%20OR%20symbol:KRA*%20OR%20symbol:KRA)&entrezonly=true&size=100&species=human&suggest_from=symbol%5E2,alias

Hmm, can't seem to figure out how to do the words-in-name match. Will ping the pros to chime in here...

cgreene commented 6 years ago

Interesting! The results in those answers appear to be sufficient for our purposes at this time, I think, though words-in-name would be ideal. Can we count on this behavior being stable at this time? May wait to see what the pros say 😁

andrewsu commented 6 years ago

I agree, we need to work on the default tuning. See the ticket I just created above. Yes, pros have been pinged (who can comment on stability of the query pattern above).

-- MyGene Helpdesk Technician

cyrus0824 commented 6 years ago

Hi everyone. I thought I would chime in with an idea here. You can specify the relative scoring weights of different search terms in our current API spec using the "^" operator. It only works for fielded queries though. For example, the scheme @dhimmel described could be implemented using a query like this:

https://mygene.info/v3/query?q=symbol:KRA^5 OR alias:KRA^3 OR symbol:*KRA*^2 OR name:*KRA*^1

Basically it's an exact match for symbol (weighted at 5), an exact match for alias (weighted at 3), a wildcard for symbol (weighted at 2) and a wildcard for name (weighted at 1). The query still weights human, mouse, rat hits higher as well...

Hope this helps

Cyrus

dhimmel commented 6 years ago

@cyrus0824 thanks for that info! Should the second symbol match be symbol:KRA*^2 for the wildcard?

Is it possible to put the entrezonly=true and species=human constraints on that query? If so, then I think all of our needs would be met.

cyrus0824 commented 6 years ago

@dhimmel yes...it interpreted those two * as italics.... I updated the post above :)

cyrus0824 commented 6 years ago

About the entrezonly=true and species=human, yes those will both work fine (as well as any other options)...

cyrus0824 commented 6 years ago

There is also the related issue that @andrewsu opened on mygene.info. The response I put there might be another option for you. If you can't make the query string work as you want, you can quite easily define your own query on mygene. See https://github.com/biothings/mygene.info/issues/32 for more information if you're interested...

dhimmel commented 6 years ago

How does matching multi-word fields work? Is each word its own token, or are spaces just like any other character? For example, let's use the gene name KRAS proto-oncogene, GTPase. Would a query of proto-onco match query* or only *query*?

cyrus0824 commented 6 years ago

@dhimmel I believe for the "name" field, it is tokenized on whitespace, so it should match both query* and *query*. Be careful with double escaping the "-", I believe it's a reserved character. You can find out more (including the reserved characters) here: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html#query-string-syntax
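One way to handle reserved characters in user input is to backslash-escape them before building the query. The sketch below follows the reserved-character list on the Elasticsearch page linked above; `escape_term` is a hypothetical helper, and whether mygene.info passes these escapes through unchanged is an assumption. The characters `<` and `>` are stripped instead of escaped, since that page states they cannot be escaped.

```python
import re

# Reserved characters in Elasticsearch query-string syntax, except "<" and ">",
# which are removed below because they cannot be escaped.
_RESERVED = re.compile(r'([+\-=&|!(){}\[\]^"~*?:\\/])')


def escape_term(term: str) -> str:
    """Backslash-escape reserved characters in a user-supplied term."""
    without_angles = term.replace("<", "").replace(">", "")
    return _RESERVED.sub(r"\\\1", without_angles)


print(escape_term("proto-onco"))  # proto\-onco
print(escape_term("KRAS"))        # KRAS
```

Because `*` is itself reserved, the escaping has to happen before any wildcard is appended, e.g. `escape_term(user_input) + "*"`.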

dhimmel commented 6 years ago

@ramenhog let's see if the following change to our query improves results:

q=type_of_gene:protein-coding AND (symbol:SEARCH^5 OR alias:SEARCH^3 OR symbol:SEARCH*^2 OR name:SEARCH*^1)

where SEARCH is the user-supplied search term. Keeping species, entrezonly, and size as they are now.
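As a rough Python sketch of this spec (the frontend itself is JavaScript, so this is only an illustration): `QUERY_TEMPLATE` and `build_params` are hypothetical names, and the values for species, entrezonly, and size are the ones used in the URLs earlier in this thread.

```python
QUERY_TEMPLATE = (
    "type_of_gene:protein-coding AND "
    "(symbol:{term}^5 OR alias:{term}^3 OR symbol:{term}*^2 OR name:{term}*^1)"
)


def build_params(search: str) -> dict:
    """Return query parameters for the weighted search (hypothetical helper).

    `search` is the user-supplied term; it is inserted as-is, so any escaping
    of reserved characters has to happen before calling this function.
    """
    return {
        "q": QUERY_TEMPLATE.format(term=search),
        "species": "human",
        "entrezonly": "true",
        "size": 100,
    }


print(build_params("KRAS")["q"])
```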

ramenhog commented 6 years ago

@dhimmel 👍 i'll try it out. Do we still want suggest_from as a param?

dhimmel commented 6 years ago

Do we still want suggest_from as a param?

I don't think so, since we are now explicitly specifying the fields to search in the query.

ramenhog commented 6 years ago

It's looking good 🎉 ! These are the results I get after using that updated query with the use cases I initially posted:

KRAS: [screenshot of search results, 2018-04-09]

BRCA2: [screenshot of search results, 2018-04-09]

TP53: [screenshot of search results, 2018-04-09]

Please let me know if there are any other cases I should test. If it all looks good, I'll get a PR out for fixing this 👍

cgreene commented 6 years ago

Woo! 🎉 This looks much better!

dhimmel commented 6 years ago

If it all looks good, I'll get a PR out for fixing this

Let's PR. We'll deal with remaining issues as they come up as this is already a huge improvement.

cyrus0824 commented 6 years ago

Hi everyone, glad this is starting to look better. A couple more things I wanted to suggest:

1) The weights I picked in the initial example were arbitrary, and could result in extreme overweighting of certain types of hits (this may or may not be what you're interested in). In general we've had the best luck picking weights between 0.5 and 1.5, as it seems not to squash the inter-hit scoring differences. You can see what I'm talking about in the _score of each hit in the weighted query (some are ~350, others are ~5). You could probably get the same hit ordering using less exaggerated weights...

2) Specifically @dhimmel : after looking into the elasticsearch text segmentation algorithm, I saw that the "-" character is actually a token delimiter :( This means that "proto" and "oncogene" are separate tokens, and at best "name:proto-onco" would need to be searched like "name:proto AND name:onco".
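The rewrite described in point 2 can be sketched as follows. `name_clauses` is a hypothetical helper, splitting only on whitespace and "-" is a simplification of the real Elasticsearch tokenizer, and the optional trailing wildcard is an addition that was not proposed in this thread.

```python
import re


def name_clauses(phrase: str, wildcard_last: bool = False) -> str:
    """Turn a phrase into one name: clause per token, joined with AND.

    Tokens are split on whitespace and "-" only (a simplification).
    """
    tokens = [token for token in re.split(r"[\s\-]+", phrase) if token]
    clauses = [f"name:{token}" for token in tokens]
    if wildcard_last and clauses:
        clauses[-1] += "*"  # lets a partial last word match, e.g. onco* for oncogene
    return " AND ".join(clauses)


print(name_clauses("proto-onco"))                      # name:proto AND name:onco
print(name_clauses("proto-onco", wildcard_last=True))  # name:proto AND name:onco*
```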