ga4gh / mme-apis

Documentation for the MatchmakerExchange APIs
https://github.com/ga4gh/mme-apis

Allow for optional scoring break down. #85

Open Relequestual opened 9 years ago

Relequestual commented 9 years ago

We are currently developing our match scoring system. We are finding that phenotypic and genotypic matches may be very different, and that combining them into a single score leaves out some important information.

I suggest we allow for an additional score breakdown: one score for the genotypic data match and one for the phenotypic data match. At least for us (DECIPHER), this would be something we would want to provide.

I know everyone's busy right now, so I'll make a pull request on this when I have the time.

fschiettecatte commented 9 years ago

That is a good point. It may be best to provide a score for each of the sections, one for disorders, one for features and one for genomic features.

Relequestual commented 9 years ago

@fschiettecatte Do you intend to implement, or have you already implemented, scoring for matches based on annotated disorders? Is there a use case where such a score would be useful?

fschiettecatte commented 9 years ago

@Relequestual It would tell you whether there was a match there or not. GeneMatcher will map from Orphanet numbers to MIM numbers and return MIM numbers, so looking at the data it may not be immediately obvious that a match was made that way.

Relequestual commented 9 years ago

That makes sense. How would you score it? Simply 1 or 0?

fschiettecatte commented 9 years ago

Yes. Right now I am just setting the score to 1 on all submissions that match. Not sure what a hit on a gene means, or a variant either. I need to do more thinking on this. Phenotypes are easier because one is looking at overlap.
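The 0-or-1 disorder scoring described above could be sketched as follows. This is only an illustration: the Orphanet-to-MIM table here is a hypothetical stand-in, not GeneMatcher's actual mapping, and the function names are invented for this sketch.

```python
# Hypothetical Orphanet -> MIM mapping; a real service would use a
# curated table such as the one GeneMatcher maintains.
ORPHANET_TO_MIM = {
    "Orphanet:1234": "MIM:610536",  # invented entry for illustration
}

def normalize(disorder_id):
    """Map Orphanet identifiers onto MIM so both sides compare alike."""
    return ORPHANET_TO_MIM.get(disorder_id, disorder_id)

def disorder_score(query_disorders, candidate_disorders):
    """Return 1.0 if any disorder identifier matches after mapping, else 0.0."""
    query = {normalize(d) for d in query_disorders}
    candidate = {normalize(d) for d in candidate_disorders}
    return 1.0 if query & candidate else 0.0
```

The binary score makes the cross-nomenclature match visible even when the returned identifiers look nothing like the submitted ones.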

buske commented 9 years ago

I completely support being able to provide additional scores, at all levels of resolution. We don't have disease scoring at the moment, but we did a while back and it was 0 or 1. Ideally you'd do something more sophisticated based on ORDO.

I think that whatever implementation we choose should be extensible so we can return scores for specific phenotypes, genes, variants, and disorders. That means a flat solution like the following probably isn't best:

{...,
"score": {
    "patient": 1.0,
    "disorders": 0.9,
    "features": 0.7,
    "genomicFeatures": 0.8
},
...}

I took a stab at doing something nested and it quickly turned into a mess, though. I can't think of a pretty way to provide an overall score for disorders, as well as a specific score for some of the disorders. Any suggestions?

Relequestual commented 9 years ago

Currently we're working out a way to do variant scoring based on functional overlap (gene and consequence). We'll be running our ideas / current status past the higher-ups this week, and hopefully have a clearer picture of how they would expect it to work. I don't know if we'll be able to share this or not, but it's looking pretty good I think.

This means that we will have a per-variant match score. I was thinking that what would be most interesting to the user is the highest score, as I believe the user would want to find the patient that has the most similar variant, and not the most similar group of variants. (This isn't the case for phenotypes, where the score finds the most overlapping matches for a bag of terms.)

fschiettecatte commented 9 years ago

@buske Did you embed a score with each element in the entry, i.e. with each disorder, feature and genomic feature? That with a top level score would probably work.

Relequestual commented 9 years ago

I don't think you can do a score per feature, as the score is calculated based on the whole "bag" of phenotype terms.
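The bag-of-terms point above can be illustrated with a minimal overlap score. This is a sketch only: real MME services use ontology-aware semantic similarity over HPO, not plain set overlap, and the function name is invented here.

```python
def feature_score(query_terms, candidate_terms):
    """Jaccard overlap between two bags of HPO term IDs.

    Plain set overlap is just an illustration of why the score
    belongs to the whole bag of terms rather than to any single
    feature: the denominator depends on both bags at once.
    """
    q, c = set(query_terms), set(candidate_terms)
    if not q and not c:
        return 0.0
    return len(q & c) / len(q | c)
```

Because the score is a property of the pair of bags, there is no natural way to attribute a fraction of it to one feature in isolation.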

buske commented 9 years ago

@fschiettecatte We haven't embedded any other scores at the moment, but that is absolutely an option. The caveat is that it is difficult to have an overall score for a field that is a list (like disorders, features, and genomicFeatures).

Relequestual commented 9 years ago

How to combine scores to a single score is a question I'm running past our project owner today. I'll let you know what he decides / comments.

Relequestual commented 9 years ago

It has been suggested by Dr Hurles that we should use Fisher's combined probability test, aka Fisher's method, for combining the two scores. I'm sure there are a number of scientific reasons behind this that I won't fully understand!

buske commented 9 years ago

Does this need to happen for v1.0 and the paper, or can this wait for a later version?

Relequestual commented 9 years ago

I was expecting it to be part of the 1.1 release, but that we should be aware that it's something that would really add value, and should be done soon! I'm not even sure we will have genomic scoring in time for our next release!

buske commented 9 years ago

@Relequestual sounds good!

Relequestual commented 9 years ago

Regarding your earlier comment https://github.com/MatchMakerExchange/mme-apis/issues/85#issuecomment-77063680 There's also the situation of comparing multiple variants. My thought was that surely you take the MAX score from each comparison to arrive at the score for genomicFeatures, for example.
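The MAX idea above is simple to sketch. This is an illustration under assumptions: the function names are invented, and the per-variant comparison is left as a parameter because each service scores variant pairs differently.

```python
def genomic_features_score(query_variants, candidate_variants, score_pair):
    """Score every query/candidate variant pair and keep the best one.

    `score_pair` stands in for whatever per-variant comparison a
    service uses (e.g. gene plus consequence overlap). Taking the max
    reflects the idea that the user wants the single most similar
    variant, not the most similar group of variants.
    """
    scores = [score_pair(q, c) for q in query_variants for c in candidate_variants]
    return max(scores, default=0.0)
```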

We plan to combine the scores using Fisher's method for combining P values. Dr Hurles has suggested that even though there are more modern methods for combining P values, none of them stand out as being obvious improvements.
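For reference, Fisher's method can be computed without a stats library, because for an even number of degrees of freedom the chi-square survival function has a closed form. This sketch assumes the inputs behave like independent p-values, which match scores may only approximate.

```python
import math

def fisher_combine(p_values):
    """Fisher's method: combine independent p-values into one.

    The statistic S = -2 * sum(ln p_i) follows a chi-square
    distribution with 2k degrees of freedom. For even degrees of
    freedom the survival function is exp(-S/2) * sum_{i<k} (S/2)^i / i!,
    so the combined p-value can be computed directly.
    """
    k = len(p_values)
    s_half = -sum(math.log(p) for p in p_values)  # this is S/2
    return math.exp(-s_half) * sum(s_half**i / math.factorial(i) for i in range(k))
```

For example, combining two p-values of 0.05 yields roughly 0.017, and a single p-value is returned unchanged.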

Relequestual commented 9 years ago

Technical call 4th August

Discussed the possibility of cross-cutting scores (scoring at each level where scoring is required). Also discussed using JSONPath to allow detailed scoring.

buske commented 9 years ago

I've spent some time thinking about this and have identified 4 feasible approaches. Comments and additional pros/cons welcome. It would be great if everyone could vote on their favorite(s).

  1. Hybrid approach: keep the top-level category scores in one object (like we have right now), and add per-item (e.g. per Ontology term) scores.
    • Pros:
      • simplicity
      • scores for objects are within objects so it's easier to find them
    • Cons:
      • no semantic reason why these two kinds of scores should be stored so separately
      • breaks down the separation between patient data and query/matchmaking logic; to me, having a score within a feature suggests that the score is data associated with that feature, rather than a score added on top as a result of matching this particular patient
  2. JSONPath approach: return all scores in a separate object apart from the rest of the patient data and reference specific nodes using jsonpath
    • Pros:
      • flexible/powerful: able to attach scores anywhere in the parse tree
      • easy to specify a score for a data item
    • Cons:
      • requires separate library to parse
      • difficult to find the score for a specific data item
  3. Match view approach: return scores in a separate match object including the data used to make that score. Something like:

    {
      "results" : [
        {
          "match" : {
            "patient" : 0.8,
            "inheritanceMode" : {
              "score" : 0.2,
              "query" : {"id": "HP:0000006"},
              "match" : {"id": "HP:0012275"}
            },
            "features" : {
              "score": 0.6,
              "groups": [
                {
                  "score": 0.5,
                  "query": [{"id": "HP:0123456"}],
                  "match": []
                }
              ]
            },
            "genomicFeatures" : {
              "score": 0.9,
              "groups": [
                {
                  "score": 1.0,
                  "query": [{"gene":{"id": "Ensembl:ENSG00012345"}}],
                  "match": [{"gene":{"id": "HGNC:1234"}}]
                }
              ]
            }
          },
          "patient" : {…}
        },
        …
      ]
    }
    • Pros:
      • separation of responsibility
      • easier to display and visualize what the match is actually based on
    • Cons:
      • data redundancy between match and patient
      • introduces new data model for handling scored groups of features/genomicFeatures
  4. Replace lists in current data model with objects with an items field to allow adding fields, e.g. scores/weights, at that level. For example:

    "features": [
     {
       "id": "HP:0123456"
     }, ...
    ]

    would instead be:

    "features": {
      "items" : [
        {
          "id": "HP:0123456"
        }, ...
      ]
      // "score": 0.1234 could go here on response
    }
    • Pros:
      • Simple to fix/update in code; no libraries required
      • Adds even more flexibility in the future, since now lists behave like other nodes and other fields can be added to them just as easily
    • Cons:
      • No separation between data model and matching results
      • Adds some bloat to API
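The list-to-object rewrite in option 4 is mechanical. As a sketch (assuming the field names from the proposal above; the helper itself is hypothetical), the response side might transform a patient like this:

```python
def wrap_lists(patient, list_fields=("disorders", "features", "genomicFeatures")):
    """Rewrite each top-level list as an object with an `items` field,
    leaving room for sibling fields such as `score` on responses.
    Purely illustrative; field names follow option 4 above.
    """
    out = dict(patient)  # shallow copy; original document untouched
    for field in list_fields:
        if isinstance(out.get(field), list):
            out[field] = {"items": out[field]}
    return out
```

A score could then be attached next to `items` without disturbing the list contents.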

Relequestual commented 9 years ago

Vote for option 2. Seems most flexible and powerful, while not breaking anything that's currently there.

Relequestual commented 9 years ago

We should possibly have a way to identify which scoring algorithm a score was obtained from.

fschiettecatte commented 9 years ago

Option 2 is fine with me.

Relequestual commented 9 years ago

A note on option 2. The library support looks good in most languages. Perl support is good, but there is an outstanding issue in the only library that implements it, and this isn't going to get patched. There is a fix, but it won't be applied to the module (the owner is MIA, so to speak). I may end up releasing my own version to work around this issue.

The issue would mean that we would need to mandate that dot notation is used rather than quoted keys inside square brackets. Both are valid according to the spec, but the Perl implementation only supports dot notation.
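Purely as an illustration of the dot-notation subset under discussion, a consumer could resolve such paths with a few lines of code. This is a sketch, not the Perl library mentioned above; a production service would use a proper JSONPath implementation.

```python
import re

def resolve_dot_path(doc, path):
    """Resolve a dot-notation JSONPath like `$.features[1].id`.

    Only the subset under discussion is handled: dot notation plus
    numeric subscripts. Bracket-quoted keys (e.g. $["features"]) are
    deliberately not supported, mirroring the proposed restriction.
    """
    if not path.startswith("$"):
        raise ValueError("path must start with $")
    node = doc
    # Walk tokens of the form `.name` or `[index]` in order.
    for key, index in re.findall(r"\.([A-Za-z_]\w*)|\[(\d+)\]", path[1:]):
        node = node[key] if key else node[int(index)]
    return node
```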

fschiettecatte commented 9 years ago

OK, though I am still a little unclear on how the JSONPath would be incorporated into the response; I'm just not really familiar with how it works in practice.

Relequestual commented 9 years ago

Agreed on the call that this is fine. Agreed we would for now do scoring only, and no additional information (for this issue). I will create some example JSON for comments. If this is OK, I will make a pull request.

Relequestual commented 9 years ago

Here is my example JSON for what I expect additional scoring with a hybrid approach, including JSONPath, would look like...

{
  "results" : [
    {
      "meta" : {
        "scores" : {
          "patient" : 0.7,
          "features" : 0.323,
          "genomicFeatures" : 0.9
        },
        "extended_scores" : [
          { "path" : "/disorders[2]", "score" : 0.2 },
          { "path" : "/genomicFeatures[0]", "score" : 0.2 }
        ]
      },
      "patient" : {
        "contact": {
          "href": "http://www.ncbi.nlm.nih.gov/pubmed/22305528",
          "institution": "Children's Hospital of Eastern Ontario",
          "name": "Lijia Huang"
        },
        "disorders": [
          { "id": "MIM:610536" },
          { "id": "MIM:610453" },
          { "id": "MIM:610526" },
          { "id": "MIM:610536" }
        ],
        "features": [
          {
            "id": "HP:0008773",
            "observed": "yes"
          },
          {
            "id": "HP:0000413",
            "observed": "yes"
          },
          {
            "id": "HP:0000453",
            "observed": "yes"
          }
        ],
        "genomicFeatures": [
          {
            "genes": [
              {
                "id": "EFTUD2"
              }
            ],
            "type": {
              "id": "SO:0001587",
              "label": "STOPGAIN"
            },
            "variant": {
              "alternateBases": "A",
              "assembly": "GRCh37",
              "end": 42929131,
              "referenceBases": "G",
              "referenceName": "17",
              "start": 42929131
            },
            "zygosity": 1
          }
        ],
        "id": "P0000079",
        "label": "178_M43377",
        "test": true
      }
    }
  ]
}

We have previously talked about, and probably plan to, adding a "meta" section for information not directly related to the patient. In the example, I have gone for a hybrid approach of what this might look like for scoring.

The simple scores are in a "scores" object, consisting of a mandatory patient score and optional other scores; I assume features and genomicFeatures. We can add to the list as required.

Then there is an extended scoring section, where a JSONPath-style path identifies the individual element a match score applies to. For example, the 3rd disorder has a score of 0.2, and the first genomicFeature also has a score of 0.2. (Don't read into the numbers being combined in any way to make the scores in the main "scores" section; these values are arbitrary.)

Any thoughts or concerns based on this?

fschiettecatte commented 9 years ago

This looks ok to me, I would rename 'extended_scores' to 'extendedScores' to keep with the formatting conventions we seem to be following so far.

Relequestual commented 9 years ago

Oh, of course. I've fallen for that a few times! Looking for an additional +1 =]

fschiettecatte commented 9 years ago

+1 here 😀

Relequestual commented 9 years ago

Need 1 more iirc

Relequestual commented 8 years ago

@buske can you review and +1 on this? I feel we should still include this in 1.1, as v2 may take some time given complexity and outstanding issues.

buske commented 8 years ago

Looking back at this, I think this is a reasonable solution to the use case you presented and with backwards compatibility as a requirement.

My reservation is the following: I think that in order for us to add this to the API, we should have at least one group that is committed to sending/serving this data (e.g. DECIPHER) and at least one group that is committed to receiving/displaying this data in results. Does any group want to step up? @Relequestual, I'd invite DECIPHER to start sending this data even before it is added to the official spec and make sure it goes well.

Relequestual commented 8 years ago

Alrighty then. I'll make a note that we want to do this. There are some outstanding issues surrounding MME within DECIPHER that need reviewing anyhow. Difficult to commit to a timeline till after our next planning meeting.

fschiettecatte commented 7 years ago

Following the Baltimore MME meeting, this was deferred to 2.0