Open Relequestual opened 9 years ago
That is a good point. It may be best to provide a score for each of the sections, one for disorders, one for features and one for genomic features.
@fschiettecatte Do you intend to implement, or have you already implemented, scoring for matches based on annotated disorders? Is there a use case where such a score could be useful?
@Relequestual It would tell you whether there was a match there or not. GeneMatcher will map from Orphanet numbers to MIM numbers and return MIM numbers, so looking at the data it may not be immediately obvious that a match was made that way.
That makes sense. How would you score it? Simply 1 or 0?
Yes. Right now I am just setting the score to 1 on all submissions that match. Not sure what a hit on a gene means, or a variant either. I need to do more thinking on this. Phenotypes are easier because one is looking at overlap.
I completely support being able to provide additional scores, at all levels of resolution. We don't have disease scoring at the moment, but we did a while back and it was 0 or 1. Ideally you'd do something more sophisticated based on ORDO.
I think that whatever implementation we choose should be extensible so we can return scores for specific phenotypes, genes, variants, and disorders. That means a flat solution like the following probably isn't best:
{...,
  "score": {
    "patient": 1.0,
    "disorders": 0.9,
    "features": 0.7,
    "genomicFeatures": 0.8
  },
...}
I took a stab at doing something nested and it quickly turned into a mess, though. I can't think of a pretty way to provide an overall score for disorders, as well as a specific score for some of the disorders. Any suggestions?
Currently we're working out a way to do variant scoring based on functional overlap (gene and consequence). We'll be running our ideas / current status through the higher-ups this week, and hopefully have a clearer picture of how they would expect it to work. I don't know if we'll be able to share this or not, but it's looking pretty good I think.
This means that we will have a per-variant match score. I was thinking that what would be most interesting to the user is the highest score, as I believe the user would want to find the patient that has the most similar variant, and not the most similar group of variants. (This isn't the case for phenotypes, where the UI score finds the most overlapping matches for a bag of terms.)
@buske Did you embed a score with each element in the entry, i.e. with each disorder, feature and genomic feature? That with a top level score would probably work.
I don't think you can do a score per feature, as the score is calculated based on the whole "bag" of phenotype terms.
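To illustrate why a per-feature score does not fall out naturally, here is a minimal sketch of a set-level phenotype score. It uses plain Jaccard overlap as a stand-in; that choice is an assumption for illustration only, not the scoring any MME node actually uses, and the function name `bag_overlap_score` is hypothetical.

```python
def bag_overlap_score(query_terms, match_terms):
    """Jaccard overlap between two bags of phenotype term IDs (illustrative only)."""
    query, match = set(query_terms), set(match_terms)
    union = query | match
    if not union:
        return 0.0  # nothing annotated on either side, so nothing to compare
    return len(query & match) / len(union)


# Term IDs taken from the examples in this thread.
query = ["HP:0008773", "HP:0000413"]
match = ["HP:0000413", "HP:0000453"]
score = bag_overlap_score(query, match)  # one shared term out of three distinct terms
```

The result depends on both whole sets at once, so there is no single term to attach it to.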
@fschiettecatte We haven't embedded any other scores at the moment, but that is absolutely an option. The caveat is that it is difficult to have an overall score for a field that is a list (like disorders, features, and genomicFeatures).
How to combine scores to a single score is a question I'm running past our project owner today. I'll let you know what he decides / comments.
It has been suggested by Dr Hurles that we should use Fisher's combined probability test, also known as Fisher's method, for combining the two scores. I'm sure there are a number of scientific reasons behind this that I won't fully understand!
Does this need to happen for v1.0 and the paper, or can this wait for a later version?
I was expecting it to be part of the 1.1 release, but that we should be aware that it's something that would really add value, and should be done soon! I'm not even sure we will have genomic scoring in time for our next release!
@Relequestual sounds good!
Regarding your earlier comment https://github.com/MatchMakerExchange/mme-apis/issues/85#issuecomment-77063680: there's also the situation of comparing multiple variants. My thought was that surely you take the MAX score across the comparisons to arrive at the score for genomicFeatures, for example.
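A rough sketch of the "take the MAX" idea, assuming some per-variant comparison already exists. The `pair_score` function below is a made-up placeholder (gene plus consequence, with arbitrary values), not the functional-overlap scoring mentioned earlier in the thread, and the dictionary keys are hypothetical.

```python
from itertools import product


def pair_score(query_variant, match_variant):
    """Hypothetical per-variant score based on gene and consequence."""
    if query_variant["gene"] != match_variant["gene"]:
        return 0.0
    if query_variant["consequence"] == match_variant["consequence"]:
        return 1.0
    return 0.5  # same gene, different consequence (arbitrary illustrative value)


def genomic_features_score(query_variants, match_variants):
    """Take the best score over every query/match variant pair."""
    return max(
        (pair_score(q, m) for q, m in product(query_variants, match_variants)),
        default=0.0,  # no variants on one side means no genomic match
    )


query_variants = [{"gene": "EFTUD2", "consequence": "SO:0001587"}]
match_variants = [
    {"gene": "EFTUD2", "consequence": "SO:0001583"},
    {"gene": "EFTUD2", "consequence": "SO:0001587"},
]
best = genomic_features_score(query_variants, match_variants)
```

With this rule the genomicFeatures score reflects the single most similar variant rather than the similarity of the whole group.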
We plan to combine the scores using Fisher's method for combining P values. Dr Hurles has suggested that even though there are more modern methods for combining P values, none of them stand out as being obvious improvements.
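For readers unfamiliar with it, Fisher's method can be sketched in a few lines. This assumes the phenotype and genotype scores are available as independent p-values, which the thread does not spell out; the function name `fisher_combine` is hypothetical. The statistic is X = -2 Σ ln(p_i), which follows a chi-squared distribution with 2k degrees of freedom under the null, and for even degrees of freedom the tail probability has a closed form, so no external library is needed.

```python
import math


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method (illustrative sketch)."""
    k = len(p_values)
    if k == 0:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    # half_x is X/2, where X = -2 * sum(ln p_i)
    half_x = -sum(math.log(p) for p in p_values)
    # Chi-squared survival function with 2k degrees of freedom:
    #   exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_x / i
        total += term
    return math.exp(-half_x) * total


combined = fisher_combine([0.05, 0.05])
```

Note that the output is itself a p-value, so smaller means a stronger combined match signal; turning that into a 0 to 1 "score" would need a further convention.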
Technical call 4th August
Discussed the possibility of cross-cutting scores (scoring at each level where scoring is required). Also discussed using JsonPath to allow detailed scoring.
I've spent some time thinking about this and have identified 4 feasible approaches. Comments and additional pros/cons welcome. It would be great if everyone could vote on their favorite(s).
Match view approach: return scores in a separate match object including the data used to make that score. Something like:
{
  "results" : [
    {
      "match" : {
        "patient" : 0.8,
        "inheritanceMode" : {
          "score" : 0.2,
          "query" : {"id": "HP:0000006"},
          "match" : {"id": "HP:0012275"}
        },
        "features" : {
          "score": 0.6,
          "groups": [
            {
              "score": 0.5,
              "query": [{"id": "HP:0123456"}],
              "match": []
            }
          ]
        },
        "genomicFeatures" : {
          "score": 0.9,
          "groups": [
            {
              "score": 1.0,
              "query": [{"gene":{"id": "Ensembl:ENSG00012345"}}],
              "match": [{"gene":{"id": "HGNC:1234"}}]
            }
          ]
        }
      },
      "patient" : {…}
    },
    …
  ]
}
Replace lists in current data model with objects with an items field to allow adding fields, e.g. scores/weights, at that level. For example:
"features": [
  {
    "id": "HP:0123456"
  }, ...
]
would instead be:
"features": {
  "items" : [
    {
      "id": "HP:0123456"
    }, ...
  ]
  // "score": 0.1234 could go here on response
}
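As a rough sketch of what this change would mean for a server, the list fields could be wrapped at response time. The helper name `wrap_list_fields` and its parameters are hypothetical, and the set of fields wrapped by default is an assumption based on the lists named in this thread.

```python
def wrap_list_fields(patient, fields=("disorders", "features", "genomicFeatures"), scores=None):
    """Return a copy of patient whose list fields are {"items": [...]} objects."""
    scores = scores or {}
    wrapped = dict(patient)  # shallow copy so the caller's dict is not modified
    for name in fields:
        if isinstance(wrapped.get(name), list):
            container = {"items": wrapped[name]}
            if name in scores:
                container["score"] = scores[name]  # optional per-list score
            wrapped[name] = container
    return wrapped


patient = {"id": "P0000079", "features": [{"id": "HP:0123456"}]}
response_patient = wrap_list_fields(patient, scores={"features": 0.1234})
```

Any client that currently iterates over `features` directly would break, which is the backwards-compatibility cost of this option.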
Vote for option 2. Seems most flexible and powerful, while not breaking anything that's currently there.
Possibly should have a way to identify which scoring algorithm the score is attained from.
Option 2 is fine with me.
A note on option 2. The library support looks good in most languages. Perl support is good, but there is an outstanding issue in the only library that implements it, and this isn't going to get patched. There is a fix, but it won't be applied to the module (the owner is MIA, so to speak). I may end up releasing my own version to work around this issue.
The issue means that we would need to mandate that dot notation is used rather than quoted keys inside square brackets. Both are valid according to the spec, but the Perl implementation only supports dot notation.
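To make the two notations concrete, here is a small sketch that rewrites quoted bracket keys into dot notation. It is a regex-based illustration, not a JSONPath parser: it assumes keys are simple identifiers and leaves anything else (numeric indices, keys with spaces) untouched. The function name `to_dot_notation` is hypothetical.

```python
import re

# Matches ['key'] or ["key"] where key is a simple identifier.
_QUOTED_KEY = re.compile(r"""\[(['"])([A-Za-z_][A-Za-z0-9_]*)\1\]""")


def to_dot_notation(path):
    """Rewrite quoted bracket keys as dot-notation keys (illustrative only)."""
    return _QUOTED_KEY.sub(r".\2", path)


bracket_path = "$['genomicFeatures'][0]['gene']['id']"
dot_path = to_dot_notation(bracket_path)
```

Both forms address the same element; only the second would be usable if dot notation were mandated.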
Ok, though I am still a little unclear how the JSONPath would be incorporated into the response? Just not really familiar how it works in practice.
Agreed on call that this is fine. Agreed we would for now do scoring only, and no additional information (for this issue). I will create some example json for comments. If this is OK, will make a pull request.
Here is my example JSON for what I expect additional scoring with a hybrid approach, including JSONPath, would look like:
{
  "results" : [
    {
      "meta" : {
        "scores" : {
          "patient" : 0.7,
          "features" : 0.323,
          "genomicFeatures" : 0.9
        },
        "extended_scores" : [
          { "path" : "/disorders[2]", "score" : 0.2 },
          { "path" : "/genomicFeatures[0]", "score" : 0.2 }
        ]
      },
      "patient" : {
        "contact": {
          "href": "http://www.ncbi.nlm.nih.gov/pubmed/22305528",
          "institution": "Children's Hospital of Eastern Ontario",
          "name": "Lijia Huang"
        },
        "disorders": [
          { "id": "MIM:610536" },
          { "id": "MIM:610453" },
          { "id": "MIM:610526" },
          { "id": "MIM:610536" }
        ],
        "features": [
          {
            "id": "HP:0008773",
            "observed": "yes"
          },
          {
            "id": "HP:0000413",
            "observed": "yes"
          },
          {
            "id": "HP:0000453",
            "observed": "yes"
          }
        ],
        "genomicFeatures": [
          {
            "genes": [
              {
                "id": "EFTUD2"
              }
            ],
            "type": {
              "id": "SO:0001587",
              "label": "STOPGAIN"
            },
            "variant": {
              "alternateBases": "A",
              "assembly": "GRCh37",
              "end": 42929131,
              "referenceBases": "G",
              "referenceName": "17",
              "start": 42929131
            },
            "zygosity": 1
          }
        ],
        "id": "P0000079",
        "label": "178_M43377",
        "test": true
      }
    }
  ]
}
We have previously talked about, and probably plan to add, a "meta" section for information not directly related to the patient. In the example, I have gone for a hybrid approach of what this might look like regarding scoring.
The simple scores are in a "scores" object, consisting of a mandatory patient score and optional other scores. I assume features and genomicFeatures. We can add to the list as required.
And an extended scoring, where an XPath-like path is used to identify the individual element the match score is based on. For example, the 3rd disorder has a score of 0.2, and the first genomicFeature also has a score of 0.2. (Don't read into the numbers being combined in any way to make the scores in the main scores section. These score values are arbitrary.)
Any thoughts or concerns based on this?
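To show how a receiving client might consume this layout, here is a minimal sketch that checks for the mandatory patient score and resolves each extended score's path against the patient object. It only understands the restricted path form used in the example (`/name` or `/name[index]`, zero-based), which is an assumption; it is not a JSONPath or XPath implementation, and the function names are hypothetical.

```python
import re

# One path segment: a field name with an optional [index].
_SEGMENT = re.compile(r"^([A-Za-z_]\w*)(?:\[(\d+)\])?$")


def resolve_path(patient, path):
    """Follow a path such as "/disorders[2]" into the patient object."""
    node = patient
    for segment in path.strip("/").split("/"):
        found = _SEGMENT.match(segment)
        if found is None:
            raise ValueError(f"unsupported path segment: {segment!r}")
        node = node[found.group(1)]
        if found.group(2) is not None:
            node = node[int(found.group(2))]  # zero-based list index
    return node


def scored_elements(result):
    """Pair each extended score with the patient element it refers to."""
    meta = result["meta"]
    if "patient" not in meta["scores"]:
        raise ValueError("the overall patient score is mandatory")
    return [
        (entry["path"], entry["score"], resolve_path(result["patient"], entry["path"]))
        for entry in meta.get("extended_scores", [])
    ]


result = {
    "meta": {
        "scores": {"patient": 0.7, "features": 0.323, "genomicFeatures": 0.9},
        "extended_scores": [
            {"path": "/disorders[2]", "score": 0.2},
            {"path": "/genomicFeatures[0]", "score": 0.2},
        ],
    },
    "patient": {
        "disorders": [{"id": "MIM:610536"}, {"id": "MIM:610453"}, {"id": "MIM:610526"}],
        "genomicFeatures": [{"genes": [{"id": "EFTUD2"}]}],
    },
}
elements = scored_elements(result)
```

Because the extended scores live under "meta", a client that ignores them still sees an unchanged patient object.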
This looks OK to me. I would rename 'extended_scores' to 'extendedScores' to keep with the formatting conventions we seem to be following so far.
Oh, of course. I've fallen for that a few times! Looking for an additional +1 =]
+1 here 😀
Need 1 more iirc
@buske can you review and +1 on this? I feel we should still include this in 1.1, as v2 may take some time given complexity and outstanding issues.
Looking back at this, I think this is a reasonable solution to the use case you presented and with backwards compatibility as a requirement.
My reservation is the following: I think that in order for us to add this to the API, we should have at least one group that is committed to sending/serving this data (e.g. DECIPHER) and at least one group that is committed to receiving/displaying this data in results. Does any group want to step up? @Relequestual, I'd invite DECIPHER to start sending this data even before it is added to the official spec and make sure it goes well.
Alrighty then. I'll make a note that we want to do this. There are some outstanding issues surrounding MME within DECIPHER that need reviewing anyhow. Difficult to commit to a timeline till after our next planning meeting.
Following the Baltimore MME meeting, this was deferred to 2.0
We are currently developing our match scoring system. We are finding that phenotypic and genotypic matches may be very different, and that combining the scores into a single score leaves out some important information.
I suggest we allow for an additional score breakdown: a score for the genotypic data match and a score for the phenotypic data match. At least for us (DECIPHER), this would be something we would want to provide.
I know everyone's busy right now, so I'll make a pull request on this as I have the time.