harindra-a commented 7 years ago

Automated aggregation of Matchmaker Exchange network operational statistics

1. Abstract:

The Matchmaker Exchange (MME) has grown into an active, dynamic collection of distributed systems that operate largely independently of each other. This independence makes it hard to assess the growth, and to monitor the operational health, of the network as a whole. While most MME members gather operational metrics, we have no scalable, effective means of sharing and aggregating them. Integrating a mechanism for automatically gathering metrics would not only be a powerful means of gauging how effective we are; it would also set us on the path to continuous improvement.

2. Implementation strategy:

We propose that implementation be done in two stages.

Phase-1: Implement a basic subset of metrics that can be agreed upon and implemented easily. This phase will set up the infrastructure for the incremental addition of more complex metrics that need further discussion and methods development. (August 2017)

Phase-2: Leveraging the infrastructure put in place in Phase-1, we will incrementally add increasingly complex metrics. These are metrics that require new algorithm development and/or political/policy discussion.

3. Proposed completion-goals and timeline:

Phase	Design time	Implementation time
1	3 months	4 months
2	1 year	1 year

4. Proposed architecture:

Each member center exposes a single HTTPS GET endpoint: /metrics
“/metrics” WILL require authentication and will be an endpoint for any MME node interested in operational metrics. None of the data shared through this endpoint would be sensitive, merely metrics.
“/metrics” will return a JSON object containing named metrics from the node.
One or more aggregation servers (we could use Exchange Servers for this purpose) would aggregate these “/metrics” endpoints and provide an instant snapshot of the network as a whole. These servers will be discussed in Phase-2.
The main Matchmaker Exchange website would have a dynamic page that queries one of these aggregation servers to get an instant snapshot of network health, which it will visualize. This will be discussed in Phase-2.
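
To make the aggregation step concrete, here is a minimal sketch in Python of how an aggregation server might combine payloads it has already fetched from several nodes. The function name `aggregate_counts` and the two sample payloads are made up for illustration; nothing here has been agreed, and fetching and authentication are left out. It only sums integer counters, since means cannot simply be added, and note that summing the "unique" counts across nodes would overcount anything two nodes share.

```python
from typing import Dict, Iterable


def aggregate_counts(payloads: Iterable[dict]) -> Dict[str, int]:
    """Sum the integer-valued metrics reported by each node (illustrative only)."""
    totals: Dict[str, int] = {}
    for payload in payloads:
        for name, value in payload.get("metrics", {}).items():
            # Skip means and any other non-integer value: they cannot simply be added.
            if isinstance(value, bool) or not isinstance(value, int):
                continue
            totals[name] = totals.get(name, 0) + value
    return totals


# Hypothetical payloads, shaped like the "/metrics" JSON proposed in this issue.
node_a = {"metrics": {"numberOfCases": 120, "numberOfRequestsReceived": 40,
                      "meanNumberOfGenesPerCase": 1.5}}
node_b = {"metrics": {"numberOfCases": 80, "numberOfRequestsReceived": 10}}

snapshot = aggregate_counts([node_a, node_b])
print(snapshot)
```

A node that leaves out a field contributes nothing to that total, which matches the "implement as many as possible" spirit of the proposal.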


6. Finalized Phase-1 metrics

Notes:

Endpoint “/metrics” will require authentication
format: JSON
While implementing all of these fields is highly encouraged, given the benefits to fund-raising, visibility, and network health, resources are limited, so please implement as many of them as possible.

Fields are:

{
"metrics": {
    "numberOfSubmitters": 0, 
    "numberOfUniqueGenes": 0, 
    "numberOfUniqueGenesMatched": 0,
    "numberOfUniqueFeatures": 0, 
    "numberOfCasesWithDiagnosis": 0, 
    "numberOfCases": 0, 
    "meanNumberOfGenesPerCase": 0.0, 
    "meanNumberOfVariantsPerCase": 0.0,
    "meanNumberOfFeaturesPerCase": 0.0, 
    "numberOfRequestsReceived": 0,
    "numberOfPotentialMatchesSent": 0
    }
}

Field definitions are:

Field	Default value (implies type)	Definition
numberOfSubmitters	0	the number of PIs (or data owners) who submitted cases, available for matching, to that node.
numberOfUniqueGenes	0	the number of unique genes that the node has available to be matched on by other nodes of the MME
numberOfUniqueGenesMatched	0	of the unique genes available to be matched, the number that were actually matched to requests
numberOfUniqueFeatures	0	the number of unique combinations of features available for matching.
numberOfCasesWithDiagnosis	0	the number of patients with diagnosis (as per MME API field) available for matching.
numberOfCases	0	the total number of cases available for matching within the MME
meanNumberOfGenesPerCase	0.0	the mean number of genes in a case in that node. Only cases available for matching can be considered. This is a proxy for the richness or descriptiveness of the cases of a node.
meanNumberOfVariantsPerCase	0.0	the mean number of variants in a case in that node. Only cases available for matching can be considered. This is a proxy for the richness or descriptiveness of the cases of a node.
meanNumberOfFeaturesPerCase	0.0	the mean number of features in a case in that node. Only cases available for matching can be considered. This is a proxy for the richness or descriptiveness of the cases of a node.
numberOfRequestsReceived	0	the number of match requests the node gets
numberOfPotentialMatchesSent	0	the number of matched results the node sends out
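
As a rough illustration of how the "default value implies type" convention could be checked, here is a small Python sketch. The helper `check_metrics` and its error messages are hypothetical, not part of the proposal. It treats every field as optional (per the note above), accepts whole numbers for the mean fields, and flags unknown names, wrong types and negative values.

```python
# Expected type per field, taken from the default values in the table above.
EXPECTED_TYPES = {
    "numberOfSubmitters": int,
    "numberOfUniqueGenes": int,
    "numberOfUniqueGenesMatched": int,
    "numberOfUniqueFeatures": int,
    "numberOfCasesWithDiagnosis": int,
    "numberOfCases": int,
    "meanNumberOfGenesPerCase": float,
    "meanNumberOfVariantsPerCase": float,
    "meanNumberOfFeaturesPerCase": float,
    "numberOfRequestsReceived": int,
    "numberOfPotentialMatchesSent": int,
}


def check_metrics(payload: dict) -> list:
    """Return a list of human-readable problems; an empty list means the payload looks fine."""
    metrics = payload.get("metrics")
    if not isinstance(metrics, dict):
        return ['missing top-level "metrics" object']
    problems = []
    for name, value in metrics.items():
        expected = EXPECTED_TYPES.get(name)
        if expected is None:
            problems.append(f"unknown field: {name}")
            continue
        # A mean may arrive as 2 rather than 2.0, so floats also accept ints.
        allowed = (int, float) if expected is float else (int,)
        if isinstance(value, bool) or not isinstance(value, allowed):
            problems.append(f"{name}: expected {expected.__name__}")
        elif value < 0:
            problems.append(f"{name}: must not be negative")
    return problems


# Made-up payload with one wrong type and one unknown field.
sample = {"metrics": {"numberOfCases": 10, "meanNumberOfGenesPerCase": 1.2,
                      "numberOfSubmitters": "3", "somethingElse": 1}}
problems = check_metrics(sample)
print(problems)
```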

Relequestual commented 7 years ago

After a recent DECIPHER meeting, we have agreed that we will be part of this new endpoint and provide stats, with the following condition:

We would need to see a draft of any aggregation or display of data before making such aggregation or displays available to others, to ensure a fair representation for DECIPHER.

Anyone constructing stats that include DECIPHER should bear in mind that approval or feedback may take one to two weeks (though we expect it to be much quicker).

I would expect that after this has been done a few times, we may be able to relax this requirement.

harindra-a commented 7 years ago

Hi Ben, This is awesome news! I think all of your conditions make sense given the complicated nature of the data you carry and the metrics derived from it. The discussion on aggregation can be kicked off at the face-to-face meeting and then given more focus once we get to Phase-2 of metrics later in the year.

Great news!

Best, Harindra

(Sent by email in reply to Ben Hutton's comment of Tue, Jun 6, 2017, 5:28 AM, above.)

Relequestual commented 7 years ago

Thanks Harindra. I'd like to bring this up at the in-person meeting, during a main session where everyone is present, so that all are aware of our current conditions for use of the metrics / stats endpoint. (Maybe add an agenda item?)

I may go as far as to suggest we COULD add a terms field to the JSON, much like the disclaimer for the match endpoint.

harindra-a commented 7 years ago

Hi Ben,

Sure, I think that is a good idea and provides maximum clarity. François is lead there and can confirm, but I recall there was already an agenda item in (at least) both the general and tech sections.

I am certainly open to a "terms/disclaimer" field if others are, and if it adds to the comfort level (of DECIPHER and possibly other future centers) in serving this endpoint.

Best, Harindra

(Sent by email in reply to Ben Hutton's comment of Tue, Jun 6, 2017, 10:05 AM, above.)

fschiettecatte commented 7 years ago

Following the Baltimore MME meeting...

Replace:

meanNumberOfGenesPerCase
meanNumberOfVariantsPerCase
meanNumberOfFeaturesPerCase

With:

totalNumberOfGenes
totalNumberOfVariants
totalNumberOfFeatures

I suggest we add this for symmetry:

numberOfUniqueVariants

Add the date when the metrics were generated, in ISO 8601 format:

dateGenerated

Acceptable values are:

2017-06-27
2017-06-27T17:25:20+00:00
2017-06-27T17:25:20Z
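
For anyone consuming dateGenerated, a minimal Python sketch of parsing the three acceptable forms is below. The function name `parse_date_generated` is just for illustration. It relies only on the standard library; the trailing "Z" is rewritten first because `datetime.fromisoformat` only accepts it from Python 3.11 onwards.

```python
from datetime import datetime


def parse_date_generated(value: str) -> datetime:
    """Parse a dateGenerated value given as a date or a date-time with offset or 'Z'."""
    # Normalise the UTC designator so older Python versions can parse it too.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


for example in ("2017-06-27", "2017-06-27T17:25:20+00:00", "2017-06-27T17:25:20Z"):
    print(example, "->", parse_date_generated(example).isoformat())
```

A date-only value parses to midnight with no timezone attached, so callers comparing timestamps across nodes would need to decide how to treat it.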

Relequestual commented 6 years ago

I'm trying to formalise the work to be done for MME stats on DECIPHER.

I couldn't find in our minutes from the Baltimore meeting whether we had any discussion around the definition of the replacement terms, so I'm raising the following observation here:

We want to know the number of features per case; however, totalNumberOfFeatures is defined as

the number of unique combinations of features available for matching.

By this, I understand that it won't actually count the total number of phenotypic assertions across all patients (that are matchable).

As such, I propose we have numberOfFeatures and numberOfFeatureSets. numberOfFeatures will give us the numerator needed to determine the meanNumberOfFeaturesPerCase, which was the aim.
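
To spell out the arithmetic, a tiny Python sketch follows. The helper `mean_per_case` and the numbers are made up for illustration, and it assumes numberOfCases is the denominator. Returning 0.0 for a node with no cases is also an assumption, chosen to match the default value originally listed for the mean fields.

```python
def mean_per_case(total: int, number_of_cases: int) -> float:
    """Derive a per-case mean from a total and the number of matchable cases."""
    if number_of_cases == 0:
        # Assumption: report 0.0 rather than failing when a node has no cases.
        return 0.0
    return total / number_of_cases


# Hypothetical values a node might report.
metrics = {"numberOfFeatures": 9, "numberOfCases": 4}
mean_features = mean_per_case(metrics["numberOfFeatures"], metrics["numberOfCases"])
print(mean_features)
```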

I've therefore created a JSON schema document in a gist. Please revise / comment as required.

https://gist.github.com/Relequestual/22508a0aeb7e4ec8a8a37a0379b194eb (If you're looking for a nice json editor, try http://jsoneditoronline.org)

I removed the total prefix from the field names as it doesn't fit with the others, and I've rephrased some of the descriptions to make them consistent across the document.

fschiettecatte commented 6 years ago

Indeed, we did not discuss replacement terms; I tossed something out for people to react to.

I am not sure I understand the meaning of feature sets. Would that be submissions with features?

Relequestual commented 6 years ago

OK, let me explain.

Say we have four patients.

1 has phenotypes a, b, c.
2 has phenotypes b, c, d.
3 has phenotypes a, b, c.
4 has no phenotypes.

totalNumberOfFeatures was defined as

the number of unique combinations of features available for matching.

By that definition there are two sets (combinations) of features. While this is interesting, it doesn't give us the value required to determine the mean number of phenotypes per patient, which we want to be able to calculate.

Given the above example data:

numberOfFeatures: 9 - all observations, including duplicates
numberOfUniqueFeatures: 4 - number of unique phenotypes covered by all patients
numberOfFeatureSets: 3 - effectively the number of patients with any phenotypes
numberOfUniqueFeatureSets: 2 - as above but without exact duplicates

Maybe the last two are not required, but I would define numberOfFeatureSets the same as the definition given for totalNumberOfFeatures above, as it says unique combinations.
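
The four counts for the example patients above can be reproduced with a short Python sketch. The `patients` dictionary is just the example data written out; the variable names are illustrative and how a real node stores phenotypes will differ.

```python
# Example data from this comment: patient id -> set of phenotypes.
patients = {
    1: {"a", "b", "c"},
    2: {"b", "c", "d"},
    3: {"a", "b", "c"},
    4: set(),
}

# Only patients with at least one phenotype contribute a feature set.
feature_sets = [frozenset(features) for features in patients.values() if features]

counts = {
    # Every observation, duplicates included.
    "numberOfFeatures": sum(len(features) for features in patients.values()),
    # Distinct phenotypes across all patients.
    "numberOfUniqueFeatures": len(set().union(*patients.values())),
    # Patients that have any phenotypes at all.
    "numberOfFeatureSets": len(feature_sets),
    # Distinct combinations, with exact duplicates collapsed.
    "numberOfUniqueFeatureSets": len(set(feature_sets)),
}
print(counts)
```

The last count needs every feature set to be compared against the others, which is the extra work the first three avoid.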

fschiettecatte commented 6 years ago

Got it, I am not sure how useful the last one is (numberOfUniqueFeatureSets) because phenotypes are subjective (to a degree), but I have no problem leaving it in.

Relequestual commented 6 years ago

I agree, I'm not sure it's useful either. Maybe that wasn't the intention of the field description, but that's how I would read it.

Relequestual commented 6 years ago

Agreed on MME call 2017/08/01 that we would remove numberOfUniqueFeatureSets, as it's computationally more complex to work out, and doesn't seem to hold any value.

harindra-a commented 6 years ago

I am fine with removing numberOfUniqueFeatureSets, let's take it out.

My original goal was to get a basic first infrastructure in place, supporting fields that all of us are comfortable with. Past that milestone, adding new fields incrementally as needed should be a lot easier, so I am definitely on board with taking out fields that folks feel are more trouble than they are worth.

ga4gh / mme-apis

/metrics API endpoint: Automated aggregation of Matchmaker Exchange network operational statistics #140
