ga4gh / va-spec

An information model for representing variant annotations.

Competency Questions for Evidence/Provenance Modeling #27

Open mbrush opened 5 years ago

mbrush commented 5 years ago

One of the big modeling tasks we have not yet discussed in detail is representing the evidence and provenance information supporting a given variant annotation statement. We superficially considered and documented some requirements in this space as part of the VA type requirements effort here. What Javi and I would like to do next is collect a rich set of competency questions (CQs) for each VA type, which will be used to inform its evidence and provenance model.
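To make the target of this modeling concrete, here is a minimal Python sketch of the kind of structure in question: an annotation assertion carried alongside the evidence and provenance behind it. This is purely illustrative; all class and field names below are invented for this example and are not drawn from the VA-Spec model.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Evidence:
    # One piece of evidence supporting a statement (illustrative fields only)
    source: str        # e.g. a publication ID or dataset accession
    description: str   # what the evidence shows
    strength: str      # e.g. "strong", "moderate", "weak"

@dataclass
class VariantAnnotationStatement:
    # A variant annotation assertion plus the evidence/provenance behind it
    variant: str               # identifier for the annotated variant
    assertion: str             # e.g. "pathogenic for hereditary breast cancer"
    agent: str                 # who made the assertion (provenance)
    date_asserted: str         # when it was made (provenance)
    method: str                # how it was produced, e.g. "ACMG guidelines"
    evidence: List[Evidence] = field(default_factory=list)
```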

Here we would like to ask the 'owners' and 'supporters' of each VA type who led the initial requirements work to also help with these CQ efforts, drawing on the expertise they have accrued for their VA types and their familiarity with the data and the needs of their users. An updated list of VA types and owners/supporters for this task is here.

We anticipate this work will take just 1-2 hours per VA type, and we will provide details and assistance on upcoming calls. Please respond here or email the VA list if you have any questions, suggestions, or concerns. Thanks all!

mbrush commented 5 years ago

A bit more on CQs:

For some examples, see the CQ bank we assembled here for a project about modeling temporal aspects of cancer (CQs are in Section III), or the document here for a project about BRCA variant pathogenicity interpretation modeling.

Please keep in mind that for our task we are specifically after CQs related to the evidence and provenance (E/P) information behind an annotation/assertion. The docs linked above may contain some examples of such CQs, but many of their CQs are unrelated to E/P.

mbrush commented 5 years ago

It may help to think of three high-level categories of CQs:

  1. Discovery CQs are simple queries that directly return data from a dataset, requiring no calculation or analysis. These aim to return annotations with specific features ("Find annotations that..."). Here we consider the perspective of a user searching for annotations of a given type, and what aspects of evidence or provenance they would want to search/facet/filter on.

  2. Descriptive CQs are also simple, but aim to return specified features of a known annotation ("For this particular annotation, what is its...?"). Here we consider the perspective of a user looking to use a particular annotation they have in hand, and what aspects of evidence or provenance would help them understand, trust, and apply it appropriately.

  3. Analysis CQs present more complex research questions and use cases that require calculation, statistical analysis, or other methods to be applied to the data to generate an answer (e.g. "Researchers from what institution have provided the most publications cited as evidence for pathogenic BRCA2 interpretations over the past 10 years?").

The CQ corpus here organizes its queries according to these three categories.
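To make the mechanical differences between these categories concrete, here is a rough Python sketch continuing the hypothetical VariantAnnotationStatement/Evidence classes from the earlier comment. All identifiers, variant names, and values below are invented for illustration; none come from a real dataset.

```python
from collections import Counter

# Continuing the earlier sketch: a toy dataset of invented records.
statements = [
    VariantAnnotationStatement(
        variant="example-variant-1",
        assertion="pathogenic for hereditary breast cancer",
        agent="ExampleLab",
        date_asserted="2019-06-01",
        method="ACMG guidelines",
        evidence=[Evidence("PMID:00000001", "segregation data", "strong"),
                  Evidence("PMID:00000002", "functional assay", "moderate")],
    ),
]

# 1. Discovery CQ ("Find annotations that..."): a simple filter, e.g.
#    annotations asserted under ACMG guidelines citing >= 2 evidence items.
discovered = [s for s in statements
              if s.method == "ACMG guidelines" and len(s.evidence) >= 2]

# 2. Descriptive CQ ("For this particular annotation, what is its...?"):
#    a lookup on one known record: who asserted it, when, on what evidence.
s = statements[0]
print(s.agent, s.date_asserted, [e.source for e in s.evidence])

# 3. Analysis CQ: requires aggregation across the dataset, not just
#    filtering, e.g. which agent has made the most pathogenic assertions.
counts = Counter(s.agent for s in statements
                 if "pathogenic" in s.assertion.lower())
print(counts.most_common(1))
```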

For our purposes, Discovery and Descriptive CQs are easier to produce and should be the focus of this effort, but please also report any Analysis CQs you come up with.