ga4gh / va-spec

An information model for representing variant annotations.
Capturing unspecified conditions in Variant Pathogenicity Interpretations #25

Open mbrush opened 5 years ago

mbrush commented 5 years ago

This issue arose in the ticket here: https://github.com/ga4gh-gks/variant-annotation-model/issues/22#issuecomment-452561137. In pathogenicity interpretation sources such as ClinVar, it is common for the disease/condition slot to be blank, or populated with values like 'not specified' or 'not provided'. In such resources, these values are often used inconsistently and not in line with their intended meaning. Steven Harrison and Larry Babb gave a nice overview of the issues and implications of this on the Jan 16 VA call (see minutes here).

There are many long standing and complex issues related to curation workflows and source data representation. We are not going to solve these for the community, but we can provide feedback/requirements based on our perspective and use cases. What we must do for our work is decide how our model will specify the value of slots requiring a condition when a source record does not provide one (e.g. the 'descriptor' slot of variant pathogenicity interpretations).

On the Jan 16 call we debated the merits of annotating to a root disease term like "Disease" in cases when no condition is provided. We realized that this is problematic in that the meaning of such an annotation would need to be different for variants classified as benign (where a blank value in the source data implies that the variant is benign for ALL diseases), versus pathogenic (where a blank value in the source data likely indicates that the variant is pathogenic for some disease, but not ALL diseases).

Additionally, a problem with providing values like 'not specified', 'not provided', or allowing a general 'Disease' root as values for this slot is that eventually people will not use it as intended - and this will be more misleading than anything else. Given that these values are used inconsistently in sources like ClinVar, we don’t want to add credibility to bad data by giving the user a sense that it is clearer or more reliable than it actually is. For example, if we map blank slots in source data to a generic/root 'Disease' term in our data, we mask the underlying ambiguity of the source annotation, and may give the user unwarranted confidence in the data.

We should consider use cases and CQs we need to support, and the potential dangers of the different choices available to us, and decide how to handle these cases in our specification.

@larrybabb @cbizon (would love to tag Steven Harrison as well if anyone has his handle)

mbrush commented 5 years ago

Given the considerations above, I would avoid use of a generic 'Disease' term, as it risks making assumptions that are not valid, and overstating our confidence in the data. Instead, allowing the descriptor to be blank seems the 'safest' choice. Our documentation would have to detail how users should interpret a blank disease/condition value, and the various reasons this might appear in the data. And the model would have to provide adequate provenance information for the user to find and evaluate the source of this annotation for themselves to decide if/how they want to use it. So, while I don’t like the idea of the descriptor not being a required field in our model, I cannot think of a better solution.

On a related note, our documentation should include guidelines on how to interpret annotations to 'grouping' disease terms (e.g. Rasopathies) for both benign and pathogenic classifications. For example, an annotation classifying a variant as benign for Rasopathies likely means it is benign for all subtypes, whereas a pathogenic classification of a variant for Rasopathies should be interpreted to mean that there is evidence related to multiple subtypes, and the curator selected the more general term that covers them. It does not necessarily mean that the variant is pathogenic for ALL of its subtypes.

For an example, see the ClinVar record here about Rasopathies, as described in the MedGen hierarchy here.

larrybabb commented 5 years ago

@sharriso is Steven's handle. @sharriso please verify @mbrush notes above and comment if needed.

sharriso commented 5 years ago

Agree. If we pick a generic "Disease" term it will likely be something a lab selects at the time of submitting to ClinVar or sharing the data externally - as opposed to being a term the lab actually incorporated into their interpretation internally (in which case the term really is the same as saying nothing). I have been asked to generate a document for ClinGen Expert Panels regarding how to select their interpretation disease for all classification types (P, LP, VUS, LB, B) which could help here as well - because you have a great point that saying pathogenic for a general term (like RASopathy) means pathogenic for some disease under that general term while saying benign for a general term means benign for everything under it. We are going to advise ClinGen groups to be as specific as possible on the Pathogenic side, while being as broad as possible on the Benign side.

mbrush commented 5 years ago

Thanks Steven - I think the document you are creating for ClinGen Expert Panels would be very useful for us to see as well, when it is in a state to share.

mbrush commented 4 years ago

After the August 7 call, I am reconsidering the idea of requiring a value for the Condition descriptor in Pathogenicity Interpretations, but providing a controlled set of terms to use when no condition is indicated (e.g. 'condition not specified'). The earlier argument against such condition values is that "eventually they will be misused" - but I don’t see this to be as much of an issue as initially thought.

Rather, I think that being explicit about a condition not being specified will make the data clearer. If we allow for blank/null values, it may be unclear whether a field is empty because the provider forgot to populate it, or because they intentionally left it out. And it will be harder to document the semantics/interpretation of a blank value if we don't define controlled terms with specific meaning to handle cases where no condition is specified.

The downside here is that there is existing data with null values, but it should be trivial to map these to a 'not specified' value during data transformation to the VA spec. Our job would then be to enumerate/define all scenarios where a condition may not be provided, and the possible interpretations of these scenarios. Once we do this, we can give names and definitions to terms to handle each. Our landscape and requirements analysis provides a nice start here:




Again, I would argue that we add value and clarity by defining and clearly documenting terms to distinguish at least some of these scenarios - even at the risk that they may be misused. We should consider how many terms to define and how specific to make them. I would propose keeping it simple and defining only a few generic terms initially, even if that is only a single 'NOT SPECIFIED' term.
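To illustrate the mapping step described above, here is a minimal Python sketch of how a transformation might replace a null or blank source condition with an explicit term. The names `normalize_condition` and `ConditionValue`, and the exact term label, are assumptions made for illustration; they are not part of the VA spec or of any ClinVar API. The raw source value is kept alongside the mapped term so the transformation does not hide what the source actually said.

```python
from typing import NamedTuple, Optional

# Hypothetical controlled term; the actual label is still under discussion.
NOT_SPECIFIED = "NOT SPECIFIED"


class ConditionValue(NamedTuple):
    term: str                    # value stored in the condition slot
    source_value: Optional[str]  # raw value from the source record, kept for provenance
    was_missing: bool            # True when the source gave no condition


def normalize_condition(raw: Optional[str]) -> ConditionValue:
    """Map a null or blank source condition to an explicit controlled term."""
    if raw is None or not raw.strip():
        return ConditionValue(NOT_SPECIFIED, raw, True)
    # A real condition was given: keep it, trimmed of surrounding whitespace.
    return ConditionValue(raw.strip(), raw, False)
```

With this shape, a consumer can still tell a value that was missing in the source apart from one the provider entered deliberately, by checking `was_missing` and `source_value`.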

AmandaSpurdle commented 4 years ago

RE: "After the August 7 call, I am reconsidering the idea of requiring a value for the Condition descriptor in Pathogenicity Interpretations, but providing a controlled set of terms to use when no condition is indicated (e.g. condition not specified). The earlier argument against such condition values is that "eventually they will be misused" - but I don’t see this as an issue." Pathogenicity interpretation HAS to be against a given disease condition. There may well be instances where a variant is identified in, e.g., a colorectal cancer patient but that variant does not underlie their disease predisposition. So if curated in that patient, it should be clear whether the finding explains the disease or is a secondary finding.

AmandaSpurdle commented 4 years ago

Next - it is entirely possible that both condition and assertion might be blank - this is the equivalent of 'not assessed'. If blank fields are a problem then 'not assessed' could be used as the term to fill in the blanks. We in the BRCA world have tried to get people to understand the difference between unknown pathogenicity assertion (not yet reviewed the evidence) and uncertain pathogenicity (reviewed the evidence and not sure if pathogenic or benign for a variety of reasons). However, this recommendation for terminology remains poorly used, so I think it is better to be more explicit as per the suggestions above.

AmandaSpurdle commented 4 years ago

It is possible to use prior knowledge about the gene-disease relationship to curate a variant even if you know nothing about the presentation of the person it was found in - that is what incidental/secondary findings are all about. E.g. a truncating BRCA1 variant in an unaffected person or paediatric patient may or may not alter their treatment but has implications for family management. Further, if the point is to do geno-pheno discovery we should be encouraging maximal reporting of what we know. So that would be a different scenario again. It is for this reason - to separate knowledge of general variant pathogenicity from its relationship to the presenting condition - that we in ENIGMA are proposing multi-tier reporting (paper published recently: Towards controlled terminology for reporting germline cancer susceptibility variants: an ENIGMA report. J Med Genet. 2019 Jun;56(6):347-357. doi: 10.1136/jmedgenet-2018-105872. Epub 2019 Apr 8. PMID: 30962250).

AmandaSpurdle commented 4 years ago

I am still struggling with the concept of the term variant oncogenicity. I would strongly suggest that this is something that might be discussed in concert with ClinVar/Gen, since they are doing a lot of work on trying to streamline terms (including for genetic risk factors identified by GWAS or moderate risk alleles for which there is varying opinion regarding clinical utility).

mbrush commented 4 years ago

The description of the MedGen term for a 'not provided' condition is interesting - and pertinent in particular to the ClinVar use case.

See https://www.ncbi.nlm.nih.gov/medgen/CN517202: "The term 'not provided' is registered in MedGen to support identification of submissions to ClinVar for which no condition was named when assessing the variant. 'not provided' differs from 'not specified', which is used when a variant is asserted to be benign, likely benign, or of uncertain significance for conditions that have not been specified."

mbrush commented 3 years ago

Feb 2021 Update - We are preparing to make a v0 release of the Variant Pathogenicity Statement, and need a decision here.

@larrybabb @ahwagner @javild others . . . please read this issue and weigh in with any recommendations.

@sharriso - as the most expert on this question, it would be great if you could weigh in at this point, with your current perspective / recommendations.

ahwagner commented 3 years ago

From the VICC perspective, Disease is always required, and we simply would not emit predisposing or oncogenicity statements lacking a disease context. If we were trying to validate pathogenicity statements made in ClinVar for use in MetaKB or other downstream applications, I can maybe see some utility if a variant is pathogenic but the disease not specified–but I believe such evidence would be very weak without an associated disease. I have no strong feelings about making the field optional or required.

If we make it required, I think backfilling the disease concepts from ClinVar with "Not Provided" as suggested above is a good solution.

AmandaSpurdle commented 3 years ago

It really would be good to get feedback from Steven Harrison. All I know is that disease in some form is required for submissions to ClinVar, so it makes sense to me to require it for any pathogenicity statement for a variant.

larrybabb commented 3 years ago

@mbrush @AmandaSpurdle can someone clearly state exactly what question(s) need a decision here? I think @sharriso and I would be able to help out with the ClinVar perspective if it was super clear.

In general, we are limited to conveying and re-representing the submission in ClinVar. We cannot presume what the submitter meant; we can only take the data as it is and restructure it into a SEPIO statement. If that statement has constraints that do not allow that, then we will have to do something custom. Ideally, we'd like all the data in ClinVar to be supportable in VA, but we can accept that we may need to create specialized forms if ClinVar's data is not reasonably transformable.

mbrush commented 3 years ago

@sharriso. Writing here to clarify the specific question we are asking, and context in which we are asking it. For full context, feel free to review the long thread above.

The issue debates how to capture data when there is no disease specified in a variant Pathogenicity Classification/statement. One approach is to allow the field for this data to be empty. Another is to provide one or more controlled terms to populate the field and be explicit about its emptiness. Minimally a 'disease not specified' term - but possibly a larger set of more specific terms that capture the reason for the missing value (unknown, not provided, inclusive of all diseases, etc.)

As discussed above, an empty disease field may have different meaning depending on whether the variant is classified as P, B, or VUS - and the acceptability of a missing disease may be different for B vs P vs VUS interpretations as well.

Importantly, we want to have a clear mapping from data in existing KBs like ClinVar, to our model. And we don't want our model to change the meaning or suggest more knowledge/confidence than we may have.

Whatever approach we choose, we will be sure to provide supporting documentation to recommend when and how to populate this field for different types of classifications (P, B, VUS), and how to interpret a missing value or an explicit 'not specified' term in the disease slot in these different contexts.

As for our specific ask, you had previously indicated that you were working on a document for ClinGen Expert Panels regarding how to select their interpretation disease for all classification types - and that this work, along with your general expertise in this area, might help inform our decision here. Can you provide:

  1. a recommendation about whether you think requiring an explicit term indicating that a disease is not specified is a good idea
  2. if so, what to call it? and should we consider additional terms as described above?
  3. what considerations should we keep in mind in making this decision?

Thanks - and feel free to respond here, or jump on a VA call soon if it is more efficient to discuss in real time.

sharriso commented 3 years ago

@mbrush - thanks for clarifying! And happy to join a VA call soon if that would be helpful.

I do agree that a disease term entity should be required for all assertions of pathogenicity...but with a broad approach to what this disease term entity can be. Ideally it would always be a real disease term but that won't always be the case. So I like "not specified" for LB/B classifications when you want to assert the variant is not disease causing with regards to any diseases. And "not provided" for P/LP classifications because in those cases, the person doing the classification does have an idea of what disease the variant causes, but that term or disease ID is not saved in a structured way. However I can also see terms like "See cases" being applicable for CNVs given that most CNVs are private and so there is often not a disease term that really covers the entire spectrum of phenotypes seen in that case. For these, sometimes ClinVar submitters will submit HPO terms in the Clinical Features field and then indicate "See cases" as the disease term.

mbrush commented 3 years ago

Thanks Steven. I am inclined to make the field required for our initial v0 release, and provide the two terms that aim to explicitly capture the meaning/distinctions you highlighted above for not specified vs not provided. The goal here is to make selection of the right term easy, and avoid 'over-interpretation' of source data that gets transformed into our spec. I think the two terms below achieve this. We can release them and see what issues are raised by users of the spec.

In summary, the 'not provided' term is safe to use for any Pathogenicity Statement. The 'not specified' term should be used only for benign/likely benign classifications, when we are confident that the submitter meant this to apply for any condition.
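The usage rule summarized above can be sketched as a small check. This is only an illustration under stated assumptions: the function name `placeholder_term_allowed`, the term strings, and the classification codes ("B", "LB", "P", and so on) are hypothetical and not defined by the spec.

```python
# Hypothetical term labels and classification codes, for illustration only.
NOT_PROVIDED = "not provided"
NOT_SPECIFIED = "not specified"
BENIGN_CLASSES = {"B", "LB"}  # benign, likely benign


def placeholder_term_allowed(term: str, classification: str) -> bool:
    """Return True if a placeholder condition term may be used with a classification."""
    if term == NOT_PROVIDED:
        # Safe for any Pathogenicity Statement.
        return True
    if term == NOT_SPECIFIED:
        # Only for benign / likely benign classifications.
        return classification in BENIGN_CLASSES
    raise ValueError(f"not a placeholder term: {term!r}")
```

Note that a check like this covers only the classification constraint. Whether the submitter really meant a benign call to apply to any condition is a curation judgment that code cannot verify.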

If we want to retain "not specified" in the label of the first term (e.g. because the difference between 'not specified' and 'not provided', as used in ClinVar, is broadly understood in the genetics community), we can change it to read 'condition not specified for a benign classification', or even just 'not specified' - and rely on documentation to be clear about what this means.

Finally, re: Steven's comment about "see cases" when the condition is a constellation of phenotypes observed in a particular patient, rather than a named disease. Here, our 'Genetic Condition' model is able to define a condition in terms of a set of phenotypes. So this feature of the model can be used to represent the condition in such cases (instead of resorting to a "see cases" term).

@sharriso @larrybabb @ahwagner @javild - comment if you have any feedback or suggestions.

larrybabb commented 3 years ago

@mbrush great information and ideas above. Here's my take.

RE: 'all conditions, for a benign classification' - We have wrestled for many years (and continue to) with using this concept. Here's some of the complexity that will not be easily resolvable.

  1. How do you know when a user meant this?
  2. What is meant by 'all conditions'? We actually renamed the MedGen contrived term 'All Highly Penetrant Conditions' to 'not specified' because some folks didn't want or like the original term and decided that it needed to be a bit more ambiguous in order to meet the reality of how uncontrolled data capture is shared and submitted to ClinVar. So 'not specified' fit the bill.
  3. VUS also (sometimes) uses 'not specified', so replacing that with 'all conditions, for a benign classification' would not make sense. One might find that using the label 'all mendelian conditions' for a 'not clinically significant' classification is a preferred label.

I'm sure there's even more history behind this complex concern.

At this point I think it is best for the GKS standards folks to step back and realize that our job is not to standardize the nuances and evolving terminologies and complexities of every statement type. At least not to the point where we are not delivering core and critical solutions that are reasonable yet imperfect.

So, I suggest that we consider the following....

  1. We back away from the GeneticCondition concept for a moment and reconsider what we can and can't control.
  2. Recognize that the ideal solution would be to have a "disease" or "condition" authority that would provide all the possible values to cover all the possible scenarios that our VA VarPath Classification and other statement types required for the concept of diseases.
  3. Since there will ALWAYS be disease terms that are not provided by an asserter, not specified by an asserter, or not defined by an authority (making it impossible for an asserter to specify and provide them), provide a single disease term that represents 'not provided'.
  4. Since there are many reasons (some of which we may not yet be able to enumerate) why something isn't provided, leave it to the various implementers to use specializations and descriptors in their assertions to clarify the reasons, makeup or set of values that may shed light on why it wasn't provided and what the non-provided disease might be.

This approach, while not ideal, is quite reasonable IMO, because it solves a nasty problem with a very basic solution that allows us to move past it while giving users the freedom to describe these 'not provided' disease scenarios as they see fit.

mbrush commented 3 years ago

On the 3-31-21 VA Call we decided to go with a single 'not provided' term. Documentation/IG will make it clear how implementations might extend this value set with more specific terms, as per the comments above. It was even suggested that a Descriptor object may be the best place for capturing these extended semantics (once we begin treating Conditions as value objects). We can revisit this issue for v1, based on what we learn from DP testing / feedback.
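As a rough sketch of what this decision could look like in an implementation: the condition is a required field, a single 'not provided' term fills it when the source names no condition, and any extra detail lives in an implementation-specific descriptor. The class names, field names, and descriptor keys below are assumptions for illustration and are not taken from the VA spec.

```python
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical identifier for the single placeholder term agreed on the call.
NOT_PROVIDED = "not provided"


@dataclass(frozen=True)
class Condition:
    term: str
    # Implementation-specific detail, e.g. why no condition was given.
    descriptor: Dict[str, str] = field(default_factory=dict)


@dataclass(frozen=True)
class VariantPathogenicityStatement:
    variant: str
    classification: str
    condition: Condition  # required: there is no default value

    def __post_init__(self) -> None:
        # Reject blank terms so a missing condition is always stated explicitly.
        if not self.condition.term.strip():
            raise ValueError(
                "condition is required; use NOT_PROVIDED if the source gave none"
            )


# Example: a source record with a blank condition slot.
stmt = VariantPathogenicityStatement(
    variant="variant-1",
    classification="P",
    condition=Condition(NOT_PROVIDED, {"reason": "blank in source record"}),
)
```

Keeping the reason in a free-form descriptor, rather than in additional controlled terms, mirrors the suggestion that implementers describe their own 'not provided' scenarios until the value set is revisited for v1.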