ga4gh / ga4gh-schemas

Models and APIs for Genomic data. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0

Review terminology for predicted molecular impact #516

Open sarahhunt opened 8 years ago

sarahhunt commented 8 years ago

Discussion of the variant annotation documentation revealed concerns and potential confusion about the current classification of predicted molecular impact and in particular the term 'MODIFIER' to describe variants with no predicted impact on protein function.

This issue records the need to address these concerns and seek a preferred term.

majensen commented 8 years ago

I wanted to +1 @andrewjesaitis's view, expressed on the call today, that the schema can represent "known" unambiguous facts from SO regarding variants, which can be grouped in any desired way at the application level. Also to +1 the ability nevertheless to store in the schema groupings that are commonly used or have other evidence or support (along with the provenance of the groupings). The concern I would have, and that was mentioned in the call, is that if the putative effects are baked into the schema without caveat, then that might be interpreted as an endorsement by GA4GH.

One razor the group could apply: for any schema-derived assertion, would a citation of "Variant Annotation schema, GA4GH" be acceptable support to you as a reviewer of a paper? "This is a frameshift variant" would probably fly, but "This is a variant of high effect in disease X" would have me looking for more evidence.

gaberudy commented 8 years ago

I also +1 @reece's comment about simplifying as much as possible.

My proposal on the call was to change the impact enum {HIGH, MODERATE, LOW, MODIFIER} to a sequenceOntologies array<enum> { <all_so_terms> } as that provides an external authority to define the meaning of this term.

Deanne mentioned these annotations should have the program's provenance provided, but I believe that is already the case, as TranscriptEffect is provided as a part of VariantAnnotation, which references the variantAnnotationSetId containing the "Analysis" meta-data that can include the program name, version and parameters.
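
The lookup described above can be sketched as follows. This is a minimal illustration using plain dictionaries. The nesting, the "analysis" key, the example IDs and the tool name are assumptions made for the sketch, not the actual GA4GH record definitions.

```python
# Hypothetical in-memory stand-ins for annotation sets, keyed by set ID.
# The "analysis" entry plays the role of the "Analysis" meta-data in the text.
annotation_sets = {
    "vas-1": {
        "id": "vas-1",
        "analysis": {
            "name": "example-annotator",  # made-up program name
            "version": "1.0",
            "parameters": {"assembly": "GRCh37"},
        },
    },
}

# A variant annotation that carries transcript effects and points at its set.
variant_annotation = {
    "variantAnnotationSetId": "vas-1",
    "transcriptEffects": [{"effects": ["frameshift_variant"]}],
}


def provenance_for(annotation, sets):
    """Follow variantAnnotationSetId to the analysis meta-data of the owning set."""
    return sets[annotation["variantAnnotationSetId"]]["analysis"]


analysis = provenance_for(variant_annotation, annotation_sets)
print(analysis["name"], analysis["version"])
```

Under these assumptions, every transcript effect inherits its provenance from the set it belongs to, so nothing extra needs to be stored per effect.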

While users are used to asking for "High" impact or "Loss of Function" variants from analysis tools, the goal of this API is to provide a data representation of called variants and community-driven attributes that can be algorithmically assigned to them, such as HGVS names and Sequence Ontology terms. Since the SO site (http://www.sequenceontology.org/browser/obob.cgi) does not provide these groups as part of the standard, and it's easy to provide them as a grouping of SO terms at an application level, why place them in the schema?

pcingola commented 8 years ago

Correct me if I'm wrong, but the main idea of "putative impact" is to create a simple categorization for fast filtering purposes, with sensible defaults. It is true that people can use an enumeration of SO terms, but that becomes inconvenient very soon as the filter expressions grow. Removing this category would reduce the schema at the cost of increasing the work for everyone else trying to filter variants. The documentation is clear on what this is and its limitations, so I'm not sure why Mark mentions "baked into the schema without caveat" (you are welcome to change the wording to make it even clearer). I don't agree with the comment that we should not incorporate into the schema things that are not in HGVS or SO. Both standards are evolving but still have issues; we should be able to do better for specific aspects that relate to common use cases.
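
To make the trade-off concrete, here is a minimal sketch comparing a filter that enumerates SO terms with one that uses a category shorthand. The grouping table is an illustrative subset that follows conventions used by common annotation tools. It is an assumption for this example, not something defined by SO or by the GA4GH schema, and the function and field names are hypothetical.

```python
# Illustrative subset only; NOT defined by SO or by the GA4GH schema.
IMPACT_GROUPS = {
    "HIGH": {"frameshift_variant", "stop_gained", "splice_donor_variant"},
    "MODERATE": {"missense_variant", "inframe_deletion"},
    "LOW": {"synonymous_variant"},
    "MODIFIER": {"intron_variant", "intergenic_variant"},
}


def matches_terms(effect_terms, wanted_terms):
    """True if at least one of the variant's SO terms is in wanted_terms."""
    return not set(wanted_terms).isdisjoint(effect_terms)


def matches_impact(effect_terms, impact):
    """Same test, with the SO terms looked up from the category name."""
    return matches_terms(effect_terms, IMPACT_GROUPS[impact])


# Made-up annotations, each with a list of SO effect terms.
annotations = [
    {"id": "v1", "effects": ["frameshift_variant"]},
    {"id": "v2", "effects": ["synonymous_variant"]},
    {"id": "v3", "effects": ["intron_variant", "splice_donor_variant"]},
]

# Filter written as an explicit enumeration of SO terms.
explicit = [
    a["id"]
    for a in annotations
    if matches_terms(
        a["effects"],
        ["frameshift_variant", "stop_gained", "splice_donor_variant"],
    )
]

# The same filter written with the category shorthand.
shorthand = [a["id"] for a in annotations if matches_impact(a["effects"], "HIGH")]

print(explicit == shorthand)
```

Both filters select the same variants. The difference is only where the list of terms lives: in every query, or in one shared table.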

gaberudy commented 8 years ago

Thanks Pablo. This brings up something I struggle to understand with GA4GH.

You mentioned "common use cases " and supporting "filtering variants".

From what I understand, these schemas are about data interchange, not about being a direct backend for variant analysis, interpretation or filtering. Am I wrong?

If the users of the schemas are programmers building their own application abstractions, then why roll up SO terms into something like this, as opposed to letting that be the application developer's abstraction? Who cares if that requires an internal filter that enumerates a dozen SO terms? That's abstracted away from the end-user by the application's design choices, no?

Or are these schemas going to be directly exposed to "end-users" through direct searching/filtering APIs? If so, where is the specification of those use cases that can be referenced to guide the design?

I struggle with this because there are plenty of comments in the API documentation, like these in Unresolved Issues (http://ga4gh-schemas.readthedocs.org/en/latest/api/apigoals_intro.html#unresolved-issues), that also imply the use cases for these APIs are not well defined or understood.

How can these types of choices, which are really "design" choices, be made without understanding the use cases/primary motivations to guide the judgement call or build an opinionated stance about what should or should not be baked into the schema?

majensen commented 8 years ago

@pcingola Fine, better to say "incorporated into the schema without provenance or reference also incorporated into the schema". The name of the grouping is additional information; High, Medium, Low aren't random tokens. If the schema is meant to be a standard, then the information in it presumably has either a source (like an ontology), a definition ("this is a convenience grouping for speakers of English"), or is an axiom, i.e., "by authority of GA4GH Variant Annotation WG". I suppose my suggestion is to consider the ideal of associating one of source, definition, or "is an axiom" with any item of information expressed in the schema, and to do this as part of the schema, not of documentation.
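
One way this suggestion might look in practice is sketched below, assuming a grouping record that must state its basis alongside its terms. The class and field names are hypothetical and do not come from the GA4GH schema. The source URL is the SnpEff document cited elsewhere in this thread, and the SO terms are an illustrative subset.

```python
from dataclasses import dataclass
from enum import Enum


class Basis(Enum):
    """The three kinds of support named in the comment above."""
    SOURCE = "source"          # backed by an external reference such as an ontology
    DEFINITION = "definition"  # a stated convenience definition
    AXIOM = "axiom"            # asserted by authority of the working group


@dataclass(frozen=True)
class Grouping:
    name: str
    so_terms: frozenset
    basis: Basis
    basis_detail: str  # URL, definition text, or the asserting authority

    def __post_init__(self):
        # Refuse groupings that do not say where they come from.
        if not self.basis_detail.strip():
            raise ValueError(f"grouping {self.name!r} needs a basis_detail")


high = Grouping(
    name="HIGH",
    so_terms=frozenset({"frameshift_variant", "stop_gained"}),  # illustrative subset
    basis=Basis.SOURCE,
    basis_detail="http://snpeff.sourceforge.net/VCFannotationformat_v1.0.pdf",
)
print(high.name, high.basis.value)
```

Because the basis travels with the record rather than living in documentation, a client receiving the grouping can tell whether it rests on a reference, a definition, or an assertion.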

pcingola commented 8 years ago

I don't think that "these schemas are about data interchange" only. There were a few talks about creating query languages, which would be more than pure data exchange.

The problem I see with leaving this to an application-level abstraction is that everyone will end up creating their own version of it. If you think these categories are not perfect, wait until you see the mess when each user creates their own categories (after all, those were the real-world use cases that led us to develop the "putative impact" field). Again, this simple categorization may be a good default for the 95% of people who don't want to create their own. Sophisticated users, such as yourself, can ignore these categories and create your own application-level abstraction.

I'm not sure why you mention "without understanding the use cases/primary motivations". I think there is plenty of experience in the group to understand them (maybe I misunderstood it).

pcingola commented 8 years ago

@majensen here is the source for provenance: http://snpeff.sourceforge.net/VCFannotationformat_v1.0.pdf

(not ideal, but better than nothing)

gaberudy commented 8 years ago

@pcingola if this enum is intended to represent the VCF ANN data and it's agreed that it is a useful first-class annotation to place in the schema, then it makes sense to just reference your PDF link above and close the debate on the naming of MODIFIER.

To follow up on a couple points. I see RPC APIs in the spec for looking up variant sets / call sets by meta-data, but not returning filtered results based on these types of annotations. So I am still under the impression that these APIs are more about data interchange than browser/querying use cases (like say the ExAC browser, or UCSC browser).

That's what I mean by the use cases / motivations. APIs are designed for application developers, not users, so I'm thinking about these discussions in terms of how I would, as an application developer, write a GA4GH-compliant API wrapper around my own application's data stores and write client code to access/transfer data from other GA4GH API endpoints into my applications.

From a pure data transfer perspective, all annotations are superfluous, since they should be re-computable at any time from the variant definition. So in that sense the decision here is just about what darn useful annotations make sense to carve out representations for in the schema.

keilbeck commented 8 years ago

Would you like to manage the groupings of effect terms via the SO site/github?