ga4gh / ga4gh-schemas

Models and APIs for Genomic data. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0

Variant Annotation branch - initial API draft #290

Closed sarahhunt closed 8 years ago

sarahhunt commented 9 years ago

Comments are invited on the initial draft of the variant annotation API

https://github.com/ga4gh/schemas/tree/variation_annotation

adamnovak commented 9 years ago

Here are my comments.

  1. It looks like there are a few places (like SearchVariantsRequest) where strings are used but OntologyTerms might be more appropriate. (See the sketch after this list.)
  2. Should SearchAllelesRequest be modified like SearchVariantsRequest is? Or, more generally, are Alleles going to be annotated with effects, or only entire Variants?
  3. Shouldn't AnnotateVariantsRequest take variant IDs as opposed to entire variant objects? If the idea is that you are sending in new Variants that don't yet exist on the server, wouldn't you also need to be able to send the new Alleles that they might reference (since a Variant in graph mode is no longer required to contain any sequence data itself)?
  4. How compatible is this with the g2p work in #268? It seems like there is some overlap with respect to annotating the effects of variants.
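
For point 1, a minimal sketch of what an OntologyTerm-typed filter might look like (the field and type names here are illustrative assumptions, not the draft schema):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class OntologyTerm:
    """A term from a controlled vocabulary, e.g. the Sequence Ontology."""
    id: str                      # e.g. "SO:0001583"
    term: str                    # e.g. "missense_variant"
    source_name: Optional[str] = None

@dataclass
class SearchVariantsRequest:
    variant_set_id: str
    reference_name: str
    start: int
    end: int
    # Filtering on typed OntologyTerms rather than free-form strings keeps
    # queries unambiguous across servers (point 1 above).
    effects: List[OntologyTerm] = field(default_factory=list)
```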

sarahhunt commented 9 years ago

Thanks for the comments!

Re 1. Thanks, this is updated now - it was an oversight.

Re 2. Annotation is allele-specific, so we intend to update the SearchAllelesRequest and Allele records, but I was not sure they were stable yet.

Re 3. The idea behind this method is that data extracted from one GA4GH server can be annotated using a different one, so yes - an AnnotateAllelesRequest will be needed.

Re 4. Someone from the G2P group may like to comment, but as I see it G2P seeks to describe previously observed associations between features (obviously including variants) and phenotypes, while these changes describe predicted outcomes when the variants are compared to other features. G2P FeaturePhenotypeAssociations will be added to the Annotation record when they are stable, as we will want to report associations on overlapping/co-located variants.

(Edited as my initial response was impressively mangled.)

dglazer commented 9 years ago

I think there are two things to discuss here -- the overall model for how annotations fit into the API, and the details of what fields are in an individual annotation. This comment addresses the first one. As a starting point for discussion, here's my understanding of the model (based on https://github.com/ga4gh/schemas/compare/variation_annotation; if there's a better way to see the substance, please let me know).

Is that right? If so, that seems very parallel to today's VCF-based model, where you can run tools (e.g. Annovar) that add columns to a VCF file. Questions:

I'm sure others who are closer to the bio will have followup questions, but let's first make sure we're all on the same page.

pcingola commented 9 years ago

@dglazer: The overall understanding of the model is correct.

Question 1: It is true that annotateVariants should have a set of parameters to specify which algorithms to run in the annotation; we should add that. In my opinion, there should be a sensible default (should this be implementation-specific?). For instance, in most cases it makes sense to add at least functional annotations; after all, that's the goal of 'annotateVariants'.
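
As a hedged sketch of what such parameters might look like (the algorithm names and the default set are invented for discussion, not part of the draft):

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical default: run at least functional annotation, since that
# is the core goal of annotateVariants.
DEFAULT_ALGORITHMS = ["functional_consequence"]

@dataclass
class AnnotateVariantsRequest:
    variant_ids: List[str]
    # Which annotation algorithms the server should run; implementations
    # could substitute their own sensible default set.
    algorithms: List[str] = field(
        default_factory=lambda: list(DEFAULT_ALGORITHMS))
```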

Question 2: Some of this is maintained implicitly in AnalysisResult, which tracks version numbers. But perhaps what you are asking for is the introduction of an "AnnotationSet" object. Is this correct?

Question 3: One of the goals here was to simplify the old PR #126. In that PR many people contributed interesting ideas, but after adding most of them the PR turned out "too big and complex to be accepted". Fortunately those ideas are now part of new groups such as g2p, ontologies and genomic-features, so we can just refer to those objects.

andrewjesaitis commented 9 years ago

:+1: on the content-addressable data id field.

One field that was listed as mandatory in the VCF Annotation Format doc that @pcingola and others created, but that doesn't seem to be present here, is "Putative_impact". Should this be included in the schema?

Another field that might be useful is the "warning" field proposed in that same document. It was listed as optional; however, I think the defining difference between a warning field and the other fields listed as optional (e.g. distance to feature) is that it is not computable after the annotator has run. These warnings and errors need to live at the level of the variant-feature pair, as they do not apply to the entire analysis run. Additionally, this field could be useful for identifying difficult-to-annotate variants programmatically.

Finally, with respect to @dglazer's question 1, the schema doesn't seem to support an arbitrary key/value map, correct? (Maybe this is the purpose of array<AnalysisResult>?) In any case, should the default set of annotations to produce just be the fields defined in the Annotation schema? Even with an arbitrary mapping field, this still seems like a sensible default, since the schema covers all the mandatory fields defined in the variant annotation format doc.
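
Pulling those threads together, a minimal sketch of how a putative impact, per variant-feature warnings, and an arbitrary key/value map might sit in the record (all names here are assumptions for discussion, not the draft schema):

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class TranscriptEffect:
    feature_id: str        # the transcript or feature the variant hits
    putative_impact: str   # e.g. "HIGH" or "MODERATE", per the VCF annotation format doc
    # Warnings belong at the variant-feature pair level: they are emitted
    # while the annotator runs and cannot be recomputed afterwards.
    warnings: List[str] = field(default_factory=list)

@dataclass
class Annotation:
    variant_id: str
    transcript_effects: List[TranscriptEffect] = field(default_factory=list)
    # Arbitrary key/value pairs for optional, tool-specific outputs
    # (e.g. distance to feature).
    info: Dict[str, List[str]] = field(default_factory=dict)
```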

dglazer commented 9 years ago

Thanks @pcingola; more inline.

Question 1: It is true that annotateVariants should have a set of parameters to specify which algorithms to run in the annotation; we should add that. In my opinion, there should be a sensible default (should this be implementation-specific?). For instance, in most cases it makes sense to add at least functional annotations; after all, that's the goal of 'annotateVariants'.

I think the number of possible annotation algorithms / flavors is huge, so I doubt any pre-defined set of parameters can be close to complete. One option is to let each implementation define its own custom params. Another option is to instead provide a writeAnnotations call, so that people can write annotation clients that use whatever algorithm they want and store the output via the GA4GH API. Another option is a different overall data model -- see below.
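
For concreteness, one possible shape for that writeAnnotations option (entirely hypothetical; no such call exists in the draft):

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Annotation:
    """Minimal stand-in for the annotation record under discussion."""
    variant_id: str
    info: Dict[str, List[str]] = field(default_factory=dict)

@dataclass
class WriteAnnotationsRequest:
    # Hypothetical: an external client runs whatever algorithm it likes
    # and persists the output through the GA4GH API, grouped in one set.
    annotation_set_id: str
    annotations: List[Annotation] = field(default_factory=list)
```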

Question 2: Some of this is maintained implicitly in AnalysisResult, which tracks version numbers. But perhaps what you are asking for is the introduction of an "AnnotationSet" object. Is this correct?

Yes, or more precisely, AnnotationSets are one way of addressing the requirements I'm asking about. I think there are always going to be multiple ways to annotate any given variant, and ideally our API would support users who wanted to work with more than one of them, either at a time or over time.

From first principles, I think of one variant being associated with many possible annotations, and with many possible calls; annotations aren't directly associated with individual calls. That model is roughly captured in annotated VCFs by having one row per variant, with a column for each call, and a column for each annotation. I understand why that makes sense in a file-centric world, but it has the problem of needing to re-build and re-distribute the whole VCF, including all the calls, whenever you want to work with new or changed annotations.

In our API-centric world, I think we can do better, by defining a concept like AnnotationSet to manage multiple possible variant-to-annotation mappings without conflating them with the existing variant-to-call mapping. If that model makes sense, we can talk about API details; I believe CH has already shared some initial thoughts elsewhere.
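
A rough sketch of that decoupling, assuming an AnnotationSet that parallels how CallSets hang off a VariantSet (the names are invented for discussion, not a finalized design):

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class AnnotationSet:
    """One of possibly many annotation runs over a VariantSet; adding or
    re-running annotations never touches the variants or the calls."""
    id: str
    variant_set_id: str
    # Provenance: tool, version, reference data release, and so on.
    analysis: Dict[str, str] = field(default_factory=dict)

@dataclass
class Annotation:
    annotation_set_id: str
    variant_id: str  # annotations attach to variants, never to calls
```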

Question 3: One of the goals here was to simplify the old PR #126. In that PR many people contributed interesting ideas, but after adding most of them the PR turned out "too big and complex to be accepted". Fortunately those ideas are now part of new groups such as g2p, ontologies and genomic-features, so we can just refer to those objects.

I think this PR differs from #126 in two ways:

sarahhunt commented 9 years ago

@dglazer - thanks for the comments - we are clearly seeing this from different angles.

The key use-case we are considering is a set of variants being submitted to an annotation server, with results calculated on the fly and returned - in essence, making GA4GH-compliant VEP, SnpEff, etc. servers. Your comments suggest a model of multiple pre-calculated sets of annotation which can be mined, which makes total sense. (We haven't been able to do this, as we don't know in advance what data will be submitted. Reproducibility is handled by tying specific software versions to specific reference data releases - not currently modelled well here.)

We chose to add the Annotation record to the Variant record (and later planned to add it to the Allele record) as it keeps the API simple for end users and we don't need to worry about keeping location/path/etc. definitions in sync across two different variant/allele records. As calls are optionally returned, annotation could be optionally returned too.

Having explained how we got here, more details on what you are suggesting would be good.

sarahhunt commented 9 years ago

I've made a few updates based on comments so far.

@andrewjesaitis - thanks for the comments - I've copied 'impact' in from the previous PR and added support for arbitrary key/value pairs. I would hope most information would fit the AnalysisResult format, but I guess some will not. Would the warning information you mention fit this structure?

I've grouped the location information in a new alleleLocation record. @pcingola - what do you reckon to this? I'm not sure I like the name, but the concept is tidier.

@dglazer - I've not re-structured the variant <-> annotation relationship, pending further discussion, but I have added a draft annotation to start that debate.

Thanks all!

dglazer commented 9 years ago

@sarahhunt, thanks for your comments -- reading them, I realize I might be confused about something pretty fundamental in your model. Earlier I said:

introduce a new annotateVariants method that, given an array of Variants, uses an unspecified algorithm to associate an array of Annotations with each one. (Which I think implies that every "row" in a VariantSet can contain a single array of annotations.)

That implies that annotateVariants actually changes the contents of the VariantSet, with the results being persistently stored by the server, so that future users can retrieve Variants with annotations already in place. Is that right? When you say:

The key use-case we are considering is a set of variants being submitted to an annotation server, results calculated on the fly and returned

that sounds more like an ephemeral one-time lookup, with nothing being persisted on the server. Which model do you have in mind? (I don't think either is better on all dimensions; I just want to be sure we're discussing the same thing.)

Also -- sorry if I missed this earlier -- is anyone implementing a prototype of these new annotations methods in parallel with them being defined, or are we still in the rough consensus stage, with running code to follow?

pgrosu commented 9 years ago

@dglazer Are you also performing effect prediction such as with SnpEff? Could you help me by pointing me to where CH posted something regarding this, since I can't seem to locate it.

Some annotation tools have been implemented with parallelism in mind; it just depends on the scale and the type of parallelism we would like to adopt. For instance, GATK's VariantAnnotator has this via the -nt argument, using the TreeReducible interface:

http://gatkforums.broadinstitute.org/discussion/2266/using-the-gatk-api-to-annotate-my-vcf

https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_annotator_VariantAnnotator.php

Also SnpEff from Pablo (@pcingola) provides multithreaded support which can be used with the -t option.

It would be helpful if you could diagram something regarding the design you have in mind. I think I know what you mean, but you know what they say about assuming things :)

As an aside, it might be helpful to have a list with examples somewhere - including the inputs and outputs - of what is currently out there, just as a reference, so that we start from the same point. Then we can tweak several example code snippets, brainstorming ideas about the scalable approach(es) - and the scope of the feature set - that we would like our API to support.

Thanks, Paul

sarahhunt commented 9 years ago

Sorry @dglazer - I realise I asked for comments (after a long discussion on the variant annotation call) without giving much explanation, which is likely to confuse anyone not on the call!

We at Ensembl and @pcingola provide annotation tools to which people submit variant data that may be novel, so we calculate annotation on the fly and do not store the submitted variants or annotation. This is what motivated the annotateVariants method - we hoped to make these tools GA4GH compliant. The method was not intended to change any stored data.

@jacmarjorie raised the requirement for searching variant sets which have pre-calculated annotation which prompted the additions to searchVariants.

The idea behind updating the variant record was to group a set of annotation records within each variant (and later allele) record, so if the response is a batch of a few thousand variants, it is simple to loop through all annotations for each variant in turn, applying filtering logic as required. I quite liked the idea of being able to search for all potentially damaging variants across a gene region and the genotypes of these variants in a trio, but maybe this is better done as separate calls.
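
For illustration, with annotations grouped inside each variant record, that filtering loop stays simple on the client. A sketch using the hypothetical field names from earlier in this thread, not the actual schema:

```python
# Client-side filtering over a page of searchVariants results: keep the
# variants where any transcript effect has a damaging putative impact.
def damaging_variants(variants, impacts=("HIGH", "MODERATE")):
    """Yield each variant carrying at least one damaging annotation."""
    for variant in variants:
        for annotation in variant.get("annotations", []):
            if any(te.get("putative_impact") in impacts
                   for te in annotation.get("transcript_effects", [])):
                yield variant
                break  # one damaging consequence is enough
```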

I have a prototype of the new methods - it's pretty rough as I don't want to spend time on optimisation until we have consensus, but it was needed to test the proposal would work.

andrewjesaitis commented 9 years ago

Looks good to me @sarahhunt.

Yes, I think warnings/errors can easily fit into the key/value store. And it looks like all the mandatory annotation fields are now top-level fields, and all the optional outputs can fit in info.

Thanks!

calbach commented 9 years ago

A few comments following offline discussion:

After thinking further, I think there are two logical pieces to this PR: (1) defining a variant annotation resource: how annotations are persisted, how they can be queried, and how they are grouped (e.g. AnnotationSets); and (2) adding a stateless /variants/annotate method which utilizes this resource. (1) is independent; (2) depends on (1).

Variant annotation resource

I think it's really important to get this right independently. My only major concern with the current proposal is putting annotations directly on the variant. I see that it adds client convenience, but it also adds a good deal of complexity to the already-large searchVariants endpoint; implementations will have to support or handle all combinations of annotation- and call-specific fields. It also feels like there's some unnecessary conflation with call data. Perhaps this will be cleaner on the allele and genotype resources once those are finalized.

The alternative would be to have Annotation as a separate object and refer to the annotated variant by allele/genotype and/or variant id.
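
Under that alternative, clients would fetch annotations through their own search call rather than inline on the variant. A hypothetical sketch (no such endpoint exists in the draft):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SearchAnnotationsRequest:
    """Hypothetical endpoint: annotations live as their own resource and
    point back at variants by id, leaving searchVariants untouched."""
    annotation_set_id: str
    variant_ids: List[str] = field(default_factory=list)
    page_token: Optional[str] = None
```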

/variants/annotate

Right now, I'm wondering whether this should be an application built on top of the GA4GH API, or part of it. I think the trick to adding it as a method will be striking a balance between specificity of behavior (and usefulness to clients) and feasibility of implementation. I imagine the true full list of parameters to this method is quite long and complicated. A few comments:

bartgrantham commented 9 years ago

I also haven't been on the annotation calls, although I think I will start because it sounds interesting.

The key use-case we are considering is a set of variants being submitted to an annotation server, results calculated on the fly and returned - in essence, making GA4GH-compliant VEP, SNPEff, etc servers.

+1! All I'd like to see is for the schema/protocol to support annotation as a service.

In my mind that means the annotation server acting as a proxy to the data API, streaming variants through it and annotating them as they pass, i.e. the annotator pulls variants from the data service. The implication is that the annotation server's request type contains the address of the data server and the data request payload, in addition to the annotation-specific fields (database versioning, etc.). Alternatively it could be defined as a superset of a variants request with the annotation-specific fields added, but I feel the former is cleaner.

I may be mistaken, but the model being proposed appears to be one where the user uploads data to the annotation service, which then spools it up and returns it, i.e. the user pushes variants to the annotator. If so, I worry about scalability. The "annotator pull" model allows annotators to work like generators and doesn't require them to keep a lot of data in flight to service a request. Also, in the "user push" model the variant data uses triple the user's bandwidth (and, from a security perspective, increases the attack surface).
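
A minimal sketch of that "annotator pull" shape, with the annotation server streaming variants from the data server generator-style (the function stubs here are hypothetical, not real API calls):

```python
from typing import Dict, Iterable, Iterator

def pull_variants(data_server_url: str, request: Dict) -> Iterator[Dict]:
    """Stub for paging through the data server's variant search; a real
    client would yield one variant per page item, keeping little in flight."""
    yield from ()  # hypothetical: replace with paged requests to data_server_url

def annotate(variant: Dict) -> Dict:
    """Stub for whatever annotation logic this server implements."""
    variant.setdefault("annotations", [])
    return variant

def annotate_stream(data_server_url: str, request: Dict) -> Iterable[Dict]:
    # The annotation server acts as a proxy: variants stream through and
    # are annotated as they pass, so no large batch is held in memory and
    # the user never shuttles data between servers.
    for variant in pull_variants(data_server_url, request):
        yield annotate(variant)
```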

that sounds more like an ephemeral one-time lookup, with nothing being persisted on the server. Which model do you have in mind?

If there's sufficient database versioning in the request/response, what's the difference if it's computed on the fly or if it's cached? Maybe it hasn't been explicitly stated, but IMHO supporting referential transparency is a critical goal here for both research (reproducibility) and clinical use (liability).

pgrosu commented 9 years ago

There is no reason not to have both approaches implemented. So the annotation process can go through multiple phases:

  1. An initial one to map to Annotation data stores/objects that have become more stable.
  2. A secondary, more customized one where the variants are pushed through a service to provide a more comprehensive result including effect prediction, among other things.

@calbach, could you please provide some documentation/diagrams/code examples of what you have in mind? Then we can focus the discussion through those examples.

@bartgrantham, I understand the concern regarding scalability, but these days such problems have largely been solved in the areas of distributed/parallel algorithms and data structures. Once we have some concrete diagrams and logic for how the different approaches should be implemented - including code examples - we can adjust accordingly and categorize each approach.

Thanks, Paul

pgrosu commented 9 years ago

Just got this from the Broad. It's a service for annotating cancer-related variants that draws information from 14 publicly available cancer resources - so it could combine both approaches:

http://www.broadinstitute.org/oncotator/

(Two screenshots of the Oncotator web interface were attached.)

sarahhunt commented 9 years ago

@calbach - thanks for the comments. After discussions with you and @diekhans, I've moved variants/annotate to a separate annotateAllelemethods protocol. As there are advantages and disadvantages to both approaches to the variant <-> annotation relationship, I thought it would be worth discussing with the broader group later today.

@bartgrantham - nice idea to have the annotator pulling variants from a specified data server! We don't expect all variants requiring annotation to exist in public collections, which is why we wanted to support the submission of variant objects, but that doesn't mean we can't allow both models.

pgrosu commented 9 years ago

@sarahhunt I can't seem to find where annotateAllelemethods is defined. Could you please point me to the file and the time/place that the discussion will take place?

Thanks, Paul

bartgrantham commented 9 years ago

@sarahhunt - I actually wasn't thinking about public datasets so much as about the GA4GH APIs becoming industry standards. Some vendors can sell a data management layer, some an annotation layer, etc., with all layers running entirely within a single customer's network. My mind is really on a clinical use case, where a "pluggable layer" approach would be most welcome.

IMHO, the UI (or program) where a user is presented with (or analyzing) the annotation data shouldn't need to juggle querying and collecting the variants, then shuttling them to the annotation servers, then collecting the resulting annotated data. But the more I think about the separation of concerns with respect to parallelization, the more I think this kind of API layering will ultimately need a mechanism for demultiplexing/multiplexing the data streaming between layers. Probably getting ahead of myself. :)

sarahhunt commented 9 years ago

@bartgrantham - I see what you mean. You're thinking a bit further ahead than I was. I've added a comment about this for now until we try some prototyping.

pgrosu commented 9 years ago

@sarahhunt I think @bartgrantham might be referring to some of the ideas I posted a month ago at the following link:

https://github.com/ga4gh/schemas/issues/264#issuecomment-90081657

In any case, one step at a time :) Let's go through https://github.com/ga4gh/schemas/pull/302 and hash it out, then test it against a server, and after that build a scalable proof of concept, since there are several approaches in that domain we can try out.

I'm excited - it'll be fun ;) ~p