ga4gh / ga4gh-schemas

Models and APIs for Genomic data. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0

Best practices for joining NGS-derived & clinical variation databases #159

Closed · mlin closed this issue 9 years ago

mlin commented 9 years ago

This issue is meant for crowdsourcing knowledge about joining NGS/VCF-derived variant databases, which the GA4GH schemas are oriented towards, and clinical variation databases, which tend to prefer HGVS nomenclature for clinical genetics. By "joining" I do mean loosely in the SQL sense, that is, figuring out which records in different databases refer to "the same variant", however defined. There are numerous challenges here, such as agreeing upon reference transcript/CDS annotations and the many interesting normalization issues & corner cases inherent in both VCF and HGVS nomenclature, and it would be really useful if GA4GH could help reduce global duplication of effort/hair-pulling going forward.

Some links to kick off:

Code/libraries:

My apologies for the many omissions undoubtedly committed here. Please pile on: links, cautionary tales, recommendations, etc.!

pcingola commented 9 years ago

Hi @mlin, the material you link is great and I'll go over it in detail to make sure we cover all the issues and corner cases in PR #126. I think most of your recommendations are covered in PR #126, but please feel free to point out specific cases you feel are not addressed (or not clear enough). The timing of your comments is also good, since we still have to finish the methods section(s). It will be valuable to have your clinical point of view.

pgrosu commented 9 years ago

@mlin, I concur with @pcingola that this definitely is a great set of resources. We can at least summarize and consolidate the ideas and approaches.

Thank you for sharing, Paul

mlin commented 9 years ago

@pcingola Thanks - actually what might be most relevant here is if you would discuss your experience computing HGVS names for VCF variants in snpEff. What pitfalls did you encounter and what are the remaining gaps? The opposite direction is also useful. Basically, beyond the question of where should a HGVS name appear in a VCF-like schema, we have many groups (ourselves included) coming up with ways to do joins on the two representations, and it would be good to share knowledge/experience.

That is the first time I've heard "[my] clinical point of view" referenced; very flattering ;-)

pcingola commented 9 years ago

Indeed, we recently discussed some of these issues with ENSEMBL's VEP team and found a few common pitfalls that we agree should be addressed. One of my concerns about HGVS is that, in recent years, it seems to have evolved into a very sophisticated and complex nomenclature in order to cover all possible cases. Most (actually all) clinicians I have met so far fail to understand HGVS variants in any but the simplest cases. Nevertheless, I think standardization is necessary, and this particular issue will be solved as the notation becomes more popular.

lh3 commented 9 years ago

@mlin Are you worrying about "one variant, multiple representations"? If so, perhaps we could put the variant, no matter how complex the syntax is, back into its sequence context and then apply the VCF convention.

Take the example from the GoldenHelix post. Suppose the REF is GAAC and we have an HGVS variant g.2_3delinsTT or g.2_3inv. In both cases, the ALT sequence is unambiguously GTTC. For this REF/ALT pair, one VCF notation would be POS=2, REF=AA, ALT=TT, though there are other ways as well.
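A minimal sketch of that reduction (illustrative helpers only, not a real HGVS parser): apply the edit to the reference, then trim the common flanks to recover a VCF-style record.

```python
# Minimal sketch, assuming simple g. coordinates on a short reference string;
# apply_delins/to_vcf_like are illustrative helpers, not library functions.

def apply_delins(ref_seq, start, end, insert):
    """Replace 1-based positions start..end of ref_seq with `insert`."""
    return ref_seq[:start - 1] + insert + ref_seq[end:]

def revcomp(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def to_vcf_like(ref_seq, alt_seq):
    """Trim the shared suffix and prefix, returning a minimal (pos, ref, alt)."""
    while ref_seq and alt_seq and ref_seq[-1] == alt_seq[-1]:
        ref_seq, alt_seq = ref_seq[:-1], alt_seq[:-1]
    pos = 1
    while ref_seq and alt_seq and ref_seq[0] == alt_seq[0]:
        ref_seq, alt_seq, pos = ref_seq[1:], alt_seq[1:], pos + 1
    return pos, ref_seq, alt_seq

ref = "GAAC"
alt_delins = apply_delins(ref, 2, 3, "TT")            # g.2_3delinsTT
alt_inv = apply_delins(ref, 2, 3, revcomp(ref[1:3]))  # g.2_3inv
assert alt_delins == alt_inv == "GTTC"                # both edits yield the same sequence
print(to_vcf_like(ref, alt_delins))                   # (2, 'AA', 'TT')
```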

mlin commented 9 years ago

@pcingola Thanks - your conversation with the VEP team sounds like exactly the kind of information I was hoping might be illuminated. So I for one am in suspense :) To your second point, I agree but in fairness, I bet plenty of medical geneticists would have choice words about VCF, haha.

@lh3 That's certainly where I'd start as an NGS-biased individual, and (as you know better than I) there remain some subtleties in intersecting/merging records in the VCF data model. Others might do the exact opposite, as in the LOVD TR above. I wonder if anyone watching has experienced different approaches and learned some lessons about making the data most informative to support clinical interpretation.

sarahhunt commented 9 years ago

Hi Mike,

The two key issues we have observed in Ensembl/VEP are indel positioning and transcript choice.

In VCF, an indel is assigned to the left-most possible position, while in HGVS annotation it is assigned to the most 3' position. This means that to map indels in forward-strand transcripts between the two systems, you have to take the flanking sequence into account and shift the position accordingly. This can change the position, the alt string and the transcript consequence (exonic <-> intronic, etc.).
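A small sketch of that shifting issue, using a made-up CA-repeat sequence and a hypothetical helper rather than any VEP code: the same 2 bp deletion lands at the left-most position under the VCF convention and at the most-3' position under the HGVS convention.

```python
# Toy example only; coordinates are 1-based on a made-up reference.

def shift_deletion(seq, start, length, direction):
    """Slide a deletion of `length` bases starting at `start` as far as possible
    left (direction=-1) or right (direction=+1) while the edited sequence is unchanged."""
    def deleted(s):
        return seq[:s - 1] + seq[s - 1 + length:]
    target = deleted(start)
    pos = start
    while True:
        nxt = pos + direction
        if nxt < 1 or nxt + length - 1 > len(seq) or deleted(nxt) != target:
            break
        pos = nxt
    return pos

seq = "GCACACACAT"                    # CA repeat; delete one CA unit
print(shift_deletion(seq, 4, 2, -1))  # 2: left-most placement (VCF)
print(shift_deletion(seq, 4, 2, +1))  # 8: most-3' placement (HGVS, forward strand)
```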

Of course, different transcripts could give different annotation, so comparing HGVS to HGVS can also be problematic. The LRG project (http://www.lrg-sequence.org/) seeks to address this, but has far from complete coverage at the moment.

As of our latest release, we provide a VEP option to output 3' shifted HGVS annotation and we plan to provide shifted annotation as default in the browser in our next release.

HGVS annotation would not fit well in the VCF schema: VCF represents the variant discovery step and is not usually transcript-aware. HGVS is more suited to a variation annotation schema, as proposed by Pablo.

Best wishes,

Sarah


mlin commented 9 years ago

Thank you! This is very informative. You're right about the schema, of course.

I thought this might be of interest - excerpt from a clinical report produced by a major CAP-accredited NGS interpretation lab:

There's an obvious issue here to your point about transcript choice (the raw data is provided separately, doubtful it's often consulted). Perhaps they ought to change this, perhaps they worry transcript IDs aren't stable, or perhaps they've concluded this is what's useful to the medical geneticist / genetic counselor...

mdrasmus commented 9 years ago

@mlin In terms of a "primary key" for alleles, we use (chrom, offset, ref, alt), as in the VCF standard. That means indels are always left-aligned and padded with one base on the left.

One downside of this approach is that for large indels the primary key becomes unwieldy because of the large ref or alt sequence (note that you could avoid a large ref by storing start and stop instead of offset and ref). For one application, I simply took the SHA-1 hash of the string chrom:offset:ref:alt to get an id of consistent size. I could then literally make that SHA-1 the primary key of an allele in my database and do joins on it. Such an id is great for joining, but of course it is opaque, so it is not very useful to human readers by itself; then again, neither are dbSNP ids.
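A minimal sketch of that key scheme; the field separator and upper-casing are illustrative assumptions rather than any agreed standard.

```python
import hashlib

def allele_key(chrom, offset, ref, alt):
    """SHA-1 of 'chrom:offset:ref:alt' for a left-aligned, 1bp-padded allele."""
    text = f"{chrom}:{offset}:{ref.upper()}:{alt.upper()}"
    return hashlib.sha1(text.encode("ascii")).hexdigest()

# Two databases that normalize alleles the same way can compute and join on
# the same 40-character key without exchanging the (possibly huge) alleles.
print(allele_key("chr1", 10177, "A", "AC"))
```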

I think of HGVS names as a rendering of an allele, since there are so many possibilities and achieving fully normalized HGVS names is really hard. Full HGVS rendering is especially hard if one uses RefSeq transcripts (the standard in clinical applications), because liftOver (aka clinical remapping) is a subproblem due to indel and allele differences with the reference genome hg19/hg38. Clinical PDF reports are in some sense a rendering. Hopefully one would include the transcript id so that the name is at least unambiguous (if not unique).

brendanofallon commented 9 years ago

It's probably worth noting that for some 'variants' in HGVS there is no simple VCF representation. For instance, c.[76A>C];[426G>T] is valid HGVS that denotes two separate variants. One could create a separate VCF entry for each, I suppose, but there's no reliable way to then go back to the original HGVS. Is the hope to do this without relying on additional information stored in the INFO or FORMAT fields? Doing it using only chr, pos, ref and alt could be a tall order, at least for the more complex HGVS variations.
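For what it's worth, splitting the compound notation into simple per-variant strings is straightforward; it is the reverse direction, reassembling the original compound string from independent VCF rows, that needs phase information which (chrom, pos, ref, alt) alone does not carry. A rough sketch, with a hypothetical helper and no HGVS validation:

```python
import re

def split_compound_cdna(hgvs):
    """Split e.g. 'c.[76A>C];[426G>T]' into ['c.76A>C', 'c.426G>T']."""
    prefix, _, body = hgvs.partition(".")
    return [f"{prefix}.{m}" for m in re.findall(r"\[([^\]]+)\]", body)]

print(split_compound_cdna("c.[76A>C];[426G>T]"))  # ['c.76A>C', 'c.426G>T']
```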

reece commented 9 years ago

I'm coming to the GA4GH party a bit late. Here are my two bits on some of the above.

richarddurbin commented 9 years ago

Sorry - this is very late. I thought I sent it a week or so ago, but just found it in my inbox.

I also see HGVS as a rendering, not a primary representation.

I am puzzled by the example from Brendan. Why is it not OK to have two separate VCF entries to represent the compound HGVS string that also denotes two variants? I am interested in why he says that there is not a reliable way to go back to the original HGVS.

After the San Diego meeting I feel I have greater clarity on the issues around the string record representations (HGVS and VCF) and the "graph" representations such as I proposed in https://github.com/ga4gh/schemas/blob/master/src/main/resources/avro/wip/variationReference.avdl. I realise that all of these are edit representations. For SNPs and indels the definition of a Variant in the variationReference API is a replacement by a literal string of material between two end points on the existing reference. This is close to Matt Rasmussen's suggestion of giving start and stop, which is good for large deletions, but no real help for large insertions (though it is hard to see how to specify these precisely without giving the inserted sequence). The same approach is more general, though, in seamlessly supporting inversions, translocations, etc., which are more clumsy in VCF, running across multiple records. I guess HGVS also supports these.
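A toy sketch of that replacement-edit idea, with hypothetical field names rather than the actual variationReference.avdl schema: a SNP, a deletion, an insertion and an inversion all become the same kind of record.

```python
from dataclasses import dataclass

def revcomp(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

@dataclass
class ReplacementEdit:
    start: int        # 0-based, inclusive
    end: int          # 0-based, exclusive
    replacement: str  # literal sequence replacing reference[start:end]

    def apply(self, reference):
        return reference[:self.start] + self.replacement + reference[self.end:]

ref = "GATTACA"
edits = [
    ReplacementEdit(3, 4, "G"),                # SNP: T -> G
    ReplacementEdit(1, 4, ""),                 # deletion of ATT
    ReplacementEdit(4, 4, "CCC"),              # insertion of CCC after GATT
    ReplacementEdit(1, 5, revcomp(ref[1:5])),  # inversion of ATTA
]
for edit in edits:
    print(edit, "->", edit.apply(ref))
```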

In any case, with any edit-based representation we need to have a way to generate a canonical representation if we want to compare variants at the representation level, rather than require conversion to sequence and comparison at that level. (I presume we can take as given the goal that two variants are identical if and only if they result in the same sequence.) The left alignment rules for VCF are one way to do that. For graphs one can use de Bruijn exact k-mer matching, but this (a) collapses repeats, and (b) expands differences. (a) can be solved by requiring the context k-mer to be sufficiently long to be unique. Benedict Paten et al.'s mapping proposal aims to avoid (b) by the left:right idea. I am sorry, I don't know the equivalent rules for HGVS.

At the core, generating a canonical edit representation is dependent on an alignment process. We need to agree on a standard procedure for aligning new sequences to a reference, including I believe one that aligns to a reference including known variation.

Richard


haussler commented 9 years ago

+1

(very strongly +1) !!


bioinformed commented 9 years ago

Hi Richard,

I agree with almost everything you wrote except possibly your conclusion. I suspect that is because I see variant comparison as a non-boolean function that can result in exact equivalence, compatibility, non-equivalence, and potentially non-informative states. If phase were known with certainty, then canonicalization of edit-based variant representations under equivalent alignment models could be possible. Without phase certainty, exact equivalence is much harder to achieve (using short reads) and compatibility is often the more meaningful result. That is, in practice, failure to match two variants exactly, with identical alleles but non-identical yet compatible phase constraints, is often a false negative. More formally, I can prove that there is no succinct canonical edit representation for some complex variants, even under the same alignment model. In these cases, comparison can be formulated as solving a non-trivial subgraph isomorphism problem between phase-constrained variant graphs. This graph comparison is conceptually equivalent to performing ploidy-, zygosity- and phase-aware sequence comparison, though hopefully avoiding materializing an exponential number of possible genotypes.
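A rough sketch of the simplest piece of this, the "phase known with certainty" case, in which two phased variant sets are equivalent iff they yield the same sequence. The enum lists the four outcomes described above, but this toy only ever reports the first and third; compatibility and non-informative calls require exactly the uncertainty it assumes away. The record layout is hypothetical.

```python
from enum import Enum

class Match(Enum):
    EQUIVALENT = 1       # same resulting sequence
    COMPATIBLE = 2       # consistent under some phasing (not modeled here)
    NON_EQUIVALENT = 3   # different resulting sequences
    NON_INFORMATIVE = 4  # no overlapping assertion (not modeled here)

def apply_edits(reference, edits):
    """Apply phased (0-based pos, ref, alt) edits to a reference string."""
    seq, shift = reference, 0
    for pos, ref, alt in sorted(edits):
        assert seq[pos + shift:pos + shift + len(ref)] == ref, "edit does not match reference"
        seq = seq[:pos + shift] + alt + seq[pos + shift + len(ref):]
        shift += len(alt) - len(ref)
    return seq

def compare_phased(reference, edits_a, edits_b):
    same = apply_edits(reference, edits_a) == apply_edits(reference, edits_b)
    return Match.EQUIVALENT if same else Match.NON_EQUIVALENT

# One delins vs. two adjacent SNPs asserted on the same haplotype: equivalent.
print(compare_phased("GAAC", [(1, "AA", "TT")], [(1, "A", "T"), (2, "A", "T")]))
```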

I'm writing up these ideas more formally and will share them with the group.

-Kevin


haussler commented 9 years ago

What you say is true Kevin. You don't need to prove it to me, I've seen plenty of examples. But we have to beware of trying to "do it all at once". I have also been guilty of that in the past. So I think the first step is to formalize the "phase is known with certainty" case, and get some actual code pilots working that use that simpler model. We should certainly explicitly acknowledge this limitation, and leave in hooks to deal with the more complex cases of uncertain phasing, and then generalize to include the more complex cases asap. -D


pcingola commented 9 years ago

+1 "HGVS as a rendering"

bioinformed commented 9 years ago

@haussler: Keeping things simple is my guiding principle here. I have kept a dozen embellishments and complexities out of my initial scope in order to come up with a compelling solution that can later be extended. In particular, I'm hoping to find an approach that is alignment-model agnostic, which seems to be complicating some of the approaches in the "phase known" models. I realize the perils of over-reach, but I hope to have practical results for comment shortly.


awz commented 9 years ago

@haussler @richarddurbin @bioinformed Wholly agree with Kevin's comments here.

Really excited to see what Kevin produces!

lh3 commented 9 years ago

I guess @brendanofallon meant to show two phased variants distant from each other. If we put them into two VCF lines, we have to use a phase set, which is inconvenient, I have to say. We could use long REF/ALT alleles spanning the two SNPs, but that is a bit redundant.

On the representation of variants, the invariant is the sample sequence. If we want to answer a question like "is there a 3bp deletion at positions 101-103?", we have to introduce alignment. I don't think there is a workaround unless we stop asking such edit-based questions.

My preference is to separate sequence calling from variant representation. We keep the actual sequence as the primary object and generate the edits/variants through a standardized alignment procedure (e.g. match=1, mismatch=-1, gapOpen=-1 and gapExt=-1; left-aligned; semi-global), either to a linear reference or to a graph. Some modern callers have already taken this approach: they first generate haplotypes and then determine the alleles (in contrast, old callers call alleles from the edits).
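A toy sketch of that "sequence first, edits second" flow: globally align the called haplotype to the reference with the scores above (simplified to linear gaps, and without the left-alignment pass), and only then read the variants off the alignment.

```python
def align(ref, alt, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment; returns two gapped strings."""
    n, m = len(ref), len(alt)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if ref[i - 1] == alt[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    a, b, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (match if ref[i - 1] == alt[j - 1] else mismatch):
            a.append(ref[i - 1]); b.append(alt[j - 1]); i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            a.append(ref[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(alt[j - 1]); j -= 1
    return "".join(reversed(a)), "".join(reversed(b))

def edits_from_alignment(gapped_ref, gapped_alt):
    """Collect maximal mismatching runs as (0-based ref pos, ref allele, alt allele)."""
    edits, pos, cur = [], 0, None
    for r, q in zip(gapped_ref, gapped_alt):
        if r == q:
            cur = None
        else:
            if cur is None:
                cur = [pos, "", ""]
                edits.append(cur)
            cur[1] += r.replace("-", "")
            cur[2] += q.replace("-", "")
        if r != "-":
            pos += 1
    return [tuple(e) for e in edits]

ref, sample = "GCACACACAT", "GCACACAT"   # haplotype with one CA unit deleted
print(edits_from_alignment(*align(ref, sample)))  # one 2bp deletion; placement depends on the alignment
```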

> In any case, with any edit-based representation we need to have a way to generate a canonical representation if we want to compare variants at the representation level, rather than require conversion to sequence and comparison at that level.
>
> At the core, generating a canonical edit representation is dependent on an alignment process. We need to agree on a standard procedure for aligning new sequences to a reference, including I believe one that aligns to a reference including known variation.

I doubt a standard alignment procedure alone is sufficient. Even if we use the same alignment strategy, the alignment will vary with the input sequence. In rare cases, we might have to resort to sequence comparisons, I guess.

benedictpaten commented 9 years ago

> In any case, with any edit-based representation we need to have a way to generate a canonical representation if we want to compare variants at the representation level, rather than require conversion to sequence and comparison at that level.
>
> At the core, generating a canonical edit representation is dependent on an alignment process. We need to agree on a standard procedure for aligning new sequences to a reference, including I believe one that aligns to a reference including known variation.
>
> I doubt a standard alignment procedure alone is sufficient. Even if we use the same alignment strategy, the alignment will vary with the input sequence. In rare cases, we might have to resort to sequence comparisons, I guess.

Hi Heng. Adam Novak, David H and I have put serious time into the notion of stable mapping. The basic idea is that if a position in an input sequence maps, no extension of that input sequence can make it map elsewhere. There is some subtlety, but I think it will address your concerns? We're writing a paper on the subject; we'll forward it to the list shortly.


pgrosu commented 9 years ago

Kevin - I was wondering if you might have an update for us? I would be very excited to read it.

Thanks, Paul

skeenan commented 9 years ago

This hasn't had a comment since November. I'm closing this in 2 days.

skeenan commented 9 years ago

Closing due to inactivity. This can be reopened if necessary.