basic standardisation - deletion alleles and start/stop coordinates

antbro commented 10 years ago

It seems there may be different views and practices regarding how we should specify deletion alleles ("^", "*", "_", "-",...) and start/stop coordinates (first base, base before, last base, base after). I suggest it may be good to review who uses which alternatives and why, and ideally settle on a GA4GH standard for these very basic items.

bioinformed commented 10 years ago

I believe strongly in showing coordinates to end users in a system that they understand, which often means supporting multiple coordinate schemes for different communities and standards (rather than trying to get everyone to agree on a single representation). However, I believe even more strongly that all "back end" coordinate APIs, internal storage formats and arithmetic should use 0-based half-open coordinates, also called interbase coordinates or UCSC coordinates. We can review HGVS and other end user syntaxes, but for the remainder of this post I'm going to address only the "back end" representation.

For completeness, here are all non-degenerate cases by variant type (assuming 0 <= a <= b <= chromosome length):

class	start	stop	ref length (stop - start)	alt length
SNV	a	a + 1	1	1
Insertion	a	a	0	> 0
Deletion	a	b	> 0	0
MNV1	a	b	> 1	> 0
MNV2	a	a + 1	1	> 1

No special notation is needed for null alleles for insertions or deletions -- they are merely empty strings for reference or alternative, respectively. This convention avoids picking special characters, adding padding bases and a variety of other unnecessary complexities.
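
The table above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `classify` is a hypothetical helper name (not part of any GA4GH API), the two MNV rows are folded into a single "MNV" label, and null alleles are plain empty strings as proposed in this comment.

```python
def classify(start: int, stop: int, alt: str) -> str:
    """Classify a variant from its 0-based half-open (interbase) interval.

    `classify` is a hypothetical helper used for illustration only.
    """
    if not 0 <= start <= stop:
        raise ValueError("expected 0 <= start <= stop")
    ref_len = stop - start  # reference length is just the interval width
    alt_len = len(alt)      # an empty string is the null alternate allele
    if ref_len == 0 and alt_len == 0:
        raise ValueError("degenerate: empty reference and empty alternate")
    if ref_len == 0:
        return "insertion"
    if alt_len == 0:
        return "deletion"
    if ref_len == 1 and alt_len == 1:
        return "SNV"
    return "MNV"


examples = [(10, 11, "T"), (10, 10, "AG"), (10, 13, ""), (10, 12, "TT")]
classes = [classify(*e) for e in examples]
print(classes)
```

Note that no padding base or gap character is needed anywhere: the interval width and the length of the alternate string carry all the information.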

antbro commented 10 years ago

Coordinates: I agree the interbase system is attractive, and indeed I was seeking views and discussion about alternative "back end" preferences and practices.

Null allele notation: empty strings are easily missed or lost in processing, so a notation character would be safer. The "-" character seems to be the most widely used.

Perhaps it's all simple then - these could be the convention for GA4GH APIs? [Or perhaps they de facto already are?]

jeromekelleher commented 10 years ago

I'm absolutely in favour of standardisation, and I like the ideas expressed above. However, I'd be strongly against using a "-" or anything else in the referenceBases and alternateBases fields to denote a null allele; these fields should only include bases and not any encoded auxiliary information. If it's necessary, we should have an explicit field like isNullAllele or have an enum for the different types of alleles/variants.

richarddurbin commented 10 years ago

I don't understand why we are discussing this. The current API is clear and derived from VCF, which is also a GA4GH specification. As Jerome says, it has no gap character, so no use of '-', '_', or '*'. Instead it uses pure replacement semantics, specifying a string in the reference which is to be replaced by a different string in the alternate. Coordinates are 0 based. This is robust, easy to parse, and has worked for tens of millions of variants in large scale sequencing projects. None of the proposed future graph representations include gap characters. All use 0-based coordinates.

Richard

The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

antbro commented 10 years ago

Hi Richard - I triggered this thread after I saw the topic flagged up in the MME API. Others I spoke with were unsure of the situation, felt there may not be a consensus yet, and suggested I pose the question more widely to the group. I am glad it was so straightforward to get the answers (which it seems everyone agrees with). Beyond the backend processing we've been discussing, several different systems are in use for human readable data. I wonder if it would make sense to issue a GA4GH convention on these matters, to guide newcomers who may be creating GUIs for such purposes?

sdumitriu commented 10 years ago

One issue that we've encountered is that some tools (if I remember correctly, JAnnovar and/or Exomiser) do output variants without a common prefix for insertions-deletions, which, although not valid according to the VCF specification, we'd still like to be able to process and expose through GA4GH APIs. Getting back the prefix is a task that would affect the performance.

lh3 commented 10 years ago

In addition to VCF, the other widely used mutation annotation system is HGVS which was discussed in #159. Tools should stop inventing new in-house representations of INDELs. I wouldn't mind if GA4GH ignores a few tools that do not conform to standards. The mainstream annotators all support VCFs.

bioinformed commented 10 years ago

I'm generally not in favor of supporting broken implementations of standards. However, I believe that any sane VCF implementation should strip leading and trailing reference bases added for padding as a recommended normalization step. The avoidance of empty alleles in the VCF spec, in my view, is a concession to end user formatting that only creates problems when processing any sufficiently complex VCF data. I'm happy to provide examples, but this issue may not be the right place to dive down this particular rabbit hole. Standards compliant VCF writers will have to re-insert the padding, which can generally be cached to avoid most reference lookups.
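
A minimal sketch of the normalization described above, assuming Python and two hypothetical helpers, `trim_padding` and `add_padding` (neither is taken from an existing VCF library). The first strips shared leading and trailing bases and returns an interbase interval; the second re-inserts one padding base when an allele is empty, which is what a standards-compliant VCF writer would need to do.

```python
def trim_padding(pos: int, ref: str, alt: str) -> tuple[int, int, str, str]:
    """Convert a 1-based VCF (POS, REF, ALT) into (start, stop, ref, alt)
    in interbase coordinates, stripping bases shared by both alleles."""
    start = pos - 1
    # Strip shared trailing bases first, then shared leading bases.
    while ref and alt and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    while ref and alt and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        start += 1
    return start, start + len(ref), ref, alt


def add_padding(reference: str, start: int, ref: str, alt: str) -> tuple[int, str, str]:
    """Convert back to a 1-based VCF (POS, REF, ALT). If either allele is
    empty, prepend the preceding reference base, or append the following
    base when the variant sits at the very start of the sequence."""
    if ref and alt:
        return start + 1, ref, alt
    if start > 0:
        base = reference[start - 1]
        return start, base + ref, base + alt
    base = reference[start + len(ref)]
    return 1, ref + base, alt + base


reference = "ACGT"  # toy reference sequence, purely illustrative
trimmed = trim_padding(1, "A", "AG")
restored = add_padding(reference, trimmed[0], trimmed[2], trimmed[3])
print(trimmed, restored)
```

Only the single flanking base is looked up when writing, which is why caching it avoids most reference lookups.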

antbro commented 10 years ago

IMHO, the HGVS standard is a bit strange in the way it handles indels. And FYI, HGVS and ISCN are now attempting to align their respective nomenclature systems.

HGVS nomenclature uses 'start' and 'end/stop' as follows:

They number bases, not junctions between bases.
For insertions, the 'start' and 'end/stop' bases are those BETWEEN which the insertion takes place.
For deletions, the 'start' and 'end/stop' bases are INCLUDED in the deletion.
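
To make the contrast with interbase coordinates concrete, here is a small sketch in Python. It assumes a hypothetical helper `hgvs_style_positions` that only computes the pair of 1-based base numbers implied by the rules above; it does not produce or parse real HGVS strings.

```python
def hgvs_style_positions(start: int, stop: int) -> tuple[int, int]:
    """Map an interbase interval to the two 1-based base numbers that the
    rules above would cite (hypothetical helper, not an HGVS library call).

    Deletion  (stop > start): first and last deleted base, both included.
    Insertion (stop == start): the two bases flanking the insertion point.
    """
    if not 0 <= start <= stop:
        raise ValueError("expected 0 <= start <= stop")
    if stop > start:
        return start + 1, stop
    if start == 0:
        # No base precedes position 0, so there is no flanking pair to cite.
        raise ValueError("insertion at the very start has no preceding base")
    return start, start + 1


deletion = hgvs_style_positions(4, 7)   # deletes 0-based bases 4, 5, 6
insertion = hgvs_style_positions(4, 4)  # inserts between 0-based bases 3 and 4
print(deletion, insertion)
```

The same interbase pair needs two different conversions depending on the variant type, which is the asymmetry this comment is pointing at.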

bioinformed commented 10 years ago

Re HGVS: don't forget (tandem) duplications provide the coordinates of the duplicated sequence.

antbro commented 10 years ago

And HGVS nomenclature defines duplications as an entity that's separate from the notion of insertion

lh3 commented 10 years ago

The avoidance of empty alleles in the VCF spec, in my view, is a concession to end user formatting that only creates problems when processing any sufficiently complex VCF data.

Do you have examples (in addition to insertions at the beginning of chromosomes)?

IMHO, the HGVS standard ...

We should move HGVS discussions to #159.

pgrosu commented 10 years ago

In programming languages, the fewer the reserved words, the more systematically the language can be used to build more complex structures. Maybe we can start with a small set of atomic operations and build up all the variations we require from those. Otherwise we're either too general or cannot encompass the possibilities that others might deem important.

bioinformed commented 10 years ago

Do you have examples (in addition to insertions at the beginning of chromosomes)?

Here is one example:

CHROM	POS	REF	ALT	[…]	S1
Z	1	A	T	...	1/1
Z	1	A	AG	...	0/1

There are two problems:

The left padding creates overlapping (yet valid) records. Many VCF processing tools will handle these records incorrectly.
The second record could be interpreted as making a reference assertion at position 1, which would be incorrect for sample S1 who is homozygous for an alternative allele at position 1.
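
The first problem can be illustrated with a short Python sketch. `vcf_span` and `overlaps` are hypothetical helper names, and the intervals are the two records from the example above, once as written in VCF and once with the padding base stripped.

```python
def vcf_span(pos: int, ref: str) -> tuple[int, int]:
    """Interbase interval covered by the REF allele of a 1-based VCF record."""
    return pos - 1, pos - 1 + len(ref)


def overlaps(a: tuple[int, int], b: tuple[int, int]) -> bool:
    """True if two half-open intervals intersect. An empty interval (an
    insertion point) counts only when it lies strictly inside the other."""
    return a[0] < b[1] and b[0] < a[1]


# As written in the VCF: both records claim reference base 1.
padded_overlap = overlaps(vcf_span(1, "A"), vcf_span(1, "A"))

# After stripping the padding: SNV on [0, 1), insertion at interbase point 1.
stripped_overlap = overlaps((0, 1), (1, 1))

print(padded_overlap, stripped_overlap)
```

This only shows that the overlap in this particular example comes from the padding base; it says nothing about overlapping variants in general.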

haussler commented 10 years ago

I think we have to distinguish between what we have settled on as a standard abstract machine-readable representation scheme, which is as Richard describes, and a widely used human-readable text notation, the one getting the most attention from GA4GH being the HGVS nomenclature. A few of us attended the HGVS nomenclature meeting at ASHG last week, including the speaker Johan den Dunnen and lead developer Peter Taschner, cced. The perspective we came to is simply that computers and people respond best to different data representations. The best way forward is to define a standard abstract machine-readable representation scheme for computers and build tools to translate back and forth between that and a widely used human-readable text notation like HGVS. In the process of creating these tools and establishing that they are semantically consistent (in main part by actually encoding "hard to represent" genetic changes given by actual genetic examples), we will learn a lot.

We further agreed at the GA4GH meeting that people would send these "hard to represent" genetic changes to Kevin Jacobs, whose email I don't have handy (ccing Justin Zook for this). Let's contact Kevin and see if he has received anything, and if not, ping folks. -D

PS: One concrete test we discussed is

Taking a reference DNA sequence R and a DNA sequence A that is an alternative version of R, and representing the variants in A relative to R as a set V of changes.
Then taking V and using it to convert R into an alternate DNA sequence, which should be A

Simple, but important to check that this works in all cases. Similar tests can include translation from one format to the other, etc.
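
A rough sketch of this round-trip test in Python, under several assumptions: variants are (start, stop, alt) tuples in interbase coordinates, `apply_variants` and `diff_single` are hypothetical helpers, and the "diff" is deliberately naive (it describes A as at most one replacement by trimming the shared prefix and suffix).

```python
def apply_variants(ref: str, variants: list[tuple[int, int, str]]) -> str:
    """Apply non-overlapping (start, stop, alt) interbase edits to ref."""
    out = []
    cursor = 0
    for start, stop, alt in sorted(variants):
        if start < cursor:
            raise ValueError("overlapping variants")
        out.append(ref[cursor:start])  # untouched reference up to the edit
        out.append(alt)                # replacement (may be empty)
        cursor = stop
    out.append(ref[cursor:])
    return "".join(out)


def diff_single(ref: str, alt_seq: str) -> list[tuple[int, int, str]]:
    """Describe alt_seq relative to ref as at most one replacement."""
    i = 0
    while i < len(ref) and i < len(alt_seq) and ref[i] == alt_seq[i]:
        i += 1
    j = 0
    while j < len(ref) - i and j < len(alt_seq) - i and ref[-1 - j] == alt_seq[-1 - j]:
        j += 1
    if i == len(ref) == len(alt_seq):
        return []  # sequences are identical
    return [(i, len(ref) - j, alt_seq[i:len(alt_seq) - j])]


R = "ACGTACGT"
cases = ["ACTTTACGT", "TTACGTACGT", "ACGT", R]
round_trips = [apply_variants(R, diff_single(R, A)) == A for A in cases]
print(round_trips)
```

A real test harness would swap `diff_single` for the representation under test (VCF, HGVS, or a graph model) and keep the same final comparison.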

bioinformed commented 10 years ago

Let's contact Kevin and see if he has received anything, and if not, ping folks. -D

Hi, David. I'm here and participating! My email address is jacobs@bioinformed.com.

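Justin, Benedict, and Steve have come up with a Google form (https://docs.google.com/forms/d/1ou6Ozdc6M28gHSo-nn_XpHwbRspVRfhii-0MdJHt57w/viewform) to collect VCF comparison oddballs and hairballs that populates this spreadsheet (https://docs.google.com/spreadsheets/d/1FQfq6EGnNohjSa44Rgs2lmV9W8hij8X_Dj2l6mMFqPU/edit#gid=347931999). The plan is to start advertising this resource on the next WG call.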

cassiedoll commented 10 years ago

Can someone make a pull request that enshrines these conventions for start/end/ref/alt in the comments of our avro file?

Then we'll have a good place to point anyone who has questions.

lh3 commented 10 years ago

@bioinformed Your example shows two overlapping variants. The preferred VCF should have one line with REF=A, ALT=T,TG and GT=1/2 instead. Inconsistencies between overlapping variants are inherent to all edit-based approaches. Stripping the prefix happens to remove the overlap in your example, but it does not solve the general problem. IMO, a well-formed VCF should contain no overlapping records. This is true for HaplotypeCaller and freebayes, I believe. In all, I don't think prefix is to blame here.

@cassiedoll Personally, I'd prefer to just state clearly what we have in GA4GH/VCF. Explaining alternatives would be very lengthy and might confuse less experienced users. Except HGVS, these alternatives are largely in-house and much less used.

haussler commented 10 years ago

Excellent. Thanks Kevin! Can I make a request that we add to the Google form (or create a second Google form with) an option so that one can upload a set of DNA sequences that embodies a variant they think is hard to represent relative to a specified set of segments in a reference genome? This will help bring in more "hard to represent" examples. -D

PS: You could start with just uploading examples consisting of one alternate DNA segment relative to a given reference segment, but in cases like reciprocal translocations, you can't represent the changes to the two breakpoint regions using just one DNA segment. That said, if you want to start with just one reference DNA segment and one alternate DNA segment, that would be fine.

So great to get rolling on this! -D

haussler commented 10 years ago

PPS: To input an example consisting of a heterozygous diploid variant relative to a single reference segment, two variant DNA sequences would be provided: one identical to the reference and one changed relative to the reference.

antbro commented 10 years ago

"One concrete test we discussed is..." Please note that HGVS nomenclature includes synonyms, so perhaps include examples of these in this 'concrete test'. There are also many variants that cannot be represented in HGVS format. I'll chase down examples of these.

cassiedoll commented 10 years ago

@lh3 - sorry for being unclear, that's what I meant. If you just look at the avro/docs, we don't say anything about how you represent an indel/deletion/etc - which just seems like an oversight :)

haussler commented 10 years ago

Yes, we discussed a multi-step test like this:

Input an HGVS variant set V expressed relative to a given reference DNA sequence R.
Compute the corresponding alternate DNA sequence A.
Given A and R, compute the corresponding HGVS canonical representation V' of the difference between A and R. Note that V' may not equal V, since synonyms are allowed in HGVS. It is claimed that V' and V should be two equivalent representations. To check this further:
Given reference R and the set of HGVS changes V', compute the alternate DNA sequence A'. Verify that A' = A.

-D

pgrosu commented 10 years ago

@cassiedoll - this is great! After the collection of examples, I think we can build a fundamental set of atomic language constructs with simple rules to build up all these cases. As noted by @haussler, these can very easily form tests.

richarddurbin commented 10 years ago

I support @lh3 on this. I don't think that the so-called prefix in VCF creates problems - I actually see it as removing problems by providing very clean replacement semantics. Note that there is no problem with VCF representing an insertion at the start of the chromosome. You simply replace the first base by a string with the new sequence followed by the first base. e.g. an insertion of TT before start of chromosome base A is given by

CHR1    1   A   TTA

VCF as a format does not specify that the replaced base is at the start of the replacement string - that is just a convention to make the representation canonical.
This is why I wrote "so-called prefix" above. But it is clear that there is only one way to represent a start-of-chromosome insertion with a minimal replacement, so that must be canonical.
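
The replacement semantics can be sketched in Python as follows. `apply_vcf_record` is a hypothetical helper and "ACGT" is a made-up start of chromosome, used only to show that the record above needs no special case.

```python
def apply_vcf_record(reference: str, pos: int, ref: str, alt: str) -> str:
    """Replace REF (starting at 1-based POS) by ALT in the reference string."""
    start = pos - 1
    if reference[start:start + len(ref)] != ref:
        raise ValueError("REF does not match the reference sequence")
    return reference[:start] + alt + reference[start + len(ref):]


chromosome_start = "ACGT"  # toy sequence, assumed for illustration
result = apply_vcf_record(chromosome_start, 1, "A", "TTA")
print(result)
```

The inserted TT ends up before the original first base, exactly as the record describes, and the function never has to know whether the record is an insertion, a deletion or a substitution.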

It is also true that the VCF model requires that overlapping variants are merged. This is what makes it messy to merge VCF files.
By the way, in the monoallelic representation we would have alleles "REF", "T then second base" and "TG then second base" and the individual would have allele count 0 for the REF allele and 1 for the other two alleles. There are no merging problems in the monoallelic representation.

Richard

lh3 commented 10 years ago

@cassiedoll I see what you mean now. The current description in avro is reasonably accurate and expressive about our intended representation. At least I do not have anything to add for now.

pgrosu commented 10 years ago

@richarddurbin and @lh3 - so should we have a VCF checker that validates it before reading it into the schema?

richarddurbin commented 10 years ago

There is a VCF validator. I am copying Petr Danecek who should be able to point you to it.
This should be a GA4GH file formats tool. Maybe you can find it linked from the VCF specification page.

Richard

haussler commented 10 years ago

Good to have examples that cannot be represented in VCF or HGVS too. Thanks for collecting these!

As we discussed at the GA4GH meeting, it goes the other way too. Since they do not require full representation of phasing, in the diploid case VCF and HGVS unphased representations don't usually semantically correspond to a single pair of alternate DNA sequences relative to the reference. Semantically, they correspond to a set of possible alternate diploid configurations obtained by listing all possible phasings. Other issues can complicate this further. Since the set of possible "allowable" DNA interpretations of a VCF or HGVS file/string can be very large, Kevin has written code that samples it to test equivalence.

Kevin, it is your call how soon you want to get into this sampling part on the testing software end, but I would say that even if at the start we just explore a large set of simplified examples where there is just one DNA representation to consider when you are given a reference and a VCF or HGVS file, we would still learn a lot. -D
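
To illustrate why an unphased diploid call set corresponds to a set of configurations rather than a single pair of sequences, here is a small Python sketch. It is not Kevin's sampling code: `apply_edits` and `haplotype_pairs` are hypothetical helpers, variants are (start, stop, alt) interbase tuples, all variants are assumed heterozygous and non-overlapping, and every phasing is enumerated exhaustively (which is only feasible for a handful of variants).

```python
from itertools import product


def apply_edits(ref: str, variants: list[tuple[int, int, str]]) -> str:
    """Apply non-overlapping (start, stop, alt) interbase edits to ref."""
    out, cursor = [], 0
    for start, stop, alt in sorted(variants):
        out.append(ref[cursor:start] + alt)
        cursor = stop
    return "".join(out) + ref[cursor:]


def haplotype_pairs(ref: str, het_variants: list[tuple[int, int, str]]) -> set[tuple[str, str]]:
    """Enumerate the distinct unordered haplotype pairs consistent with a
    list of unphased heterozygous variants."""
    pairs = set()
    for assignment in product((0, 1), repeat=len(het_variants)):
        haps: list[list[tuple[int, int, str]]] = [[], []]
        for which, variant in zip(assignment, het_variants):
            haps[which].append(variant)  # place the variant on one haplotype
        a, b = (apply_edits(ref, h) for h in haps)
        pairs.add(tuple(sorted((a, b))))  # haplotype order is irrelevant
    return pairs


reference = "ACGTAC"
two = haplotype_pairs(reference, [(1, 2, "T"), (4, 5, "G")])
three = haplotype_pairs(reference, [(0, 1, "C"), (1, 2, "T"), (4, 5, "G")])
print(len(two), len(three))
```

The count doubles with each additional heterozygous site in this toy setting, which is why sampling rather than enumeration becomes necessary for realistic inputs.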

haussler commented 10 years ago

The sooner we start using and testing the monoallelic representation alongside VCF and HGVS the better. -D

bioinformed commented 10 years ago

@richarddurbin, @lh3: As I read it, the VCF spec does not require that overlapping records are merged. Do you mean something different by the "VCF model"?

Here is the only text I can find in the VCF 4.2 spec on the topic:

POS - position: The reference position, with the 1st base having position 1. Positions are sorted numerically, in increasing order, within each reference sequence CHROM. It is permitted to have multiple records with the same POS.

One nit in the wording: If multiple records can have the same POS then they're technically required to be in non-decreasing order. This also opens a loophole where there is no canonical total ordering for VCF records (modulo contig order).
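
One way to close that loophole in one's own tooling is to break ties explicitly. The Python sketch below is an assumption, not something the VCF spec prescribes: it orders records by contig, then POS, then REF, then ALT, with a hypothetical `contig_order` mapping standing in for the order given in a file header.

```python
# Hypothetical contig order, e.g. as it might be read from a VCF header.
contig_order = {"Y": 0, "Z": 1}

records = [
    ("Z", 1, "A", "AG"),
    ("Z", 1, "A", "T"),
    ("Y", 5, "C", "G"),
]


def sort_key(record: tuple[str, int, str, str]) -> tuple[int, int, str, str]:
    """Total-order key: contig rank, POS, then REF and ALT as tie-breakers."""
    chrom, pos, ref, alt = record
    return contig_order[chrom], pos, ref, alt


ordered = sorted(records, key=sort_key)
print(ordered)
```

Any deterministic tie-breaker would do; the point is only that records sharing a POS get a reproducible order.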

That said, I agree that padding doesn't create a fundamental problem with the VCF model, per se. I still assert that any program attempting to interpret complex variation should strip extraneous reference bases when attempting to model complex variation to avoid making potentially invalid reference assertions. This is why I believe that APIs presenting models of variants based on a reference interval and substituted alleles should support both null reference ranges (insertions) and null alternative alleles (deletions).

@richarddurbin: I'm joining the conversation late and haven't been able to find the monoallelic representation proposal. Can you please point me in the right direction?

bioinformed commented 10 years ago

@haussler: Another of my projects is to validate that all HGVS representations in HGMD Pro and Clinvar match the VCF representation of the same variants and vice versa. I'm using my colleague Reece Hart's UTA (https://pypi.python.org/pypi/uta/0.1.8) and HGVS (https://pypi.python.org/pypi/hgvs) infrastructure. This round of testing doesn't include complex phased haplotypes, but that is the next logical step.

haussler commented 10 years ago

Peter can you share your tools for validating HGVS? I could not find them at http://www.hgvs.org/mutnomen/

lh3 commented 10 years ago

@bioinformed VCF does not specify whether it allows overlapping variants. I was saying that overlapping variants may lead to inconsistencies no matter whether you have prefix or not. The problem is not prefix, so it is not necessary to strip it. Allowing null alleles will make it difficult to export the GA4GH representation to VCF and add unnecessary complexity to make analysis code work with both cases. I think for edit-based representation, GA4GH should just stick with VCF.

pgrosu commented 10 years ago

@richarddurbin - Thank you, yes, vcf-tools are very helpful - almost forgot about the validator :) I guess I intended to say that before the "variant call" file gets loaded into our schema it performs checks and reports back recommendations. For instance - like @lh3 mentioned - if it finds overlapping variants, it will ask the user to reformat the file or it will provide recommendations. This step can have multiple validation stages, before it would go into a GA4GH repository. If so, based on the examples we collect here, we can form all the checks as part of GA4GH. I feel integrating already publicly available tools will speed up our process here.

pd3 commented 10 years ago

@pgrosu @richarddurbin The validator from vcf-tools may not be the best choice: it's written in Perl and may be too slow for this purpose. I understand the motivation here is to check for overlapping variants?

haussler commented 10 years ago

Dear David,

You can find the Mutalyzer suite at https://mutalyzer.nl. We have published an extended Backus-Naur form of the HGVS nomenclature syntax in 2011 in BMC Bioinformatics (doi: 10.1186/1471-2105-12-S4-S5). The EBNF is used to generate Mutalyzer’s HGVS syntax parser.

Best regards,

Peter

pgrosu commented 10 years ago

@pd3, overlapping variants would be one case, which also got continued in another discussion in #169. But we are trying to brainstorm different "hard to represent" genetic changes that we might encounter and how to best handle them. I initially brought up validation as a checkpoint to determine whether such a file contains genetic changes that might be better represented another way, or whether we want to perform specific checks before the data goes into a GA4GH repository, based on the best practices we intend for data representation in our schema.

reece commented 10 years ago

I'm coming to the conversation late. A few comments on the thread:

Interbase is the only coordinate system being discussed that can represent all of the major edit types without corner cases. Although equal numerically to 0-based, right open, base coordinates, interbase is conceptually much cleaner. The biggest issue is that it forces implementations to think in terms of intervals throughout the code.
HGVS, a human readable syntax for variants, should be kept far away from the backend representation. I would not use it internally or for database representation. I personally think about this just like people think about utf-8 -- encode/decode at the IO boundaries and use unicode internally, everywhere.
I like @haussler's invertible operation demonstration of correctness. As David and I discussed at the HGVS meeting, the invertibility needs to be tested under an equivalence function that accounts for canonicalization. Also, I would think about this on the underlying representation (a graph, preferably) rather than in HGVS because some HGVS operations are lossy and therefore not invertible.
FWIW, the HGVS code (http://bitbucket.org/hgvs/hgvs) has an experimental script that adds an HGVS info field to a VCF.
I'll continue HGVS-specific comments in #159.

mlin commented 10 years ago

Just in case it may come in handy for anyone, here's a schematic by @asimenos we use to illustrate interbase coordinates (at least I hope we're referring to the same thing :) Taken from https://wiki.dnanexus.com/Types/gri

[Image: schematic of interbase coordinates]

reece commented 10 years ago

In the same vein, here's a Google spreadsheet that has a bunch of coordinate system and mapping examples that I put together when we were working on HGVS variant mapping. http://goo.gl/b1nUxl

skeenan commented 9 years ago

This issue has had a lot of discussion. It would be great to hear final comment on which standard GA4GH has settled on for deletion alleles and start/stop coordinates.

ga4gh / ga4gh-schemas

basic standardisation - deletion alleles and start/stop coordinates #168