Open tnavatar opened 9 years ago
The Allele Registry should leverage the experience gained by ClinVar regarding the various minimal data sets of information needed to reasonably convey the allele/variant information. ClinVar has gained a wealth of experience accepting public submissions which should provide us with a sound approach for defining what is needed to register alleles/variants in the Allele Registry. We may need to provide additional choices based on differences in the Allele Model from the ClinVar model.
The ClinVar: Minimal Content paragraph notes
"The minimal data required for submission is ... _a valid variant description (either HGVS or genomic location and change)_ and ..."
Embedded in the ClinVar submission form, the details for the HGVS or genomic location and change are specified more precisely as...
In SubmissionTemplateLite.xlsx, only the HGVS expression is specified, as follows (with examples)
NOTE: our model does not currently support the cytogenetic representation mentioned in the comments.
However, in SubmissionTemplate.xlsx, both the HGVS and Genomic Coordinate types of submission are specified. In this version of the submission spreadsheet, the HGVS expression is separated from the Reference Sequence it is based on, as seen here
ClinVar uses a 1-based numbering approach (like VCF).
These are not the only two forms that may work, but they are a good start for specifying what would be required to submit to the Allele Registry. Any candidate data set for representing an allele can be offered, but each one would need to be validated and specified so that the exceptions and restrictions on its use are clear.
Here are some examples of HGVS and Genome Coordinate style submissions with a draft set of rules.
Type | Ref Seq ID | HGVS | Build ID | Chromosome | Start | Stop | Ref Allele | Alt Allele |
---|---|---|---|---|---|---|---|---|
HGVS | NM_000059.3 | c.8969G>A | | | | | | |
HGVS | NP_000050.2 | p.Trp2990Ter | | | | | | |
GenCoord.1 | | | GRCh38 | 13 | 32953903 | 32953903 | G | A |
GenCoord.2 | NC_000016.10 | | | | 2088236 | 2088237 | CA | - |
GenCoord.1 uses the genome build (assembly) ID and chromosome number to determine the reference sequence ID for the coordinates and ref/alt alleles, while GenCoord.2 uses the chromosome accession directly and so requires neither the genome build nor the chromosome number. The table above therefore illustrates two forms of the GenCoord submission type. (NOTE: ClinVar only specifies the GenCoord.1 style, but I think they will accept both in practice; this needs to be validated.)
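To make the relationship between the two GenCoord forms concrete, here is a rough Python sketch of how a GenCoord.1 row could be normalized to the same record as a GenCoord.2 row. The names `GenomicAllele`, `from_gencoord1`, `from_gencoord2`, and `CHROMOSOME_ACCESSIONS` are hypothetical, and the lookup table is only a small illustrative subset; a real implementation would presumably load the full assembly-to-accession mapping.

```python
from dataclasses import dataclass

# Illustrative subset only: (build ID, chromosome) -> RefSeq chromosome accession.
# Assumption: a real registry would load the complete mapping for each assembly.
CHROMOSOME_ACCESSIONS = {
    ("GRCh38", "13"): "NC_000013.11",
    ("GRCh38", "16"): "NC_000016.10",
    ("GRCh37", "13"): "NC_000013.10",
}


@dataclass(frozen=True)
class GenomicAllele:
    """Hypothetical normalized record shared by both GenCoord forms."""
    ref_seq_id: str
    start: int       # 1-based, inclusive
    stop: int        # 1-based, inclusive
    ref_allele: str
    alt_allele: str


def from_gencoord1(build_id, chromosome, start, stop, ref_allele, alt_allele):
    """GenCoord.1: derive the reference sequence from build ID + chromosome."""
    key = (build_id, str(chromosome))
    if key not in CHROMOSOME_ACCESSIONS:
        raise ValueError(f"unknown build/chromosome combination: {key}")
    return GenomicAllele(CHROMOSOME_ACCESSIONS[key], start, stop,
                         ref_allele, alt_allele)


def from_gencoord2(ref_seq_id, start, stop, ref_allele, alt_allele):
    """GenCoord.2: the chromosome accession is given directly."""
    return GenomicAllele(ref_seq_id, start, stop, ref_allele, alt_allele)
```

With this shape, `from_gencoord1("GRCh38", 16, 2088236, 2088237, "CA", "-")` and `from_gencoord2("NC_000016.10", 2088236, 2088237, "CA", "-")` produce equal records, which is the property that would let the registry treat the two forms as interchangeable.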
Good to know. Given just a genomic location and change, do you know how they deal with different coordinate numbering systems (0-based vs. 1-based)?
(Sent by email in reply to Larry Babb's post of Jul 7, 2015, 4:29 PM.)
All ClinVar submissions are based on 1-based coordinate numbering.
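For anyone converting between conventions, here is a small Python sketch of the arithmetic. It assumes the 1-based form is an inclusive `[start, stop]` range (as in the Start/Stop columns of the draft table) and the 0-based form is half-open `[start, end)`, as in BED-style intervals. The function names are hypothetical, and zero-length intervals (pure insertions) are deliberately not handled.

```python
def one_based_to_zero_based(start, stop):
    """Convert a 1-based inclusive [start, stop] to 0-based half-open [start, end)."""
    if start < 1 or stop < start:
        raise ValueError(f"invalid 1-based range: {start}-{stop}")
    return start - 1, stop


def zero_based_to_one_based(start, end):
    """Convert a 0-based half-open [start, end) to 1-based inclusive [start, stop]."""
    if start < 0 or end <= start:
        raise ValueError(f"invalid 0-based range: {start}-{end}")
    return start + 1, end


# The two-base deletion row from the draft table (Start 2088236, Stop 2088237):
zb_start, zb_end = one_based_to_zero_based(2088236, 2088237)
# In the half-open form, end - start gives the length of the reference allele "CA".
assert zb_end - zb_start == len("CA")
```

Only the start coordinate shifts; the stop value is numerically unchanged because inclusive 1-based and exclusive 0-based ends coincide.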
I added a commit (194f40ee00bedfcbd4fec277e6b0f38a3ae3803f) with a suggested change, feel free to modify or roll back as need be.
The change suggests requiring variants be submitted using one of a discrete list of known and (hopefully) unambiguous formats (like VCF, or our own Allele Model), rather than specifying the minimum number of fields. Presumably the system will also be able to respond to queries with variants formatted according to the same list, annotated with the system's accession for the variant.
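As a rough illustration of the "discrete list of formats" idea, here is a Python sketch in which each submission must name one accepted format and supply exactly that format's fields. The format names and field names are borrowed from the draft table earlier in this thread; `ACCEPTED_FORMATS` and `validate_submission` are hypothetical and are not what the commit implements. Formats such as VCF or the Allele Model could be added as further entries.

```python
# Hypothetical registry: format name -> the exact set of fields it requires.
ACCEPTED_FORMATS = {
    "hgvs": {"ref_seq_id", "hgvs"},
    "gencoord.1": {"build_id", "chromosome", "start", "stop",
                   "ref_allele", "alt_allele"},
    "gencoord.2": {"ref_seq_id", "start", "stop",
                   "ref_allele", "alt_allele"},
}


def validate_submission(fmt, fields):
    """Return the submission's fields if they match the named format exactly.

    Empty strings and None count as "not provided", so a spreadsheet row with
    blank cells for another format's columns still validates.
    """
    if fmt not in ACCEPTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    required = ACCEPTED_FORMATS[fmt]
    provided = {name for name, value in fields.items()
                if value is not None and value != ""}
    missing = required - provided
    unexpected = provided - required
    if missing or unexpected:
        raise ValueError(
            f"{fmt}: missing {sorted(missing)}, unexpected {sorted(unexpected)}")
    return {name: fields[name] for name in sorted(required)}
```

Rejecting unexpected fields as well as missing ones is one way to keep each format unambiguous: a row that mixes HGVS and genomic-coordinate columns is refused rather than guessed at.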