Denormalize table columns

david4096 commented 8 years ago

To handle positional searches, which are part of a GA4GH variants query, the reference name, start, and end positions from at least one reference should be available. The columns where these data are currently stored are named like Genomic_Coordinate_hg38 and appear as 1:123123.

Splitting these genomic coordinates into columns like Hg_38_Start and Hg_38_End would make GA4GH queries easier to satisfy. The same goes for reference_bases and alternate_bases. Since the current coordinate column provides only the start, the end position (0-based, exclusive) can be derived from the length of the alternate bases.
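To illustrate why separate start and end columns help, here is a minimal sketch of a positional overlap test. It assumes rows store 0-based, half-open intervals; `VariantRow` and `overlaps` are hypothetical names used only for illustration, not part of the GA4GH API or this codebase.

```python
from typing import NamedTuple


class VariantRow(NamedTuple):
    """Hypothetical row shape with the proposed denormalized columns."""
    reference_name: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


def overlaps(row: VariantRow, reference_name: str,
             query_start: int, query_end: int) -> bool:
    """Return True if the row overlaps the half-open query [query_start, query_end)."""
    return (
        row.reference_name == reference_name
        and row.start < query_end
        and row.end > query_start
    )
```

With integer columns the same comparison can be expressed directly in a SQL WHERE clause, which is not possible while the position is embedded in a string.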

https://github.com/BD2KGenomics/ga4gh-integration/issues/20

melissacline commented 8 years ago

Joe, could you add these new columns to the database on beta? They don't need to be added to the display. Come find me if you need any clarification on deriving the content.

strbean commented 8 years ago

To check if I'm following...

Given Genomic_Coordinate_hg38: chr13:g.32388063:C>T

I have Hg_38_Start: chr13:g.32388063 and Hg_38_End: chr13:g.32388064.

If I have Genomic_Coordinate_hg38: chr13:g.32388063:C>TTT, I would have

Hg_38_Start: chr13:g.32388063 and Hg_38_End: chr13:g.32388066.

Look right?

maryjgoldman commented 8 years ago

We should be careful to note if coordinates are open or closed. Both for GA4GH and for us. May make us 1 base off if we're not careful.

melissacline commented 8 years ago

Excellent point! The coordinates we have now are open coordinates. For example, for the variant chr17:g.43127866:GAC>G (representing the deletion of the AC), the A begins at genomic position 43127866 and the C begins at 43127867. The base beginning at genomic position 43127868 is an A, which is not involved in this deletion. I've verified this against the UCSC Genome Browser, for which the first base in the chromosome begins at coordinate 0.

melissacline commented 8 years ago

Scratch that point! David S and I just verified that the coordinate is 1-based.

david4096 commented 8 years ago

It's a good point for us though, we need to subtract 1 to properly serve queries.
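A minimal sketch of that conversion, assuming the stored positions are 1-based and inclusive at both ends and the query side expects 0-based, half-open intervals. The function name is hypothetical.

```python
def one_based_closed_to_zero_based_half_open(first: int, last: int) -> tuple[int, int]:
    """Convert a 1-based inclusive interval [first, last] to 0-based [start, end).

    Only the start shifts by one; the exclusive end equals the inclusive
    1-based last position.
    """
    if first < 1 or last < first:
        raise ValueError(f"invalid 1-based interval: {first}-{last}")
    return first - 1, last
```

For a single base at 1-based position 32388063, this gives a start of 32388062 and an end of 32388063.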

strbean commented 8 years ago

Alright, I'm trying to get this straight mentally.

For insertions and deletions, the genomic nomenclature includes the preceding base (GAC>G), so the Start coordinate is the coordinate of the A base, not the G? And the start coordinate is therefore the coordinate listed in the genomic nomenclature, +1?

For substitutions, there is no preceding base, so the coordinate listed is the same as the Start coordinate.

And we want an exclusive end, which would point to the nucleotide past the last one in our original sequence (the position after C in GAC>G)?

I've added the columns for start and end positions, Hg_38_Start and Hg_38_End (and so forth for previous assemblies), but I think I'm going to need someone to walk me through an approach to populate the data from the genomic nomenclature entries.

melissacline commented 8 years ago

Sadly, that's correct. The genomic coordinate in the HGVS string (which is what that strange notation is) refers to the start of the first base in the reference string, in a 1-based numbering system in which the first base in the genome is at position 1.

When there's a single-base substitution, the number given refers to that single base, in 1-based numbering. When the variant is not a single-base substitution, what the number refers to is actually pretty nonstandard. I suspect the best thing is for us to give you a new input file with the start and end positions, as they're easier to compute with our Python libraries.

david4096 commented 8 years ago

Hi @strbean, thanks for starting this! The idea was to have columns like this:

Genomic_Coordinate_hg38: chr13:g.32388063:C>T becomes three columns:

hg38_reference_name: chr13, hg38_start: 32388062, hg38_end: 32388063. Note I have subtracted one from the position so the start positions are in the GA4GH counting mode. However, we will happily apply a transformation if you choose to store in a 1-based regime.

Having these three columns for each reference allows us to serve the variants in separate variant sets, mapped to each.
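As a rough sketch of the split described above, the following parses a value such as `chr13:g.32388063:C>T` into separate fields. The regular expression, function name, and output keys are assumptions for illustration. In particular, computing the end as start plus the length of the reference allele is an assumed convention rather than something settled in this thread, and as noted above, insertions and deletions may need different handling.

```python
import re

# Assumed input shape: <chrom>:g.<1-based position>:<ref>><alt>
_COORD = re.compile(
    r"^(?P<chrom>[^:]+):g\.(?P<pos>\d+):(?P<ref>[ACGTN]+)>(?P<alt>[ACGTN]+)$"
)


def split_genomic_coordinate(value: str) -> dict:
    """Split a Genomic_Coordinate_* string into denormalized columns."""
    match = _COORD.match(value.strip())
    if match is None:
        raise ValueError(f"unrecognized genomic coordinate: {value!r}")
    ref = match.group("ref")
    start = int(match.group("pos")) - 1  # 1-based position -> 0-based start
    return {
        "reference_name": match.group("chrom"),
        "start": start,
        "end": start + len(ref),  # exclusive end; assumption, see note above
        "reference_bases": ref,
        "alternate_bases": match.group("alt"),
    }
```

For the substitution `chr13:g.32388063:C>T` this yields reference name chr13, start 32388062, and end 32388063, matching the example columns above. Values that do not fit the assumed pattern raise a `ValueError` rather than being silently guessed.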

BD2KGenomics / brca-website-deprecated

Denormalize table columns #120