BRCAChallenge / brca-exchange

Overall management and deployment of the BRCA Exchange web portal and pipeline scripts
http://brcaexchange.org
28 stars 32 forks source link

Missing variant from gnomAD #1084

Open melissacline opened 5 years ago

melissacline commented 5 years ago

Here is a strange case of a missing gnomAD variant, 17-41228120-T-C. It appears in the gnomAD browser (https://gnomad.broadinstitute.org/variant/17-41228120-T-C?dataset=gnomad_r2_1_non_cancer), but not in BRCA Exchange. This variant is in 1000 Genomes, which was absorbed into ExAC, which in turn was absorbed into gnomAD. BRCA Exchange shows the 1000 Genomes variant (http://brcaexchange-dev.gi.ucsc.edu:9003/variant/220362) but does not show it being in ExAC, nor gnomAD.

Michael Parsons notes: I think this matching error with HGVS c. is the problem: if you look at BRCA1 c.4358-10C>T, there is data for ExAC but not gnomAD (and I have it in my gnomAD list with incorrect nomenclature from them), whereas c.528G>A it is matching fine. I’m pretty sure gnomAD mapped to the transcript including exon 13a, so everything before that will look fine, but everything after the middle of intron 13 will not match correctly.

In other words, for the gnomAD variants, the cDNA hgvs strings should not be trusted. The hg37 genomic coordinates can be trusted.

FWIW, the variant that Michael refers to (http://brcaexchange-dev.gi.ucsc.edu:9003/variant/229514) also has data from ExAC but not gnomAD, but seems to be a different variant than the other. This variant (the one with a report from ExAC) also shows a report from 1000 Genomes, the ExAC and 1000 Genomes reports show similar bar graphs, and the 1000 Genomes report for the other variant shows a completely different bar graph.

zfisch commented 5 years ago

@melissacline

We're actually not getting anything regarding 17-41228120-T-C from our API query. Maybe this is an issue with the query or with GnomAD itself?

Also, could you clarify this point?

In other words, for the gnomAD variants, the cDNA hgvs strings should not be trusted. The hg37 genomic coordinates can be trusted.

Do you mean the hgvsC values that GnomAD provides? We're actually using that value currently to construct the cDNA value and Genomic Coordinates we're storing in our database.

zfisch commented 5 years ago

If it's helpful, here's an example of the data we get for a given variant:

{
    u'xpos': 13032972957,
    u'hgvsp': None,
    u'chrom': u'13',
    u'consequence_in_canonical_transcript': True,
    u'pos': 32972957,
    u'hgvsc': u'c.*50A>G',
    u'rsid': u'rs761312704',
    u'flags': [],
    u'variantId': u'13-32972957-A-G',
    u'hgvs': u'c.*50A>G',
    u'consequence': u'3_prime_UTR_variant',
    u'exome':
        {
            u'ac': 0,
            u'af': 0,
            u'ac_hemi': 0,
            u'an': 222332,
            u'filters': [u'AC0', u'RF'],
            u'populations': [
                {u'ac_hom': 0, u'ac': 0, u'ac_hemi': 0, u'id': u'AFR', u'an': 13110},
                {u'ac_hom': 0, u'ac': 0, u'ac_hemi': 0, u'id': u'AMR', u'an': 32810},
                {u'ac_hom': 0, u'ac': 0, u'ac_hemi': 0, u'id': u'ASJ', u'an': 9204},
                {u'ac_hom': 0, u'ac': 0, u'ac_hemi': 0, u'id': u'EAS', u'an': 16960},
                {u'ac_hom': 0, u'ac': 0, u'ac_hemi': 0, u'id': u'FIN', u'an': 18522},
                {u'ac_hom': 0, u'ac': 0, u'ac_hemi': 0, u'id': u'NFE', u'an': 97582},
                {u'ac_hom': 0, u'ac': 0, u'ac_hemi': 0, u'id': u'OTH', u'an': 5296},
                {u'ac_hom': 0, u'ac': 0, u'ac_hemi': 0, u'id': u'SAS', u'an': 28848}
            ],
            u'ac_hom': 0
        },
    u'alt': u'G',
    u'ref': u'A',
    u'genome': None
}
melissacline commented 5 years ago

I'll be looking into this first thing on Monday morning...

melissacline commented 5 years ago

Oh gosh, this is a mess!

This file contains data for the BRCA1 variants that both BRCA Exchange and gnomAD say are in gnomAD. The BRCA Exchange data came from the dev sever, via the download link above the Variants table. I extracted the rows for which Variant_in_GnomAD is 't' and Gene_Symbol is BRCA1. From those rows, I extracted the Hg37_Start and HGVS_cDNA columns, and sorted them by hg37 start coordinate.

The gnomAD data came from the CSV link at https://gnomad.broadinstitute.org/gene/ENSG00000012048?dataset=gnomad_r2_1_non_cancer. I extracted the Pos and Transcript_conseq columns (which is usually the cDNA HGVS string, it turns out), and sorted them by hg37 position.

Then I did an sdiff of the two files. Here's the result. The four columns are:

  1. The genomic position of the variant in BRCA Exchange
  2. The nucleotide HGVS string in BRCA Exchange
  3. The genomic position of the variant in gnomAD
  4. The HGVS string in gnomAD.

bx.gnomad.pos.hgvs.sdiff.txt

When you look at the sdiff, you see many things: Line 1: all the data matches. Yay! Line 2: BRCA Exchange says it has variant c.65A>T at position 41197630. gnomAD has nothing at this position, it's next variant is at position 41197632. Line 3: In position 41197632, BRCA Exchange has variant c.63C>T. Looking back at Line 2 in gnomAD, it indicates that is has a variant at this position, but it doesn't say what it is. Annoying! Then there are a couple lines that match.
Then a short region where BRCA Exchange reports a number of variants, and gnomAD reports nothing. Then a couple lines of agreement, followed by a stretch of continuous disagreement.

Then finally at line 683, they start to agree again. This is about the edge of the region where Michael says he finds accurate translation of the gnomAD HGVS strings, BRCA1 c.4358-10C>T. The agreement continues, more or less, until Line 2294. (the "less" happens when the gnomAD data has an HGVS string that's blank or a protein HGVS, smaller issue). After Line 2294, it goes haywire again.

Looking at this image from the UCSC Genome Browser, the highlighted region shows where BRCA Exchange and gnomAD are showing consistent data. The RefSeq BRCA1 canonical transcript is the bottom-most in the RefSeq group. We can see that something went wrong with the first (right-most) exon, and then it corrected, and was fine until about Exon 13a, as Michael says. Exon 13A shows up as the small exon that's visible in a minority of the transcripts, just to the left of the shaded region. The gnomAD canonical transcript includes this exon. In whatever way gnomAD derived its HGVS annotations, it seems like they were broken by this exon (and something weird at the 5' end at the far right).

hgt_genome_46aa5_f9ea10.pdf

I could make some guesses as to what's gone wrong with the gnomAD alignments, but the bottom line is that it looks like their HGVS cDNA strings can't be trusted. It looks like the position can be trusted. In fact, I'd be inclined to trust their hg37 positions over ours: theirs show variants evenly scattered across the gene, while ours have a number of strange-looking duplications in the regions where we and gnomAD disagree (for instance, I see three variants allegedly occurring at Position 41277187 in our data).

Oh well, back to the drawing board!

melissacline commented 5 years ago

OK, I see what's going on. The data that comes over the API has HGVS cDNA strings that are consistent with the ENSEMBL canonical transcript, not the transcript that's used in the other data for BRCA1. So if you think of the mapping of HGVS cDNA strings to genomic position, the HGVS strings that we're currently getting from the API pertain to a different transcript, and wouldn't translate accurately to genomic positions in the way we're using them, because we'd need a difterent transcript as the reference. The easiest fix for that is to use that other transcript as reference when translating from gnomAD's cDNA to genomic coordinates.

melissacline commented 5 years ago

@zfisch are you using the hgvs library to translate from hgvs cDNA strings in the gnomAD API output? If so, then you just need to specify a different reference sequence for BRCA1 during the translation. Try ENST00000471181.7, and if that sequence isn't recognized by UTA, then NM_007300.3 should be recognized, and equivalent (I'm verifying that).

melissacline commented 5 years ago

Yes, those two sequences are equivalent. So if UTA doesn't recognize the ENST sequence, the NM_ sequence would be an accurate substitute.

melissacline commented 5 years ago

I've looked at the BRCA2 data, and it looks fine! So once we've gotten past this one translation issue (which I'm thinking should be an easy fix, now that we know we need to make it), we should be in good shape!

zfisch commented 5 years ago

I'll jump back to this after I have the donation work ready for review and report what I find :)