KalinNonchev / gnomAD_DB

This package scales the huge gnomAD files to a SQLite database, which is easy and fast to query. It extracts from a gnomAD vcf the minor allele frequency for each variant.
MIT License

Discrepancy between allele frequency column (af) and gnomad website #35

Closed. shwivel closed this issue 2 months ago

shwivel commented 2 months ago

I am attempting to use your sqlite table to join a list of variants and determine the allele frequency of each using the current version (for gnomad 4.1).

When you look up a variant by its ID on the gnomad website, the allele frequency shown is based on exomes plus genomes. The website lets you optionally filter to one or the other, but by default it reflects the overall frequency, computed from the combined allele count and allele number across exome and genome data. That is what I would expect. Prior to using your sqlite database, I was using the gnomad api and calculating the frequency as (allele count of exomes + genomes) / (allele number of exomes + genomes); a snippet of php is below for illustration:

    // get genome allele frequency (gaf) from response
    // must calculate composite of genome / exome counts later
    if (isset($response['data']['variant']['genome']['af'])) {
      $gaf = $response['data']['variant']['genome']['af'];
      $gac = $response['data']['variant']['genome']['ac'];
      $gan = $response['data']['variant']['genome']['an'];
    }
    else {
      $gaf = 0;
      $gac = 0;
      $gan = 0;
    }

    // get exome allele frequency (eaf) from response
    // must calculate composite of genome / exome counts later
    if (isset($response['data']['variant']['exome']['af'])) {
      $eaf = $response['data']['variant']['exome']['af'];
      $eac = $response['data']['variant']['exome']['ac'];
      $ean = $response['data']['variant']['exome']['an'];
    }
    else {
      $eaf = 0;
      $eac = 0;
      $ean = 0;
    }

    // calculate composite gnomad pop. frequency
    if ($gan + $ean == 0)
      $taf = 0;
    else
      $taf = ($gac + $eac) / ($gan + $ean);

The difference between the overall allele frequency and just the genome allele frequency can be significant. For example, consider variant ID 19-1037767-G-A. Your sqlite table shows an AF of 0.0511315 as determined by running:

select * from gnomad_db v where chrom = 19 and pos = 1037767 and ref = 'G' and alt = 'A';

However, if you look up this variant on the gnomad website (https://gnomad.broadinstitute.org/variant/19-1037767-G-A?dataset=gnomad_r4), the allele frequency displayed is 0.002785. Of note, the page has an "Include:" filter with the options "Exomes" and "Genomes". By default, both are selected. If you unselect Exomes, you get 0.05113. That frequency is over 18 times greater than the overall frequency, which I believe makes it misleading. Admittedly, I do not work in genetics, but if one were attempting to get a general idea of how frequently the variant occurs in the general population, wouldn't one want to reflect data from both exome and genome sequencing? What would be the reason to filter the exome numbers out?

Would it be possible to add the exome numbers as additional columns so that the overall allele frequency can be identified? The gnomad api can be used to show this information, for example if you use this query:

query
{ variant(variantId: "19-1037767-G-A", dataset: gnomad_r4)
  { variantId rsid genome { af ac an }
    exome { af ac an }
    in_silico_predictors { id value }
    sortedTranscriptConsequences { transcript_id gene_symbol major_consequence hgvs hgvsc polyphen_prediction sift_prediction }
  }
}

At this URL: https://gnomad.broadinstitute.org/api

The output is:

{
  "data": {
    "variant": {
      "variantId": "19-1037767-G-A",
      "rsid": "rs78386506",
      "genome": {
        "af": 0.05113147370019744,
        "ac": 4040,
        "an": 79012
      },
      "exome": {
        "af": 0.00010927430652843933,
        "ac": 156,
        "an": 1427600
      },
... and other output not relevant ...

The overall allele frequency can be calculated in the manner shown earlier.
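As a worked check of that calculation, here is a minimal Python sketch using the counts from the API output above. The function name `combined_af` is only illustrative and is not part of gnomAD_DB or the gnomad api.

```python
def combined_af(genome_ac, genome_an, exome_ac, exome_an):
    """Overall allele frequency from genome and exome counts.

    Mirrors the PHP snippet above: sum the allele counts, sum the
    allele numbers, and return 0 when no alleles were observed at all.
    """
    total_an = genome_an + exome_an
    if total_an == 0:
        return 0.0
    return (genome_ac + exome_ac) / total_an


# Counts for 19-1037767-G-A taken from the API output above.
taf = combined_af(genome_ac=4040, genome_an=79012,
                  exome_ac=156, exome_an=1427600)
print(f"{taf:.6f}")  # 0.002785, matching the website's default view
```

The genome-only value (4040 / 79012, about 0.05113) is dominated by the much larger exome allele number once the two are pooled, which is why the combined figure is so much lower.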

I appreciate your providing this resource and was hoping to use it to substitute my current need to use the gnomad api by looping through variants (much faster with a table having a primary key on chrom/pos/ref/alt) but cannot replicate my results on account of this issue.

Please let me know your thoughts. Thanks

KalinNonchev commented 2 months ago

Hello @shwivel ,

Thank you for your interest and your detailed explanation.

You may have seen that both WGS and WES data are provided as SQLite databases here. This is the raw data that gnomAD provides for download.

You can download both WGS and WES gnomAD and query them at the same time so that you can calculate this shared score. It should be fast enough.
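A minimal sketch of that approach with Python's built-in `sqlite3` module is below. It assumes both database files contain a `gnomad_db` table (the table name used in the query earlier in this thread) with `AC` and `AN` columns next to `chrom`, `pos`, `ref` and `alt`; the column names, the file paths and the helper names `fetch_counts` and `combined_af_from_dbs` are assumptions for illustration, so check them against the actual schema before use.

```python
import sqlite3


def fetch_counts(db_path, chrom, pos, ref, alt):
    """Return (AC, AN) for one variant, or (0, 0) if it is absent.

    Assumes a table named gnomad_db with AC and AN columns.
    """
    con = sqlite3.connect(db_path)
    try:
        row = con.execute(
            "SELECT AC, AN FROM gnomad_db "
            "WHERE chrom = ? AND pos = ? AND ref = ? AND alt = ?",
            (chrom, pos, ref, alt),
        ).fetchone()
    finally:
        con.close()
    if row is None or row[0] is None or row[1] is None:
        return 0, 0
    return row[0], row[1]


def combined_af_from_dbs(wgs_path, wes_path, chrom, pos, ref, alt):
    """Pool allele counts from the WGS and WES databases for one variant."""
    gac, gan = fetch_counts(wgs_path, chrom, pos, ref, alt)
    eac, ean = fetch_counts(wes_path, chrom, pos, ref, alt)
    total_an = gan + ean
    return (gac + eac) / total_an if total_an else 0.0
```

With hypothetical file names, a call would look like `combined_af_from_dbs("gnomad_wgs.sqlite3", "gnomad_wes.sqlite3", "19", 1037767, "G", "A")`. A variant missing from one database contributes zero counts, the same convention as the PHP snippet in the original post.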

Please let me know if you have further questions.

Best,

shwivel commented 2 months ago

Sorry, when I was looking at the downloads list I must have missed the WGS/WES text in the filenames; I thought each subsequent list item was just a different gnomad version/release. Got it. Thanks!