KalinNonchev / gnomAD_DB

This package converts the huge gnomAD files into a SQLite database that is easy and fast to query. It extracts the minor allele frequency for each variant from a gnomAD VCF.
MIT License

Load mitochondrial data #9

Closed: bjmain closed this issue 3 years ago

bjmain commented 3 years ago

Any chance you can update it to handle the mitochondrial data? https://gnomad.broadinstitute.org/downloads#v3-mitochondrial-dna

KalinNonchev commented 3 years ago

Hello @bjmain,

yes, this would be easy. The problem is that the mitochondrial data doesn't have the AF%AF_afr%AF_eas%AF_fin%AF_nfe%AF_asj%AF_oth%AF_popmax columns, but rather AF_hom, AF_het, hap_AF_hom, hap_AF_het, hapmax_AF_hom, hapmax_AF_het, pop_AF_hom, and pop_AF_het. Which columns would you need for your analysis? I would have to extend the database with these columns.
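(As a quick check, one can list which INFO fields the chrM VCF actually declares; a minimal sketch, assuming pysam is installed and the file is available locally:)

import pysam

# Print every INFO field declared in the VCF header together with its description
vcf = pysam.VariantFile("gnomad.genomes.v3.1.sites.chrM.vcf.bgz")
for name, meta in vcf.header.info.items():
    print(name, meta.description)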

bjmain commented 3 years ago

Hi Kalin, I would think we would want these columns in the database (at minimum pop_AF_hom, pop_AF_het, and pop_AN); see the parsing sketch after the list:

INFO=<ID=excluded_AC,Number=1,Type=Integer,Description="Excluded allele count (number of individuals in which the variant was filtered out)">

INFO=<ID=AN,Number=1,Type=Integer,Description="Overall allele number (number of samples with non-missing genotype)">

INFO=<ID=AC_hom,Number=1,Type=Integer,Description="Allele count restricted to variants with a heteroplasmy level >= 0.95">

INFO=<ID=AC_het,Number=1,Type=Integer,Description="Allele count restricted to variants with a heteroplasmy level >= 0.10 and < 0.95">

INFO=<ID=dp_mean,Number=1,Type=Float,Description="Mean depth across all individuals for the site">

INFO=<ID=mq_mean,Number=1,Type=Float,Description="Mean MMQ (median mapping quality) across individuals with a variant for the site">

INFO=<ID=tlod_mean,Number=1,Type=Float,Description="Mean TLOD (Log 10 likelihood ratio score of variant existing versus not existing) across individuals with a variant for the site">

INFO=<ID=pop_AN,Number=1,Type=String,Description="List of overall allele number for each population, population order: ['afr', 'ami', 'amr', 'asj', 'eas', 'fin', 'nfe', 'oth', 'sas', 'mid']">

INFO=<ID=pop_AC_het,Number=1,Type=String,Description="List of AC_het for each population, population order: ['afr', 'ami', 'amr', 'asj', 'eas', 'fin', 'nfe', 'oth', 'sas', 'mid']">

INFO=<ID=pop_AC_hom,Number=1,Type=String,Description="List of AC_hom for each population, population order: ['afr', 'ami', 'amr', 'asj', 'eas', 'fin', 'nfe', 'oth', 'sas', 'mid']">

INFO=<ID=pop_AF_hom,Number=1,Type=String,Description="List of AF_hom for each population, population order: ['afr', 'ami', 'amr', 'asj', 'eas', 'fin', 'nfe', 'oth', 'sas', 'mid']">

INFO=<ID=pop_AF_het,Number=1,Type=String,Description="List of AF_het for each population, population order: ['afr', 'ami', 'amr', 'asj', 'eas', 'fin', 'nfe', 'oth', 'sas', 'mid']">
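Since the pop_* fields are Type=String lists with one value per population, here is a minimal parsing sketch (the comma delimiter and bracket wrapping are assumptions; adjust them to match the actual encoding in the VCF):

# Population order taken from the INFO descriptions above
POPS = ["afr", "ami", "amr", "asj", "eas", "fin", "nfe", "oth", "sas", "mid"]

def parse_pop_field(value):
    # Map one pop_* string onto the population order; assumes comma-separated
    # values, possibly wrapped in brackets
    return dict(zip(POPS, value.strip("[]").split(",")))

# Hypothetical example value
parse_pop_field("[0.001,0.0,0.002,0.0,0.0,0.0,0.003,0.0,0.0,0.0]")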

Best, Brad


KalinNonchev commented 3 years ago

Can you give me more information about your use case and what you want to do? Do you use only the mitochondrial data? This would really help me understand your problem and how we can solve it most efficiently and in a reasonable time.

KalinNonchev commented 3 years ago

I think this would solve your problem.

You can run bcftools from the command line as

bcftools query -f "%CHROM\t%POS\t%REF\t%ALT\t%AF_hom\t%AF_het\t%excluded_AC\t%AN\t%AC_hom\t%AC_het\t%dp_mean\t%mq_mean\t%tlod_mean\t%pop_AN\t%pop_AC_het\t%pop_AC_hom\t%pop_AF_hom\t%pop_AF_het\n" gnomad.genomes.v3.1.sites.chrM.vcf.bgz | gzip > gnomad.genomes.v3.1.sites.chrM.tsv.gz

and then you can load the table as

import pandas as pd

# Load the bcftools output and assign the column names in the same order
# as the fields in the query format string above
chrMgnomad = pd.read_csv("../test_dir/gnomad.genomes.v3.1.sites.chrM.tsv.gz", sep="\t", header=None)
chrMgnomad.columns = ["CHROM", "POS", "REF", "ALT", "AF_hom", "AF_het", "excluded_AC", "AN", "AC_hom", "AC_het", "dp_mean", "mq_mean", "tlod_mean", "pop_AN", "pop_AC_het", "pop_AC_hom", "pop_AF_hom", "pop_AF_het"]
chrMgnomad.head()

After that, you can merge your dataset on CHROM, POS, REF, and ALT. This would be super fast, as you are loading everything in memory. The table also contains only 18,164 variants.
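For example, a minimal merge sketch (my_variants is a hypothetical DataFrame standing in for your dataset, with the same four key columns):

import pandas as pd

# Hypothetical example variants; replace with your own dataset
my_variants = pd.DataFrame({
    "CHROM": ["chrM", "chrM"],
    "POS": [152, 263],
    "REF": ["T", "A"],
    "ALT": ["C", "G"],
})

# Left-join keeps every input variant and attaches the gnomAD chrM annotations
annotated = my_variants.merge(chrMgnomad, on=["CHROM", "POS", "REF", "ALT"], how="left")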

Let me know if this works for you, or what the problem would be in your use case.

Best,

KalinNonchev commented 3 years ago

I am closing this issue as it is not active. Please feel free to reopen it if the issue is not solved.