KalinNonchev / gnomAD_DB

This package scales the huge gnomAD files to a SQLite database, which is easy and fast to query. It extracts from a gnomAD vcf the minor allele frequency for each variant.
MIT License

gnomad v2.1.1 wgs+wes snp list #12

Closed: thchen86 closed this issue 3 years ago

thchen86 commented 3 years ago

Thanks for wrapping up the data tables; they were very helpful. Do you happen to have a table with a similar format, but contains gnomad v2.1.1 WGS+WES snps? Also, if the gene symbol can be added for each snp, that would be great as well.

Thanks, Devin

KalinNonchev commented 3 years ago

Do you happen to have a table with a similar format, but contains gnomad v2.1.1 WGS+WES snps?

Yes, I am planning to precompute tables for WES v2.1.1 and v3.1.1 in the near future. However, combining them into a single WES+WGS table would not be optimal, because I want to keep the gnomAD data as raw as possible to be FAIR. If you need the tables ASAP, you can run the pipeline on the WES gnomAD VCF, which will create the table.

Also, if the gene symbol can be added for each snp, that would be great as well.

No, this would introduce redundancy and inaccuracies into the data, because a single variant can map to multiple overlapping genes. Another point is that there are always differences between the GENCODE and Ensembl GTF files and their transcript regions. So you can take your favourite GTF file (I recommend the Ensembl version) and extract the variants yourself, like this:

from gnomad_db.database import gnomAD_DB

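# directory that contains the gnomAD SQLite database
database_location = "test_dir"
db = gnomAD_DB(database_location)

# exon_chrom, exon_start and exon_end are the coordinates of one exon taken from your GTF file
db.get_mafs_for_interval(chrom=exon_chrom, interval_start=exon_start, interval_end=exon_end, query="AF")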

and then aggregate them per gene.
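The per-gene aggregation could look roughly like the sketch below. It is only an illustration, not part of gnomad_db: `parse_gtf_exons`, `variants_per_gene` and the `fetch_interval` callback are hypothetical names, and the sketch assumes the GTF attributes column carries a `gene_name` entry (as Ensembl and GENCODE files normally do) and that each variant is a hashable tuple. A variant that falls into overlapping genes is reported once under each gene.

```python
import re
from collections import defaultdict

# Assumption: the attributes column contains an entry such as gene_name "BRCA2";
GENE_NAME = re.compile(r'gene_name "([^"]+)"')


def parse_gtf_exons(lines):
    """Yield (gene, chrom, start, end) for every exon record in GTF text lines."""
    for line in lines:
        if line.startswith("#"):  # skip header/comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        # A GTF record has 9 tab-separated columns; column 3 is the feature type.
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = GENE_NAME.search(fields[8])
        if match:
            yield match.group(1), fields[0], int(fields[3]), int(fields[4])


def variants_per_gene(exons, fetch_interval):
    """Collect the variants of each gene.

    fetch_interval(chrom, start, end) is a hypothetical callback that returns
    hashable variant records for one interval. A set is used because exons of
    different transcripts overlap and would otherwise yield duplicates.
    """
    per_gene = defaultdict(set)
    for gene, chrom, start, end in exons:
        per_gene[gene].update(fetch_interval(chrom, start, end))
    return dict(per_gene)
```

In practice, `fetch_interval` would be a small wrapper around `db.get_mafs_for_interval(...)` that turns each returned row into a tuple; the exact shape of that return value is not shown in this thread, so the wrapper would need to be adapted to it.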

Please let me know if you have further questions.