KalinNonchev / gnomAD_DB

This package converts the huge gnomAD files into a SQLite database that is easy and fast to query. It extracts the minor allele frequency of each variant from a gnomAD VCF.
MIT License

Gnomad v4 all populations #36

Closed SteampunkIslande closed 6 days ago

SteampunkIslande commented 2 weeks ago

Hi! First of all, thanks for your awesome resource :)

I was able to download the wes_gnomad data for version 4 in hg38. However, I'd like to know why you didn't include all the populations in the database file:

https://github.com/KalinNonchev/gnomAD_DB/blob/a72b8fee7e9078c03d8d4b4e13341c03d401ead3/gnomad_db/pkgdata/gnomad_columns.yaml

I would like to have them all without downloading the original VCFs from gnomAD, but I'm not sure how I could do this.

Thank you for your time!

Charles

KalinNonchev commented 2 weeks ago

Hi @SteampunkIslande, thank you for your feedback.

The bottleneck here is that Zenodo allows uploads of only 50 GB, so I can't include all populations in the preprocessed file.

Best, Kalin

SteampunkIslande commented 2 weeks ago

Thanks for your quick answer! I might try processing my own files using the raw ones that are publicly available...

Also, I was able to turn the 20 GB sqlite3 database file for gnomAD 4.1 hg38 into an 8 GB Parquet file, using this simple code snippet:

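import argparse

import duckdb as db

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Convert a SQLite table to Parquet")
    parser.add_argument("sqlite", type=str, help="SQLite file")
    parser.add_argument("parquet", type=str, help="Parquet file")
    # Assumption: the variants live in a table named "gnomad_db"; override if yours differs.
    parser.add_argument("--table", type=str, default="gnomad_db", help="Table to export")
    args = parser.parse_args()

    # The sqlite extension lets DuckDB read SQLite files directly.
    db.sql("INSTALL sqlite")
    db.sql("LOAD sqlite")
    db.sql(f"ATTACH '{args.sqlite}' AS db (TYPE sqlite)")
    db.sql("USE db")
    # COPY needs a table as its source, not the attached database alias.
    db.sql(f"COPY (SELECT * FROM {args.table}) TO '{args.parquet}' (FORMAT 'parquet')")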

Just save it to sqlite2parquet.py and run it with:

python sqlite2parquet.py gnomad_wes_v4_hg38.sqlite3 gnomad_wes_v4_hg38.parquet
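
The export has to read from a table inside the SQLite file, so it can help to check which tables the file contains first. Here is a minimal sketch using only Python's standard library; the helper name `list_tables` is made up for illustration, and I'm assuming (not guaranteeing) that a gnomAD_DB file contains a table called `gnomad_db`:

```python
import sqlite3


def list_tables(sqlite_path):
    """Return the names of all tables in a SQLite file, sorted alphabetically."""
    con = sqlite3.connect(sqlite_path)
    try:
        # sqlite_master is SQLite's built-in schema catalog.
        rows = con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        con.close()
    return [name for (name,) in rows]


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        print(list_tables(sys.argv[1]))
```

Saved as, say, list_tables.py, it can be pointed at the same gnomad_wes_v4_hg38.sqlite3 file as above.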

Querying such files is a breeze using DuckDB, Polars or any Apache Arrow library in your favorite language (even though Python rules them all).