SteampunkIslande closed this issue 6 days ago.
Hi @SteampunkIslande, thank you for your feedback.
The bottleneck here is that Zenodo allows only 50 GB per upload, so I can't include all populations in the preprocessed file.
Best, Kalin
Thanks for your quick answer! I might try processing my own files using the raw ones that are publicly available...
Also, I was able to turn the 20 GB SQLite3 database file for gnomAD 4.1 (hg38) into an 8 GB Parquet file using this simple code snippet:
import argparse

import duckdb as db

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Convert SQLite to Parquet")
    parser.add_argument("sqlite", type=str, help="SQLite file")
    parser.add_argument("parquet", type=str, help="Parquet file")
    args = parser.parse_args()

    # load DuckDB's SQLite scanner and attach the database
    db.sql("INSTALL sqlite")
    db.sql("LOAD sqlite")
    db.sql(f"ATTACH '{args.sqlite}' AS db (TYPE sqlite)")
    db.sql("USE db")

    # gnomAD_DB stores its data in a single table called gnomad_db;
    # adjust the table name if your file differs
    db.sql(f"COPY (SELECT * FROM gnomad_db) TO '{args.parquet}' (FORMAT 'parquet')")
Just save it to sqlite2parquet.py and run it with:
python sqlite2parquet.py gnomad_wes_v4_hg38.sqlite3 gnomad_wes_v4_hg38.parquet
Querying such files is a breeze with DuckDB, Polars, or any Apache Arrow library in your favorite language (even though Python rules them all).
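For example, here is a minimal sketch of a DuckDB query against the converted file; the column names (chrom, pos, ref, alt, AF) are assumptions on my part, so check gnomad_columns.yaml for the columns actually present:

import duckdb

# DuckDB reads only the row groups and columns it needs from the Parquet file.
# The column names below (chrom, pos, ref, alt, AF) are assumptions --
# see gnomad_columns.yaml for what is actually stored.
result = duckdb.sql(
    """
    SELECT chrom, pos, ref, alt, AF
    FROM 'gnomad_wes_v4_hg38.parquet'
    WHERE chrom = '21' AND pos BETWEEN 9825790 AND 9825800
    """
).df()
print(result)

Polars can do the equivalent lazily with pl.scan_parquet("gnomad_wes_v4_hg38.parquet").filter(...).collect().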
Hi! First of all, thanks for your awesome resource :)
I was able to download wes_gnomad data for version 4 in hg38; however, I'd like to know why you didn't include all the populations in the database file:
https://github.com/KalinNonchev/gnomAD_DB/blob/a72b8fee7e9078c03d8d4b4e13341c03d401ead3/gnomad_db/pkgdata/gnomad_columns.yaml
I would like to have them all without downloading the original VCFs from gnomAD, but I'm not sure how I could do this.
Thank you for your time!
Charles