KalinNonchev / gnomAD_DB

This package converts the huge gnomAD files into a SQLite database that is easy and fast to query. It extracts the minor allele frequency for each variant from a gnomAD VCF.
MIT License

Limited fields in the gnomAD SQLite database #22

Closed by brettChapman 11 months ago

brettChapman commented 11 months ago

Hi

I've started querying all the fields in gnomAD_DB using your preprocessed gnomAD SQLite v3.1.2 database. There appear to be only 22 columns in the resulting pandas DataFrame, yet there are many more fields in the gnomAD VCF files. For example, there is no 'AN_non_topmed_asj_XY' field, yet it's in the VCF. Is there a reason why so many were left out?

Thanks.

brettChapman commented 11 months ago

If I wanted to include more, would I just need to generate my own SQLite DB from the raw VCF and modify the code, say in the YML file here? https://github.com/KalinNonchev/gnomAD_DB/blob/master/gnomad_db/pkgdata/gnomad_columns.yaml

brettChapman commented 11 months ago

If I were to update the YAML file with more fields, would the SQLite DB need creating again or would it work with the SQLite database you preprocessed earlier?

KalinNonchev commented 11 months ago

Hello @brettChapman,

re: Is there a reason why so many were left out?

Since I can upload only up to 50 GB to Zenodo, I had to preselect the most commonly used annotations.

re: If I wanted to include more, would I just need to generate my own SQLite DB from the raw VCF and modify the code, say in the YML file here?

Yes, you should just modify the yaml file and include the columns you are interested in.
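To decide which names to add to the YAML file, it can help to list the INFO fields that the VCF actually declares. The following is only a rough sketch using the Python standard library, not part of gnomAD_DB; the helper names `info_field_ids` and `info_field_ids_from_file` are hypothetical, and it assumes the VCF header follows the usual `##INFO=<ID=...>` convention.

```python
import gzip
import re
from typing import Iterable, List

# Matches header lines such as: ##INFO=<ID=AF,Number=A,Type=Float,...>
INFO_ID = re.compile(r"^##INFO=<ID=([^,>]+)")


def info_field_ids(lines: Iterable[str]) -> List[str]:
    """Collect INFO field IDs from VCF header lines, stopping at the first record."""
    ids = []
    for line in lines:
        if not line.startswith("#"):
            break  # first variant record reached, header is over
        match = INFO_ID.match(line)
        if match:
            ids.append(match.group(1))
    return ids


def info_field_ids_from_file(path: str) -> List[str]:
    """Read only the header of a plain or gzip/bgzip-compressed VCF (hypothetical helper)."""
    opener = gzip.open if path.endswith((".gz", ".bgz")) else open
    with opener(path, "rt") as handle:
        return info_field_ids(handle)
```

Calling `info_field_ids_from_file` on a downloaded gnomAD VCF should show whether a field such as `AN_non_topmed_asj_XY` is declared there before you add it to the YAML file.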

re: If I were to update the YAML file with more fields, would the SQLite DB need to be created again or would it work with the SQLite database you preprocessed earlier?

You would have to rebuild the database from the beginning, since the new attributes are not in the preprocessed database.
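A quick way to confirm which requested fields an existing SQLite file lacks is to inspect its table schema. This is a minimal sketch with the standard `sqlite3` module; the function name `missing_columns` is hypothetical, and the table name `gnomad_db` and the file path in the comment are assumptions that should be checked against your own database.

```python
import sqlite3
from typing import Iterable, List


def missing_columns(conn: sqlite3.Connection, table: str, wanted: Iterable[str]) -> List[str]:
    """Return the names in `wanted` that are not columns of `table`."""
    # PRAGMA table_info yields one row per column; index 1 holds the column name.
    rows = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
    existing = {row[1] for row in rows}
    return [name for name in wanted if name not in existing]


# Example usage (path and table name are assumptions):
# conn = sqlite3.connect("gnomad_db.sqlite3")
# print(missing_columns(conn, "gnomad_db", ["AF", "AN_non_topmed_asj_XY"]))
```

Any name this returns is absent from the file, so a database containing it would have to be regenerated from the raw VCF with the updated YAML.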

KalinNonchev commented 11 months ago

Please let me know if you have further questions or comments. Best,

KalinNonchev commented 11 months ago

Please don't hesitate to reopen this GitHub issue if you have any more questions or need further assistance.