EBISPOT / goci

GWAS Catalog Ontology and Curation Infrastructure
Apache License 2.0

Ingest MVP Data #1425

Open jiyue1214 opened 2 months ago

jiyue1214 commented 2 months ago

MVP data: MVP is an ongoing prospective cohort study and mega‐biobank in the Department of Veterans Affairs Healthcare System designed to study genetic influences on health and disease among veterans. This is the accession to hold publicly available results.

It includes genome-wide associations for 2068 traits from 635,969 participants in the Department of Veterans Affairs Million Veteran Program, a longitudinal study of diverse United States Veterans.

Some useful links:

jiyue1214 commented 1 month ago

Started downloading the files to the folder /hps/nobackup/parkinso/spot/gwas/scratch/MVP_data_from_dbGap

jiyue1214 commented 1 month ago

In progress: comparing the md5sums of the downloaded files against the md5sums provided on the FTP.
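A minimal sketch of this check, assuming the FTP checksums have already been loaded into a dict of file name to md5 hex string. The function names `md5_of_file` and `find_mismatches` are hypothetical and not part of any existing goci tooling.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks.

    Chunked reading keeps memory use flat even for tar files of 100G+.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(download_dir, expected):
    """Return the names of files that are missing locally or whose MD5 differs.

    `expected` maps file name -> MD5 hex string (assumed to come from the FTP).
    """
    mismatched = []
    for name, expected_md5 in sorted(expected.items()):
        local = Path(download_dir) / name
        if not local.is_file() or md5_of_file(local) != expected_md5.lower():
            mismatched.append(name)
    return mismatched
```

Any file name returned by `find_mismatches` would be a candidate for re-download.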

jiyue1214 commented 1 month ago

4 of the 34 tar files have md5 checksums that differ from those on the FTP and need to be re-downloaded (downloaded size - FTP size):
phs002453.MVP_R4.1000G_AGR.GIA.PheCodes_DigestiveSystem_batch1.analysis-PI.MULTI.tar (206G - FTP: 192G)
phs002453.MVP_R4.1000G_AGR.GIA.PheCodes_EndocrineMetabolic_batch1.analysis-PI.MULTI.tar (140G - FTP: 140G)
phs002453.MVP_R4.1000G_AGR.GIA.PheCodes_EndocrineMetabolic_batch2.analysis-PI.MULTI.tar (157G - FTP: 157G)
phs002453.MVP_R4.1000G_AGR.GIA.PheCodes_SenseOrgans_batch2.analysis-PI.MULTI.tar (151G - FTP: 151G)

jiyue1214 commented 1 month ago
The second downloads are correct; the md5 checksums now match the FTP:
md5_second_download file_name md5_from_ftp
bf9d9a19d946032712374797d9835954 phs002453.MVP_R4.1000G_AGR.GIA.PheCodes_SenseOrgans_batch2.analysis-PI.MULTI.tar bf9d9a19d946032712374797d9835954
1e65f85c474f38a9191c98b0012848b9 phs002453.MVP_R4.1000G_AGR.GIA.PheCodes_DigestiveSystem_batch1.analysis-PI.MULTI.tar 1e65f85c474f38a9191c98b0012848b9
2c50f94f813c8f63f861e7da07f2d37f phs002453.MVP_R4.1000G_AGR.GIA.PheCodes_EndocrineMetabolic_batch1.analysis-PI.MULTI.tar 2c50f94f813c8f63f861e7da07f2d37f
dbc686a27c00e1e02833d1c9435d3d46 phs002453.MVP_R4.1000G_AGR.GIA.PheCodes_EndocrineMetabolic_batch2.analysis-PI.MULTI.tar dbc686a27c00e1e02833d1c9435d3d46

Ready for the next step

jiyue1214 commented 3 weeks ago

There are three different types of data content, and each type needs a corresponding JSON file:
MVP_1 (2): SNP_ID chrom pos ref alt ea af num_samples beta sebeta pval q_pval i2 direction
MVP_2 (836): SNP_ID chrom pos ref alt ea af num_samples beta sebeta pval r2 q_pval i2 direction
MVP_3 (5184): SNP_ID chrom pos ref alt ea af num_samples case_af num_cases control_af num_controls or ci pval r2 q_pval i2 direction

Example of each type:
MVP_1: MVP_R4.1000G_AGR.GIA.Labs_batch1/MVP_R4.1000G_AGR.Albumin_Mean_INT.META.GIA.dbGaP.txt.gz
MVP_2: MVP_R4.1000G_AGR.GIA.Labs_batch1/MVP_R4.1000G_AGR.A1C_Max_INT.AFR.GIA.dbGaP.txt.gz
MVP_3: MVP_R4.1000G_AGR.GIA.PheCodes_CirculatorySystem_batch1/MVP_R4.1000G_AGR.Phe_394.AFR.GIA.dbGaP.txt.gz
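Sorting files into the three types could be done by reading each header line, roughly as sketched here. The column lists are copied from this comment; the tab delimiter and the names `LAYOUTS`, `classify_header` and `classify_file` are assumptions for illustration only.

```python
import gzip

# Column layouts as listed in this comment.
LAYOUTS = {
    "MVP_1": ("SNP_ID chrom pos ref alt ea af num_samples "
              "beta sebeta pval q_pval i2 direction").split(),
    "MVP_2": ("SNP_ID chrom pos ref alt ea af num_samples "
              "beta sebeta pval r2 q_pval i2 direction").split(),
    "MVP_3": ("SNP_ID chrom pos ref alt ea af num_samples "
              "case_af num_cases control_af num_controls or ci "
              "pval r2 q_pval i2 direction").split(),
}


def classify_header(columns):
    """Return the layout name whose columns match exactly, or None."""
    for name, layout in LAYOUTS.items():
        if list(columns) == layout:
            return name
    return None


def classify_file(path, delimiter="\t"):
    """Classify a gzipped text file by its first line.

    The tab delimiter is an assumption; adjust it if the files differ.
    """
    with gzip.open(path, "rt") as handle:
        header = handle.readline().rstrip("\r\n").split(delimiter)
    return classify_header(header)
```

A file that returns `None` has a header matching none of the three layouts and would need a manual look before a JSON file is assigned.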

MVP_3 does not have a standard error column, so it cannot pass validation. We can still submit the MVP_3 files via our system; they will be labelled as non-GWAS-SSF, and we will need to change them to pre-GWAS-SSF later for harmonisation.

jiyue1214 commented 3 weeks ago

The MVP data have another issue: some base_pair_location values are missing. These rows do have an rsid, so they can still be harmonised, but they will not pass validation.

Starting to format the data.
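To get a feel for how many rows are affected, a rough count along these lines might help. It assumes the formatted files use the `base_pair_location` and `rsid` column names mentioned here, that missing values appear as an empty string, `NA`, `#NA` or `.`, and that valid rsids start with `rs`. All of these are assumptions, and `summarise_positions` is a hypothetical helper.

```python
import csv
import io

# Assumed encodings of a missing value; adjust to match the real files.
MISSING = {"", "NA", "#NA", "."}


def summarise_positions(rows):
    """Count rows by whether they can be located.

    `rows` is an iterable of dicts, e.g. from csv.DictReader.
    - ok: base_pair_location is present
    - rsid_only: position missing, but an rsid is present
    - unlocatable: neither position nor rsid is present
    """
    counts = {"ok": 0, "rsid_only": 0, "unlocatable": 0}
    for row in rows:
        position = (row.get("base_pair_location") or "").strip()
        rsid = (row.get("rsid") or "").strip()
        if position not in MISSING:
            counts["ok"] += 1
        elif rsid.startswith("rs"):
            counts["rsid_only"] += 1
        else:
            counts["unlocatable"] += 1
    return counts


# Usage with a tiny in-memory table (made-up rows, not MVP data).
example = "chromosome\tbase_pair_location\trsid\n1\t100\trs1\n1\tNA\trs2\n"
print(summarise_positions(csv.DictReader(io.StringIO(example), delimiter="\t")))
```

Rows counted as `rsid_only` are the ones that would fail validation but could still be harmonised.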

ljwh2 commented 1 week ago

@Santhi1901 will do the curation for the paper

jiyue1214 commented 4 days ago

Here is the information Santhi curated: https://docs.google.com/spreadsheets/d/1Ny9krC0rKF7KacVzobA_MGXLuW-e6vwR/edit?gid=1487741722#gid=1487741722

Here is the overlap between the data Santhi curated from the publication and the data Yue downloaded from the FTP: https://docs.google.com/spreadsheets/d/1as93o8CfnwDMrHbB1Vv6YvyRYOCaqA2xWj_Yp5NkX48/edit?gid=0#gid=0

4332 of 6023 studies overlap.
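A comparison like this can be reproduced with plain set operations, as in the sketch below. It assumes each study can be reduced to a single comparable identifier string; which field serves as that key is not stated here, so `overlap_report` and its inputs are hypothetical.

```python
def overlap_report(curated_ids, downloaded_ids):
    """Split study identifiers into shared and one-sided groups.

    Both arguments are iterables of identifier strings; the choice of
    identifier (e.g. trait plus ancestry) is an assumption left to the caller.
    """
    curated = set(curated_ids)
    downloaded = set(downloaded_ids)
    return {
        "overlap": sorted(curated & downloaded),
        "curated_only": sorted(curated - downloaded),
        "downloaded_only": sorted(downloaded - curated),
    }


# Usage with made-up identifiers.
report = overlap_report(["A", "B", "C"], ["B", "C", "D"])
print(len(report["overlap"]), "studies overlap")
```

The two one-sided lists show which studies still need to be matched up by hand.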

Each study on FTP also has a metadata.txt file; we need to work together to get this information.

Santhi1901 commented 2 days ago

@jiyue1214. I've now added the meta-analysis result, and it looks like all the MVP data were added. For ease of checking, I placed the information here: https://docs.google.com/spreadsheets/d/1bfqExnmb_qktA-o3lYu3BtXPPfYMSfEpAuHI2uEwFFo/edit?gid=2036037118#gid=2036037118.

Please check and let me know if anything needs to be changed.

I also need to request new EFO terms for some traits that are mapped to the parent trait for now.