Open jiyue1214 opened 2 months ago
Start to download the files in the folder /hps/nobackup/parkinso/spot/gwas/scratch/MVP_data_from_dbGap
In progress: comparing the md5sum of the downloaded files with the md5sum provided on the FTP
4 of the 34 tar files have a different md5 and need to be re-downloaded (downloaded size - size on FTP):
- phs002453.MVP_R4.1000G_AGR.GIA.PheCodes_DigestiveSystem_batch1.analysis-PI.MULTI.tar (206G - FTP: 192G)
- phs002453.MVP_R4.1000G_AGR.GIA.PheCodes_EndocrineMetabolic_batch1.analysis-PI.MULTI.tar (140G - FTP: 140G)
- phs002453.MVP_R4.1000G_AGR.GIA.PheCodes_EndocrineMetabolic_batch2.analysis-PI.MULTI.tar (157G - FTP: 157G)
- phs002453.MVP_R4.1000G_AGR.GIA.PheCodes_SenseOrgans_batch2.analysis-PI.MULTI.tar (151G - FTP: 151G)
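The md5 comparison described above can be sketched roughly as follows. This is a minimal illustration, not the script actually used: the function names `md5_of_file` and `find_mismatches` are hypothetical, and the expected checksums are assumed to be already parsed into a dictionary of file name to md5 string.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 of a file, reading in chunks so that
    very large tar files are never loaded into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(directory, expected):
    """Return the sorted file names that need to be re-downloaded.

    `expected` maps file name -> md5 string as published on the FTP
    (assumed format). A file is reported if it is missing locally or
    if its local md5 differs from the expected one.
    """
    to_redownload = []
    for name, want in expected.items():
        path = Path(directory) / name
        if not path.is_file() or md5_of_file(path) != want.strip().lower():
            to_redownload.append(name)
    return sorted(to_redownload)
```

Reading in fixed-size chunks matters here because the tar files are well over 100G each.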
The second downloads are correct:

| md5_second_download | file_name | md5 from ftp |
|---|---|---|
| bf9d9a19d946032712374797d9835954 | phs002453.MVP_R4.1000G_AGR.GIA.PheCodes_SenseOrgans_batch2.analysis-PI.MULTI.tar | bf9d9a19d946032712374797d9835954 |
| 1e65f85c474f38a9191c98b0012848b9 | phs002453.MVP_R4.1000G_AGR.GIA.PheCodes_DigestiveSystem_batch1.analysis-PI.MULTI.tar | 1e65f85c474f38a9191c98b0012848b9 |
| 2c50f94f813c8f63f861e7da07f2d37f | phs002453.MVP_R4.1000G_AGR.GIA.PheCodes_EndocrineMetabolic_batch1.analysis-PI.MULTI.tar | 2c50f94f813c8f63f861e7da07f2d37f |
| dbc686a27c00e1e02833d1c9435d3d46 | phs002453.MVP_R4.1000G_AGR.GIA.PheCodes_EndocrineMetabolic_batch2.analysis-PI.MULTI.tar | dbc686a27c00e1e02833d1c9435d3d46 |
Ready for the next step
There are three different types of data content, and each type needs a corresponding JSON file:
MVP_1: 2 SNP_ID chrom pos ref alt ea af num_samples beta sebeta pval q_pval i2 direction
MVP_2: 836 SNP_ID chrom pos ref alt ea af num_samples beta sebeta pval r2 q_pval i2 direction
MVP_3: 5184 SNP_ID chrom pos ref alt ea af num_samples case_af num_cases control_af num_controls or ci pval r2 q_pval i2 direction
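To decide which JSON file a given summary statistics file needs, its header can be matched against the three column layouts listed above. The sketch below assumes the files are tab-delimited with the header on the first line; the names `LAYOUTS`, `classify_header` and `read_header` are hypothetical helpers for illustration only.

```python
import gzip

# Column layouts copied from the three types listed above.
LAYOUTS = {
    "MVP_1": ("SNP_ID chrom pos ref alt ea af num_samples "
              "beta sebeta pval q_pval i2 direction").split(),
    "MVP_2": ("SNP_ID chrom pos ref alt ea af num_samples "
              "beta sebeta pval r2 q_pval i2 direction").split(),
    "MVP_3": ("SNP_ID chrom pos ref alt ea af num_samples "
              "case_af num_cases control_af num_controls or ci "
              "pval r2 q_pval i2 direction").split(),
}


def classify_header(columns):
    """Return 'MVP_1', 'MVP_2' or 'MVP_3' if the column list matches a
    known layout exactly (same names, same order), otherwise None."""
    for name, layout in LAYOUTS.items():
        if list(columns) == layout:
            return name
    return None


def read_header(path):
    """Read the first line of a gzipped text file and split it on tabs.
    Assumption: the file is tab-delimited with a header line."""
    with gzip.open(path, "rt") as handle:
        return handle.readline().rstrip("\r\n").split("\t")
```

A file whose header returns `None` would need a manual look before a JSON file is assigned to it.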
Example of each type:
MVP_1: MVP_R4.1000G_AGR.GIA.Labs_batch1/MVP_R4.1000G_AGR.Albumin_Mean_INT.META.GIA.dbGaP.txt.gz
MVP_2: MVP_R4.1000G_AGR.GIA.Labs_batch1/MVP_R4.1000G_AGR.A1C_Max_INT.AFR.GIA.dbGaP.txt.gz
MVP_3: MVP_R4.1000G_AGR.GIA.PheCodes_CirculatorySystem_batch1/MVP_R4.1000G_AGR.Phe_394.AFR.GIA.dbGaP.txt.gz
MVP_3 does not have a standard error column and therefore cannot pass validation. MVP_3 files can still be submitted via our system; they will be labelled as non-GWAS-SSF and will need to be changed to pre-GWAS-SSF later for harmonisation.
The MVP data have another issue: some base_pair_location values are missing. These rows still have an rsid, so they can be harmonised, but they will not pass validation.
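The two validation problems above can be flagged per row with a check along these lines. This is only a sketch under stated assumptions: it uses the raw MVP column names (`sebeta`, `pos`, `SNP_ID`) and assumes they correspond to the standard error, base_pair_location and rsid fields; the set of missing-value markers and the function names are hypothetical.

```python
import re

# Assumed encodings of a missing value; adjust to what the files really use.
MISSING_MARKERS = {"", "NA", "NaN", "."}
RSID_PATTERN = re.compile(r"^rs\d+$")


def is_missing(value):
    """True if the value is absent or one of the assumed missing markers."""
    return value is None or value.strip() in MISSING_MARKERS


def check_row(row):
    """Return a list of validation problems for one row.

    `row` is a dict of column name -> string value. For MVP_3 files the
    'sebeta' key does not exist at all, so every row is reported as
    lacking a standard error.
    """
    problems = []
    if is_missing(row.get("sebeta")):
        problems.append("missing standard error")
    if is_missing(row.get("pos")):
        if RSID_PATTERN.match(row.get("SNP_ID") or ""):
            problems.append("missing position, rsid available")
        else:
            problems.append("missing position, no rsid")
    return problems
```

Rows reported as "missing position, rsid available" are the ones that fail validation but could still be harmonised by rsid.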
Start formatting the data.
@Santhi1901 will do the curation for the paper
Here is the information Santhi curated: https://docs.google.com/spreadsheets/d/1Ny9krC0rKF7KacVzobA_MGXLuW-e6vwR/edit?gid=1487741722#gid=1487741722
Here is the overlap between the data Santhi curated from the publication and the data Yue downloaded from the FTP: https://docs.google.com/spreadsheets/d/1as93o8CfnwDMrHbB1Vv6YvyRYOCaqA2xWj_Yp5NkX48/edit?gid=0#gid=0
4332/6023 studies overlap.
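An overlap like this can be computed with plain set operations once both sources are reduced to a shared study identifier. The sketch below is illustrative only: the function name `overlap_report` and the idea of a single comparable identifier per study are assumptions, not a description of how the spreadsheet was built.

```python
def overlap_report(curated_ids, downloaded_ids):
    """Compare two collections of study identifiers.

    Returns a dict with the identifiers found in both sources and the
    identifiers unique to each one, each as a sorted list.
    """
    curated = set(curated_ids)
    downloaded = set(downloaded_ids)
    return {
        "overlap": sorted(curated & downloaded),
        "only_curated": sorted(curated - downloaded),
        "only_downloaded": sorted(downloaded - curated),
    }
```

The two "only" lists are the studies that would need a manual check against the publication or the FTP.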
Each study on FTP also has a metadata.txt file; we need to work together to get this information.
@jiyue1214. I've now added the meta-analysis result, and it looks like all the MVP data were added. For ease of checking, I placed the information here: https://docs.google.com/spreadsheets/d/1bfqExnmb_qktA-o3lYu3BtXPPfYMSfEpAuHI2uEwFFo/edit?gid=2036037118#gid=2036037118.
Please check and let me know if anything needs to be changed.
I also need to request EFO terms for some traits that are matched to the parent trait for now.
MVP data: MVP is an ongoing prospective cohort study and mega‐biobank in the Department of Veterans Affairs Healthcare System designed to study genetic influences on health and disease among veterans. This is the accession to hold publicly available results.
It includes genome-wide associations for 2068 traits from 635,969 participants in the Department of Veterans Affairs Million Veteran Program, a longitudinal study of diverse United States Veterans.
Some useful links: