merge multiple GWAS files

jielab commented 3 years ago

Hi, guys:

We could use bcftools to merge multiple regular VCF files with genotype data.

Now, I have multiple GWAS VCF files, each of which has rsID, REF, ALT, BETA, SE, P, etc. Can I use bcftools to merge them so that I can then extract the BETA and SE from the merged file to run downstream analyses? Are there other tools for merging multiple GWAS files, which usually have millions of SNPs? The key here is that most of these GWAS files only have A1 and A2 instead of REF and ALT.
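
To make the A1/A2 versus REF/ALT point concrete, here is a minimal Python sketch of what aligning one summary-statistics row to a reference might look like. It assumes that A1 is the effect allele (this varies between GWAS files), that alleles are already on the same strand as the reference, and the function name `harmonise` is hypothetical. It is an illustration only, not the logic of any particular tool.

```python
def harmonise(effect_allele, other_allele, beta, ref, alt):
    """Return beta expressed relative to ALT, or None if alleles do not match.

    Assumption: effect_allele is the allele that beta refers to (often A1).
    Strand flips and multi-allelic sites are deliberately not handled here.
    """
    ea, oa = effect_allele.upper(), other_allele.upper()
    ref, alt = ref.upper(), alt.upper()
    if ea == alt and oa == ref:
        # Effect allele is already ALT: keep the sign.
        return beta
    if ea == ref and oa == alt:
        # Effect allele is REF: flip the sign so beta relates to ALT.
        return -beta
    # Alleles do not correspond to this REF/ALT pair.
    return None


if __name__ == "__main__":
    print(harmonise("G", "A", 0.12, ref="G", alt="A"))
```

Rows that return `None` would need separate handling (for example strand checks or exclusion) before the files can be compared.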

I wish GWAS VCF files were widely used, but these days many tools such as LDSC, PheWEB, and 2-sample-MR don't support the VCF format.

Best regards, jie

mcgml commented 3 years ago

Hi Jie

Yes you can, see example command. I usually use bcftools but you can also use GATK/Picard. If you used gwas2vcf to prepare your VCF files, then the alleles and effect signs are automatically flipped so that the REF allele is the non-effect allele and the beta relates to the ALT allele. In that case, all GWAS are comparable.
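
As a rough illustration of the bcftools route, the Python sketch below only builds the command lines for a merge followed by a query. It assumes the inputs are bgzip-compressed and indexed, that each file carries a distinct sample (study) name, and that effect size and standard error are stored in FORMAT fields named `ES` and `SE` as in GWAS-VCF. The file names and helper names are hypothetical, so adjust them to your data.

```python
import shlex


def merge_command(vcf_paths, out_path):
    """Build a `bcftools merge` call writing a compressed VCF."""
    return ["bcftools", "merge", "-O", "z", "-o", out_path, *vcf_paths]


def query_command(merged_path):
    """Build a `bcftools query` call printing ES and SE for every study."""
    # The [...] part of the format string is repeated once per sample.
    fmt = r"%CHROM\t%POS\t%ID\t%REF\t%ALT[\t%ES\t%SE]\n"
    return ["bcftools", "query", "-f", fmt, merged_path]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    merge = merge_command(["study1.vcf.gz", "study2.vcf.gz"], "merged.vcf.gz")
    query = query_command("merged.vcf.gz")
    print(shlex.join(merge))
    print(shlex.join(query))
    # To actually run them: subprocess.run(merge, check=True), and so on.
```

Keeping the commands as lists makes it straightforward to pass them to `subprocess.run` without shell quoting problems.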

Cheers Matt

jielab commented 3 years ago

Dear Matt:

Thank you very much!

Just curious, it says that "GWAS2VCF produces GWAS-VCF format files". But the GWAS-VCF format is simply the standard VCF format, correct? Of course, the VCF files for GWAS do not have genotype data but only summary statistics, just like the dbSNP summary files located at https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/.

I have been using GCTA and LDSC to run many GWAS. LDSC uses munge_sumstats.py to reformat GWAS files. GCTA likewise requires certain columns with specific names. Recently, I also began to use PheWEB. The problem is that none of these "mainstream" tools supports the VCF format, so I have to use bcftools to generate TXT files for them. I really wish the community would begin to support and adopt the VCF format, especially given that GWAS files now include millions of rows and need fast querying.
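
For illustration, the conversion from one GWAS-VCF record to a plain-text row could look roughly like the Python sketch below. It assumes a single study column, FORMAT keys named `ES`, `SE` and `LP` (with `LP` being -log10 of the p-value) as in GWAS-VCF, and that the beta relates to ALT. The output column names are only an example and the function name is hypothetical, so check what each downstream tool expects.

```python
def gwas_vcf_line_to_row(line):
    """Convert one tab-delimited GWAS-VCF data line into a dict of summary stats.

    Assumptions: one sample column, FORMAT keys ES/SE/LP present,
    and the effect size refers to the ALT allele.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, rsid, ref, alt = fields[:5]
    keys = fields[8].split(":")    # FORMAT column, e.g. ES:SE:LP
    values = fields[9].split(":")  # first (only) study column
    stats = dict(zip(keys, values))
    return {
        "SNP": rsid,
        "A1": alt,  # effect allele
        "A2": ref,  # non-effect allele
        "BETA": float(stats["ES"]),
        "SE": float(stats["SE"]),
        "P": 10 ** -float(stats["LP"]),  # LP is -log10(p)
    }


if __name__ == "__main__":
    # Made-up record, for illustration only.
    example = "1\t1000\trs1\tG\tA\t.\tPASS\t.\tES:SE:LP\t0.05:0.01:2\n"
    print(gwas_vcf_line_to_row(example))
```

In practice `bcftools query` does the same extraction much faster. The sketch only shows which fields end up in which columns.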

What is your perspective on this?

Best regards, Jie

mcgml commented 3 years ago

Hi Jie,

Yes, GWAS-VCF is just a suggested standard for using VCF to store summary stats, you could prepare your own and even use different keys/columns but we aim for consistency to allow inter-study comparisons. I recently uploaded a diagram of the gwas2vcf workflow which shows each step and might be of interest.

With respect to compatibility with existing tools, my colleague is developing an R-package gwasglue that automates analysis of summary stats in VCF using a range of tools. We also have a fork of LDSC which reads from VCF but only supports univariate analysis at the moment. These projects are under active development and we hope to provide integrations with other tools in the future.

Thanks Matt

jielab commented 3 years ago

Dear Matt:

Thank you very much for letting me know about gwasglue. It is really a great idea, and I strongly feel that the human genomics research field needs something like this. I just posted a suggestion of a few powerful tools published this year at https://github.com/MRCIEU/gwasglue/issues/27. As you know, most researchers would like to use newly published tools, and usually one is enough for each category of analysis.

For GWAS2VCF, I also have some minor suggestions:

It would be nice (and necessary) for the params.json file to be a bit more flexible. Counting column numbers (starting from 0) is error-prone, especially when the GWAS file has a lot of columns. It would be great if we could also use column names when header:true, such as: chr_col: chrom,chr,chromosome. Yes, it would be wonderful to allow multiple options, separated by commas or anything you guys prefer.
It would be good for GWAS2VCF to make sure the rsID is present and correct, since it already uses the dbSNP reference file, which has rsID and CHR:POS:REF:ALT. Like PLINK, it would be nice for GWAS2VCF to write out a log file reporting non-matched rsIDs. It could also add the rsID if the original GWAS file does not have one. As we know, these days many raw GWAS files don't have rsIDs, yet rsID is still a mandatory field for many tools such as LDSC.
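
To show what the first suggestion might mean in practice, here is a small Python sketch of resolving a column either by index or by a comma-separated list of candidate names. This is a hypothetical helper written for this thread; it does not describe how GWAS2VCF currently reads params.json.

```python
def resolve_column(header, spec):
    """Return the 0-based column index described by spec.

    spec may be an int (used as-is) or a comma-separated string of
    candidate column names, matched case-insensitively against header.
    """
    if isinstance(spec, int):
        return spec
    lowered = [name.strip().lower() for name in header]
    for candidate in spec.split(","):
        candidate = candidate.strip().lower()
        if candidate in lowered:
            return lowered.index(candidate)
    raise KeyError(f"none of {spec!r} found in header {header!r}")


if __name__ == "__main__":
    # Made-up header, for illustration only.
    header = ["SNP", "CHR", "BP", "A1", "A2", "BETA", "SE", "P"]
    print(resolve_column(header, "chrom,chr,chromosome"))
    print(resolve_column(header, 2))
```

Raising an error when no candidate matches keeps a misnamed column from being silently skipped.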

Thank you for building the nice GWAS2VCF and pushing for a standard, which is dearly needed.

Best regards, Jie

jielab commented 3 years ago

Dear Matt:

I found that it takes a very long time to run GWAS2VCF on my laptop. I had to kill it after it had been running for a few hours...

These days, some GWAS are in tabix-indexed, tab-delimited BED format, for example, the UK Biobank biochemistry GWAS (a Nature Genetics paper), posted at https://doi.org/10.35092/yhjc.12355382. It has 35 GWAS in .GZ and .TBI format. I think this format is similar to VCF and much better than a regular TXT file, as it is much faster to query.

I don't know whether GWAS2VCF has a fast way to work on tabix-indexed files like these. I think bcftools could work on these tabix-indexed files directly.

BTW, there is a Python version of GWAS2VCF and also the gwasvcf R package. What is the main difference between the two? Right now, I am using the seqminer R package to read these .GZ files into R.

Best regards, Jie

MRCIEU / gwas2vcf

merge multiple GWAS files #70