Inquiry on QC and GWAS Pipeline

Cloufield / gwaslab

A Python package for handling and visualizing GWAS summary statistics. https://cloufield.github.io/gwaslab/

GNU General Public License v3.0

Inquiry on QC and GWAS Pipeline #99

Open buer19970329 opened 5 months ago

buer19970329 commented 5 months ago

Hi,

Thanks for the invaluable tools and tutorial resources you provided!

I am currently working with imputed data from the UK Biobank, and I have encountered a challenge regarding the storage size of the data for each chromosome, which is significantly larger compared to the sample dataset used in your GWAS tutorial. Given this, I have a specific query regarding the pipeline:

Would there be any significant differences in the results if I perform QC on each chromosome data separately and then merge them together for the GWAS analysis, compared to merging all chromosomes data together first and then conducting the QC and GWAS? Is the former pipeline, where QC is conducted on individual chromosomes before merging, considered acceptable in the context of standard practices? Any insights or recommendations you could provide on this matter would be greatly appreciated.

Thank you for your time and assistance.

Cloufield commented 5 months ago

Hi, sorry, I am not sure whether you are asking about QC for sumstats or for genotypes. What analysis do you plan to do with the imputed data?

For GWAS sumstats, the QC/harmonization steps in basic_check() and harmonize() in gwaslab are all at single variant level, which means that performing QC on each chromosome separately and performing QC on all chromosomes at once will be the same.
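As a rough illustration of why the order does not matter for single-variant filters, here is a toy sketch in plain Python. It is not gwaslab code: the field names (`eaf`, `info`), the thresholds, and the helper functions are assumptions made up for the example.

```python
# Toy summary statistics; field names and thresholds are illustrative assumptions.
sumstats = [
    {"snp": "rs1", "chrom": 1, "eaf": 0.20, "info": 0.95},
    {"snp": "rs2", "chrom": 1, "eaf": 0.001, "info": 0.99},
    {"snp": "rs3", "chrom": 2, "eaf": 0.45, "info": 0.60},
    {"snp": "rs4", "chrom": 2, "eaf": 0.30, "info": 0.90},
]


def passes_qc(variant, min_maf=0.01, min_info=0.8):
    """Decide on one variant using only that variant's own values."""
    maf = min(variant["eaf"], 1 - variant["eaf"])
    return maf >= min_maf and variant["info"] >= min_info


def qc(variants):
    """Keep the variants that pass the single-variant filter."""
    return [v for v in variants if passes_qc(v)]


# QC on all chromosomes at once.
qc_all = qc(sumstats)

# QC on each chromosome separately, then concatenate the results.
qc_by_chrom = []
for chrom in sorted({v["chrom"] for v in sumstats}):
    qc_by_chrom.extend(qc([v for v in sumstats if v["chrom"] == chrom]))

same_result = qc_all == qc_by_chrom
kept = [v["snp"] for v in qc_all]
```

Because `passes_qc` never looks at any other variant, splitting the input by chromosome cannot change which variants are kept.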

For genotypes, usually, if the QC is at single variant level (like MAF, variant missing rate, HWE...), it would be completely ok to do so. For other steps that require variant information across the genome, you may need to merge the genotypes before calculation or calculate them manually based on the results for each chromosome.

buer19970329 commented 5 months ago

Hi Yunye @Cloufield ,

Thank you very much for your response and guidance.

I apologize for any confusion caused by my previous messages. I am new to the field of GWAS and still trying to understand some of the concepts. Your explanations have been very very helpful!

I plan to use the UK Biobank imputed data for conducting GWAS statistical analysis. Specifically, I intend to start with a practice run by exploring the genetic relationship with height. Based on your advice, I will perform the QC and GWAS at a single variant level for each chromosome separately and then merge the summary statistics afterwards. Additionally, I plan to follow the tutorial's guidance for performing PRS and Mendelian Randomization.

I would also like to ask for further clarification on which steps require variant information across the genome, as you mentioned. Understanding this will help me ensure that I conduct my analyses correctly.

Thank you again for your time and assistance.

Cloufield commented 5 months ago

Here is an overview of the workflow.

I think what you have now is the dosage data produced by imputation.

The typical workflow is based on array data (much smaller in size compared with imputed data).
You perform variant and sample QC using the array data. Using the QCed array data, you can then conduct PCA, relatedness estimation, phasing, and imputation. The imputed datasets are then used for GWAS.

For GWAS using plink (simple linear models), the imputed dataset is sufficient. If using other tools like SAIGE/REGENIE (two-step approach), we usually need both the array data (for step 1) and the imputed datasets (for step 2).

Steps that require variant information across the genome include calculation of heterozygosity, sample missing rate, relatedness estimation, and PCA.
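Some of these can still be assembled from per-chromosome results. Taking sample missing rate as an example, a minimal sketch could sum the per-chromosome counts for each sample. The sample IDs, the counts, and the function name below are hypothetical; the point is only that counts should be summed, since averaging per-chromosome rates would weight small chromosomes too heavily.

```python
# Hypothetical per-chromosome results:
# for each sample, (number of missing calls, number of variants genotyped).
per_chrom_counts = {
    1: {"sampleA": (10, 1000), "sampleB": (50, 1000)},
    2: {"sampleA": (0, 500), "sampleB": (100, 500)},
}


def genome_wide_missing_rate(per_chrom):
    """Sum missing and total counts over chromosomes, then divide once."""
    totals = {}
    for counts in per_chrom.values():
        for sample, (n_missing, n_variants) in counts.items():
            miss, total = totals.get(sample, (0, 0))
            totals[sample] = (miss + n_missing, total + n_variants)
    return {s: miss / total for s, (miss, total) in totals.items()}


rates = genome_wide_missing_rate(per_chrom_counts)
```

Here `sampleB` has 150 missing calls out of 1500 variants, so its genome-wide rate is 0.1, whereas the plain average of its two per-chromosome rates (0.05 and 0.2) would be 0.125.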

buer19970329 commented 5 months ago

@Cloufield Thank you very much for your detailed explanation. I have learned a lot from it.

However, there is still one point I am not entirely clear on. Suppose I run PCA on the QCed array data to obtain genome-wide information, such as 10 PCs computed across the 22 chromosomes, and then run GWAS on the imputed data for a single chromosome using these PCs as covariates. Would that be acceptable? Thank you again for your help!

Cloufield commented 5 months ago

Yes, that is the common way to do GWAS. PCs reflect the genome-wide ancestry information about individuals.
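A minimal sketch of that setup, assuming plink2 and per-chromosome files already converted to plink2 format: the same covariate file of array-derived PCs is passed to every per-chromosome run. All file names and the helper function are hypothetical, and the flags should be checked against the plink2 documentation for your version before running anything.

```python
def build_glm_commands(pfile_template, pheno_file, covar_file,
                       n_pcs=10, chroms=range(1, 23)):
    """Build one plink2 command per chromosome, all sharing the same PCs."""
    commands = []
    for chrom in chroms:
        commands.append([
            "plink2",
            "--pfile", pfile_template.format(chrom=chrom),  # per-chromosome genotypes
            "--pheno", pheno_file,
            "--covar", covar_file,               # genome-wide PCs from array data
            "--covar-name", f"PC1-PC{n_pcs}",    # assumes columns named PC1..PCn
            "--glm", "hide-covar",
            "--out", f"gwas_chr{chrom}",
        ])
    return commands


# Hypothetical file names, for illustration only.
commands = build_glm_commands(
    "ukb_imputed_chr{chrom}", "height.pheno", "array_pcs.covar"
)
```

Each entry in `commands` is an argument list that could be passed to `subprocess.run`; only the genotype input and output prefix change between chromosomes, while the PCs stay the same.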

buer19970329 commented 5 months ago

@Cloufield Thank you very much for your help. I will try it now :)