Cloufield / GWASTutorial

GWAS Tutorial for Beginners
https://cloufield.github.io/GWASTutorial/

Plink file for calculating LD matrix not found #4

Closed Habush closed 2 months ago

Habush commented 2 months ago

Thanks for creating this amazing tutorial!

In the fine-mapping tutorial, the script calculate_ldr.sh calculates the LD matrix from the plink file ../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020. However, this file is not provided in the 01_Dataset folder, and there is no script to generate it. Where can I find this file?

Cloufield commented 2 months ago

Hi, we changed the sample genotype data to ../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.zip last year. You can unzip it and replace the old path with the new one for now. The workflow is still the same, though the results might differ slightly.
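In case it helps other readers, the workaround might look like this (a sketch only: the `--bfile` flag and the old prefix inside calculate_ldr.sh are assumptions based on this thread, not the actual script contents):

```sh
# Unzip the new genotype archive into the dataset folder
unzip ../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.zip \
      -d ../01_Dataset/

# Then, in calculate_ldr.sh, point PLINK at the new file prefix, e.g. change
#   --bfile ../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.maf005.thinp020
# to
#   --bfile ../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing
```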

I will also update the finemapping part soon with the latest files. Thanks for pointing it out.

Cloufield commented 2 months ago

Just updated. https://cloufield.github.io/GWASTutorial/finemapping_susie/

Habush commented 2 months ago

Hi,

Thanks for the quick reply! I can now generate the LD matrix using the updated tutorial.

If you don't mind, I'd like to ask a question about how the LD matrix is calculated.

My question is: say I have real GWAS summary statistics where I know the ancestry of the population on which the association test was run (e.g. EAS), and I want to use 1000 Genomes as the reference panel to calculate LD. How would I use PLINK in the same way as you did in the tutorial? Essentially, what I'm asking is: how can I generate files similar to those in ../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.zip and use them as input for PLINK?

Apologies if this is a somewhat dumb question; I'm a newbie to GWAS computations.

Cloufield commented 2 months ago

Thanks for your question. That's also one of the questions I had when I started learning GWAS a few years ago. You can refer to this tutorial for how to download the 1000 Genomes VCFs and convert them to plink format: https://www.biostars.org/p/335605/ (steps 1-5). Based on that, you can perform additional steps with plink, such as customized QC (e.g. MAF/HWE filters) or extracting the samples you need (e.g. EAS). Then you will have the plink files ready for your LD computations.
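For readers following along, the Biostars steps plus the extra filtering could be sketched roughly as below (a hedged sketch: the file names, the EAS sample list, and the QC thresholds are illustrative placeholders, not files shipped with this tutorial):

```sh
# 1) Convert one chromosome of the 1000 Genomes VCF to PLINK binary format
#    (chr1 as an example; loop over the other autosomes as needed)
plink --vcf ALL.chr1.phase3.genotypes.vcf.gz \
      --make-bed \
      --out 1kg.chr1

# 2) Keep only the EAS samples (eas_samples.txt is a placeholder FID/IID list
#    you would build from the 1000 Genomes sample panel file)
plink --bfile 1kg.chr1 \
      --keep eas_samples.txt \
      --make-bed \
      --out 1kg.chr1.eas

# 3) Basic QC, e.g. MAF and HWE filters (thresholds are illustrative)
plink --bfile 1kg.chr1.eas \
      --maf 0.01 \
      --hwe 1e-6 \
      --make-bed \
      --out 1kg.chr1.eas.qc
```

The resulting .bed/.bim/.fam files can then be used as the LD reference input, the same way the tutorial's files are.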

Habush commented 2 months ago

Thanks for sharing the link. In a similar question on the same forum, I saw a comment saying that PLINK 1.9 can read VCF files directly. Hence, I'm thinking of using the population-specific files (such as 1kg_eas_hg19) in the gwaslab.check_available_ref() dictionary and passing them to PLINK. Do you think that's okay?

Habush commented 2 months ago

Interestingly, I can't reproduce the credible sets returned by susieR after following the tutorial. Please see the attached screenshot.

The only change from the tutorial is the harmonization step. In my notebook I do the harmonization as follows:

locus.harmonize(n_cores=8,  
                basic_check=False,
                ref_seq=gl.get_path("ucsc_genome_hg19"), 
                ref_rsid_tsv=gl.get_path("1kg_dbsnp151_hg19_auto"),
                ref_rsid_vcf=gl.get_path("dbsnp_v156_hg19"))

Whereas in the tutorial it's via

locus.harmonize(basic_check=False, ref_seq="/home/yunye/CommonData/Reference/genome/humanG1Kv37/human_g1k_v37.fasta")

I wanted to attach the Jupyter notebook here, but GitHub isn't allowing me to.

Cloufield commented 2 months ago

Usually fine-mapping requires an in-sample LD reference (computed from the same samples you used to run the GWAS). But for other purposes, you can certainly use the prepared VCF files from gwaslab.
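For reference, looking up one of those prepared panels with gwaslab (using the same functions already mentioned in this thread, and assuming gwaslab is installed) might look like:

```python
import gwaslab as gl

# List the reference files gwaslab knows about
gl.check_available_ref()

# Resolve the local path of the EAS panel (this may download the file on
# first use), then pass that path to PLINK as the LD reference
eas_path = gl.get_path("1kg_eas_hg19")
```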

Cloufield commented 2 months ago

From the indices of the variants in your results, it seems that the sumstats are different (maybe a different locus?). I don't think locus.harmonize would change the results. Would you please check the logs and see at which step the results start to differ? Thanks.

Habush commented 2 months ago

I used the locus from the tutorial notebook: locus = sumstats.filter_value('CHR==2 & POS>55074452 & POS<56074452'). I uploaded the Jupyter notebook I used to follow along with the tutorial here

I realized that I generated the summary statistics data 1kgeas.B1.glm.firth using the shell scripts in the 06_Association_tests folder. I did this because I couldn't find the file in the repo, and perhaps I made a mistake while running the scripts. If you provide me with that file, I'm happy to re-run the notebook.

Cloufield commented 2 months ago

Oops, my fault. I guess I might have used an old version of the sumstats. I will double-check, fix the file, and let you know tomorrow. But judging from your notebook, I think you have basically completed the workflow correctly. Sorry for my mistake.

Cloufield commented 2 months ago

Hi, I checked again, but the sumstats I used were the latest ones. I am wondering if you re-generated the phenotype file for the GWAS? Anyway, I uploaded the sumstats 1kgeas.B1.glm.firth.gz to 12_fine_mapping. You can use this one directly.

Habush commented 2 months ago

Hi,

(Sorry for the late reply)

I ran the fine-mapping analysis with the 1kgeas.B1.glm.firth.gz file you shared above, and I still get indices different from the ones in the tutorial notebook.

> I am wondering if you re-generate the phenotype file for GWAS?

No, I didn't. Should I do that?

Cloufield commented 2 months ago

Hi, sorry, I thought you might have re-generated the phenotype file, which would explain the difference. To be honest, I have no clue now. Would you please share your notebook using 1kgeas.B1.glm.firth.gz again? I will look into it. Thanks.

Habush commented 2 months ago

You can find the latest version of the notebook here

Cloufield commented 2 months ago

I finally found the problem.
There was an error in the default value check for OR in the old version. Updating gwaslab to the latest version should solve it:

pip install gwaslab==3.4.43

or, if you stay on the old version, add this argument in basic_check():

sumstats.basic_check(sanitycheckstats_args={"OR":(-10,10)})

Sorry for the inconvenience.

Habush commented 2 months ago

No worries. I'm glad you were able to find the issue.

I tried to install the latest gwaslab via pip install gwaslab==3.4.43, but pip is failing with No matching distribution found for gwaslab==3.4.43. Perhaps the PyPI index isn't updated yet.

Hence, I added the suggested argument and ran sumstats.basic_check(sanitycheckstats_args={"OR":(-10,10)}). Now I get one credible set with variables 182 200 203 206 in it. These aren't exactly the variables in the tutorial (200 218 221 224), but they're very close, and I think we can safely close this issue. What do you think?

Cloufield commented 2 months ago

Maybe your Python version is different, so the distribution could not be found; gwaslab currently supports Python 3.9 and 3.10. As for the fine-mapping results, I think the difference is simply due to the changed default behavior for variant checking between gwaslab versions. And sure, we can close it. Please let me know if you have any other questions.

Habush commented 2 months ago

Thank you for your patience and responses. I don't have further questions at the moment, and I'll create a new issue if anything comes up.