getian107 / PRScsx

Cross-population polygenic prediction
MIT License
65 stars 20 forks source link

Getting error when using PRScsx #28

Closed ShoaibBME closed 1 year ago

ShoaibBME commented 1 year ago

Hi I am trying to apply PRScsx using my genotyping data, this is how my code looks like, please help me to solve this issue
python PRScsx/PRScsx.py --ref_dir=/hpf/projects/arnold/data/genotypes/OSC_HumanCoreExome_recall/imputation/beagle/current/Shoaib/test/ldblk_1kg_eur/ldblk_1kg_eur --bim_prefix=/hpf/projects/arnold/data/genotypes/OSC_HumanCoreExome_recall/imputation/beagle/current/Shoaib/test/osc.QC1 --sst_file=/hpf/projects/arnold/data/genotypes/OSC_HumanCoreExome_recall/imputation/beagle/current/Shoaib/test/sumstatsn.txt --n_gwas=38691 --pop=EUR --chrom=21 --out_dir=/hpf/projects/arnold/data/genotypes/OSC_HumanCoreExome_recall/imputation/beagle/current/Shoaib/test --out_name=PRS

--ref_dir=/hpf/projects/arnold/data/genotypes/OSC_HumanCoreExome_recall/imputation/beagle/current/Shoaib/test/ldblk_1kg_eur/ldblk_1kg_eur --bim_prefix=/hpf/projects/arnold/data/genotypes/OSC_HumanCoreExome_recall/imputation/beagle/current/Shoaib/test/osc.QC1 --sst_file=['/hpf/projects/arnold/data/genotypes/OSC_HumanCoreExome_recall/imputation/beagle/current/Shoaib/test/sumstatsn.txt'] --a=1 --b=0.5 --phi=None --n_gwas=[38691] --pop=['EUR'] --n_iter=1000 --n_burnin=500 --thin=5 --out_dir=/hpf/projects/arnold/data/genotypes/OSC_HumanCoreExome_recall/imputation/beagle/current/Shoaib/test --out_name=PRS --chrom=['21'] --meta=FALSE --seed=None

only 1 discovery population detected

process chromosome 21

... parse bim file: /hpf/projects/arnold/data/genotypes/OSC_HumanCoreExome_recall/imputation/beagle/current/Shoaib/test/osc.QC1.bim ... ... 118685 SNPs on chromosome 21 read from /hpf/projects/arnold/data/genotypes/OSC_HumanCoreExome_recall/imputation/beagle/current/Shoaib/test/osc.QC1.bim ... Traceback (most recent call last): File "/hpf/projects/arnold/data/genotypes/OSC_HumanCoreExome_recall/imputation/beagle/v0.1.3/Shoaib/test/PRScsx/PRScsx.py", line 215, in main() File "/hpf/projects/arnold/data/genotypes/OSC_HumanCoreExome_recall/imputation/beagle/v0.1.3/Shoaib/test/PRScsx/PRScsx.py", line 198, in main sst_dict[pp] = parse_genet.parse_sumstats(ref_dict, vld_dict, param_dict['sst_file'][pp], param_dict['pop'][pp], param_dict['n_gwas'][pp]) ^^^^^^^^ UnboundLocalError: cannot access local variable 'ref_dict' where it is not associated with a value

getian107 commented 1 year ago

Hi-- It looks that you only have input data from one population? In this case PRS-CS might be the tool you'd like to use. Using PRS-CSx for one input population is also fine and should be equivalent to PRS-CS. I think the issue here was that the directory to the reference panel was not correctly specified -- you only need to specify the path to the directory that contains the SNP information file and LD reference panels and don't need to point to the reference of a specific population. This is slightly different from how reference is specified in PRS-CS.

ShoaibBME commented 1 year ago

Hi Tian, Thanks for prompt response, really appreciate it. PRS-CS seems working in my case as I have european GWAS summary stats and genotyping data of europeans. My group is wondering if we can apply this tool, in case we want to calculate PRS in south asian and east asian individuals as well. The important point here, we only have summary stats of EUR ancestry and we want to calculate PRS in South Asian ppl in addition to europeans. You think, this tool can be used as it is based in LD reference panels from different ancestries. If this can be done, with a single GWAS how accurate PRS in other ancestries would be, This was the point of our main discussion in our group meeting today. Can you please send paper, where someone used PRScsx with just one summary stats and calculated PRS in two or more population. Thanks so much for your amazing work

getian107 commented 1 year ago

Hi - The main difference between PRS-CS and PRS-CSx is that PRS-CS can only be applied to GWAS summary statistics from one population while PRS-CSx can integrate summary statistics from multiple populations. If you only have European summary statistics and would like to evaluate the score in non-European populations, you can use PRS-CS. It's a valid analysis although the PRS is expected to be less accurate in non-European samples. Applying PRS-CSx to the European summary statistics in this case doesn't make the score more portable; in fact PRS-CSx reduces to PRS-CS when the input is from a single population.

ShoaibBME commented 1 year ago

Hi Tian, when I am using single GWAS summary stats, single LD ref panel, PRScs is working well, but I am continuously getting error when I am using PRScsx, please check what is the potential problem, I have correctly mentioned the ref directory where the ref panel and snp_info from 1000 genome is available, still unable to find the cause of error when using PRScsx

python PRScsx.py --ref_dir=/hpf/projects/arnold/users/Shoaib2/ldblk_1kg_eas --bim_prefix=/hpf/projects/arnold/data/genotypes/OSC_HumanCoreExome_EAS_recall/imputation/beagle/v0.1.4/plink_hard_calls/osc_eas_beagle.hardcalls --sst_file=/hpf/projects/arnold/users/Shoaib2/sumstatsn.txt --n_gwas=38691 --pop=EAS --chrom=21 --out_dir=/hpf/projects/arnold/users/Shoaib2/ --out_name=PRS

--ref_dir=/hpf/projects/arnold/users/Shoaib2/ldblk_1kg_eas --bim_prefix=/hpf/projects/arnold/data/genotypes/OSC_HumanCoreExome_EAS_recall/imputation/beagle/v0.1.4/plink_hard_calls/osc_eas_beagle.hardcalls --sst_file=['/hpf/projects/arnold/users/Shoaib2/sumstatsn.txt'] --a=1 --b=0.5 --phi=None --n_gwas=[38691] --pop=['EAS'] --n_iter=1000 --n_burnin=500 --thin=5 --out_dir=/hpf/projects/arnold/users/Shoaib2/ --out_name=PRS --chrom=['21'] --meta=FALSE --seed=None

only 1 discovery population detected

process chromosome 21

... parse bim file: /hpf/projects/arnold/data/genotypes/OSC_HumanCoreExome_EAS_recall/imputation/beagle/v0.1.4/plink_hard_calls/osc_eas_beagle.hardcalls.bim ... ... 103373 SNPs on chromosome 21 read from /hpf/projects/arnold/data/genotypes/OSC_HumanCoreExome_EAS_recall/imputation/beagle/v0.1.4/plink_hard_calls/osc_eas_beagle.hardcalls.bim ... Traceback (most recent call last): File "PRScsx.py", line 215, in main() File "PRScsx.py", line 198, in main sst_dict[pp] = parse_genet.parse_sumstats(ref_dict, vld_dict, param_dict['sst_file'][pp], param_dict['pop'][pp], param_dict['n_gwas'][pp]) UnboundLocalError: local variable 'ref_dict' referenced before assignment

getian107 commented 1 year ago

Hi - I think the problem is still the specification of the reference panel. In --ref_dir you only need to point to a directory that contains snpinfo_mult_1kg_hm3 and ldblk_1kg_eur, ldblk_1kg_eas, etc. You don't need to specify ldblk_1kg_eas in --ref_dir. The population of the input summary stats is specified using --pop in PRS-CSx. When you use --pop=EAS, the program will look for ldblk_1kg_eas within the directory you specified by --ref_dir. This is slightly different from how reference directory is specified in PRS-CS. Let me know if this resolves your issue.

loveyu3317 commented 1 year ago

You can try to use --ref_dir=/hpf/projects/arnold/users/Shoaib2 and put snpinfo_mult_ukbb_hm3 into Shoaib2 folder.

rskl92 commented 1 year ago

Hi, I am also experiencing a similar issue. I am trying to run the script on one chromosome to try it.

process chromosome 21

... parse bim file: /home/PRScx/Target/filtered_callRate_Chr21.bim ... ... 109196 SNPs on chromosome 21 read from /home/PRScx/Target/filtered_callRate_Chr21.bim ... Traceback (most recent call last): File "/home/PRScx/PRScsx-master/PRScsx.py", line 215, in main() File "/home/PRScx/PRScsx-master/PRScsx.py", line 198, in main sst_dict[pp] = parse_genet.parse_sumstats(ref_dict, vld_dict, param_dict['sst_file'][pp], param_dict['pop'][pp], param_dict['n_gwas'][pp]) UnboundLocalError: local variable 'ref_dict' referenced before assignment

I specify

export MKL_NUM_THREADS=$N_THREADS export NUMEXPR_NUM_THREADS=$N_THREADS export OMP_NUM_THREADS=$N_THREADS

chr=21 python $codeDir/PRScsx.py --ref_dir=$refDir --bim_prefix=$targetDir/filtered_callRate_Chr${chr} --sst_file=$discoveryDir/sumstats1.csv,$discoveryDir/sumstats2.csv --n_gwas=8006,788989 --pop=EUR,AFR --out_dir=$outputDir/PRS --out_name=PRS_Chr${chr} --phi=1e-2 --n_iter=2000 --n_burnin=1000 --thin=5 --chrom=${chr} --seed=1234

I would appreciate your advice. Thanks

getian107 commented 1 year ago

Hi - There seems to be a problem when specifying the directory of the reference panel. Could you double check that the directory for --ref_dir contains snpinfo_mult_1kg_hm3 and population-specific reference panels (i.e., ldblk_1kg_eur, ldblk_1kg_eas, ldblk_1kg_afr, etc)?

nini-tech23 commented 1 year ago

Hi, I am also experiencing same error while running PRScsx. Traceback (most recent call last): File "/bin2/PRScsx", line 215, in main() File "/bin2/PRScsx", line 198, in main sst_dict[pp] = parse_genet.parse_sumstats(ref_dict, vld_dict, param_dict['sst_file'][pp], param_dict['pop'][pp], param_dict['n_gwas'][pp]) ^^^^^^^^ UnboundLocalError: cannot access local variable 'ref_dict' where it is not associated with a value

Even I specified the directory of the reference panel (snpinfo_mult_1kg_hm3,ldblk_1kg_eur, ldblk_1kg_eas) correctly, I still got same error. Is there any other factor causing this error except wrong specification of the directory?

getian107 commented 1 year ago

Hi - Could you send the commend line that you used to call PRS-CSx?

nini-tech23 commented 1 year ago

Hello @getian107, here is my command line:

python3 /home2/ldgbsr/UA_PRS_project/Code/Tools/PRScsx/PRScsx.py --ref_dir=/data6/PRScsx_data/PRScsx-tmp/1000G --bim_prefix=/home2/ldgbsr/UA_PRS_project/Analysis/BestFit/BFset_CAXA_allQCed --sst_file=/home2/ldgbsr/UA_PRS_project/Analysis/PRScsx/UKB_UA.001.sumstats,/home2/ldgbsr/UA_PRS_project/Analysis/PRScsx/JPB2021_UA.001.sumstats --pop=EUR,EAS --n_gwas=437354,129405 --phi=1e-2 --chrom=1 --out_dir=/home2/ldgbsr/UA_PRS_project/Analysis/PRScsx/ --out_name=eas_eur_ua_beta

Output: --ref_dir=/data6/PRScsx_data/PRScsx-tmp/1000G --bim_prefix=/home2/ldgbsr/UA_PRS_project/Analysis/BestFit/BFset_CAXA_allQCed --sst_file=['/home2/ldgbsr/UA_PRS_project/Analysis/PRScsx/UKB_UA.001.sumstats', '/home2/ldgbsr/UA_PRS_project/Analysis/PRScsx/JPB2021_UA.001.sumstats'] --a=1 --b=0.5 --phi=0.01 --n_gwas=[437354, 129405] --pop=['EUR', 'EAS'] --n_iter=2000 --n_burnin=1000 --thin=5 --out_dir=/home2/ldgbsr/UA_PRS_project/Analysis/PRScsx/ --out_name=eas_eur_ua_beta --chrom=['1'] --meta=FALSE --seed=None

2 discovery populations detected

process chromosome 1

... parse bim file: /home2/ldgbsr/UA_PRS_project/Analysis/BestFit/BFset_CAXA_allQCed.bim ... ... 318938 SNPs on chromosome 1 read from /home2/ldgbsr/UA_PRS_project/Analysis/BestFit/BFset_CAXA_allQCed.bim ... Traceback (most recent call last): File "/home2/ldgbsr/UA_PRS_project/Code/Tools/PRScsx/PRScsx.py", line 154, in main() File "/home2/ldgbsr/UA_PRS_project/Code/Tools/PRScsx/PRScsx.py", line 137, in main sst_dict[pp] = parse_genet.parse_sumstats(ref_dict, vld_dict, param_dict['sst_file'][pp], param_dict['pop'][pp], param_dict['n_gwas'][pp]) ^^^^^^^^ UnboundLocalError: cannot access local variable 'ref_dict' where it is not associated with a value

getian107 commented 1 year ago

Hi- Do you mind showing a list of the files under /data6/PRScsx_data/PRScsx-tmp/1000G

nini-tech23 commented 1 year ago

1000G reference directory contains the following directories: ldblk_1kg_afr ldblk_1kg_amr ldblk_1kg_eas ldblk_1kg_eur ldblk_1kg_sas Then, each directories is consisted of following files: ldblk_1kg_chr10.hdf5 ldblk_1kg_chr16.hdf5 ldblk_1kg_chr21.hdf5 ldblk_1kg_chr6.hdf5 ldblk_1kg_chr11.hdf5 ldblk_1kg_chr17.hdf5 ldblk_1kg_chr22.hdf5 ldblk_1kg_chr7.hdf5 ldblk_1kg_chr12.hdf5 ldblk_1kg_chr18.hdf5 ldblk_1kg_chr2.hdf5 ldblk_1kg_chr8.hdf5 ldblk_1kg_chr13.hdf5 ldblk_1kg_chr19.hdf5 ldblk_1kg_chr3.hdf5 ldblk_1kg_chr9.hdf5 ldblk_1kg_chr14.hdf5 ldblk_1kg_chr1.hdf5 ldblk_1kg_chr4.hdf5 snpinfo_1kg_hm3 ldblk_1kg_chr15.hdf5 ldblk_1kg_chr20.hdf5 ldblk_1kg_chr5.hdf5

getian107 commented 1 year ago

I think you need to download the SNP information file for PRS-CSx and put it in the 1000G directory.

nini-tech23 commented 1 year ago

Thank you for the solution, @getian107. It solved my error. I was using the file named snpinfo_1kg_hm3 instead of snpinfo_mult_1kg_hm3, so it seems like caused the error. Also, I am curious about whether I need to use only SNPs in sumstats that are higher than certain MAF and INFO thresholds. If so, what are MAF and INFO threshold requirements?

getian107 commented 1 year ago

The reference panels have filtered to HapMap3 variants with MAF >1% (and INFO >0.8 for the UK Biobank reference). You can further apply MAF and INFO filtering in your target dataset follow standard GWAS QC procedures. Filtering of SNPs in the summary statistics is usually unnecessary.