getian107 / PRScsx

Cross-population polygenic prediction
MIT License
65 stars 20 forks source link

PRS-CSx

PRS-CSx is a Python based command line tool that integrates GWAS summary statistics and external LD reference panels from multiple populations to improve cross-population polygenic prediction. Posterior SNP effect sizes are inferred under coupled continuous shrinkage (CS) priors across populations.

Version History

May 14, 2024: Replaced some scipy functions with numpy due to changes in the latest scipy version.

🔴 Aug 10, 2023: Added BETA/OR + SE as a new input format (see the format of GWAS summary statistics below), which is now the recommended input data. When using BETA/OR + P as the input, p-values smaller than 1e-323 are truncated, which may reduce prediction accuracy for traits that have highly significant loci.

July 29, 2021: Changed default MCMC parameters.

Jun 4, 2021: Expanded reference panels to five populations.

May 26, 2021: Added suggestions for limiting the number of threads in scipy when running PRS-CS (see Computational Efficiency section below).

Apr 6, 2021: Added projection of the LD matrix to its nearest non-negative definite matrix.

Mar 4, 2021: LD reference panels constructed using the UK Biobank data are now available.

Jan 4, 2021: Improved the accuracy and robustness of random sampling from the generalized inverse Gaussian distribution. Prediction accuracy will probably slightly improve over the initial release.

Dec 20, 2020: Repository made public.

Getting Started

Using PRS-CSx

python PRScsx.py --ref_dir=PATH_TO_REFERENCE --bim_prefix=VALIDATION_BIM_PREFIX --sst_file=SUM_STATS_FILE --n_gwas=GWAS_SAMPLE_SIZE --pop=POPULATION --out_dir=OUTPUT_DIR --out_name=OUTPUT_FILE_PREFIX [--a=PARAM_A --b=PARAM_B --phi=PARAM_PHI --n_iter=MCMC_ITERATIONS --n_burnin=MCMC_BURNIN --thin=MCMC_THINNING_FACTOR --chrom=CHROM --meta=META_FLAG --seed=SEED]

    SNP          A1   A2   BETA      SE
    rs4970383    C    A    -0.0064   0.0090
    rs4475691    C    T    -0.0145   0.0094
    rs13302982   A    G    -0.0232   0.0199
    ...

Or:

    SNP          A1   A2   OR        SE
    rs4970383    A    C    0.9825    0.0314                 
    rs4475691    T    C    0.9436    0.0319
    rs13302982   A    G    1.1337    0.0543
    ...

where SNP is the rs ID, A1 is the effect allele, A2 is the alternative allele, BETA/OR is the effect/odds ratio of the A1 allele, SE is the standard error of the effect. Note that when OR is used, SE corresponds to the standard error of logOR.

When using BETA/OR + P as the input, the file must have the following format (including the header line):

    SNP          A1   A2   BETA      P
    rs4970383    C    A    -0.0064   0.4778
    rs4475691    C    T    -0.0145   0.1245
    rs13302982   A    G    -0.0232   0.2429
    ...

Or:

    SNP          A1   A2   OR        P
    rs4970383    A    C    0.9825    0.5737                 
    rs4475691    T    C    0.9436    0.0691
    rs13302982   A    G    1.1337    0.0209
    ...

where SNP is the rs ID, A1 is the effect allele, A2 is the alternative allele, BETA/OR is the effect/odds ratio of the A1 allele, P is the p-value of the effect. Here, a standardized effect size is calculated using the p-value while BETA/OR is only used to determine the direction of an association. Therefore if z-scores or even +1/-1 indicating effect directions are presented in the BETA column, the algorithm should still work properly.

Output

For each input GWAS, PRS-CSx writes posterior SNP effect size estimates for each chromosome to the user-specified directory. The output file contains chromosome, rs ID, base position, A1, A2 and posterior effect size estimate for each SNP. If --meta=True, meta-analyzed posterior effect sizes will also be written to the output directory. An individual-level polygenic score can be produced by concatenating output files from all chromosomes and then using PLINK's --score command (https://www.cog-genomics.org/plink/1.9/score). If polygenic scores are generated by chromosome, use the 'sum' modifier so that they can be combined into a genome-wide score.

In general, there are often two approaches to use the output of PRS-CSx. Given a global shrinkage parameter, the first approach calculates one polygenic score for each discovery population using population-specific posterior SNP effect size estimates and learns a linear combination of the polygenic scores that most accurately predicts the trait in the validation dataset. The optimal global shrinkage parameter and linear combination weights are then taken to an independent dataset, where the predictive performance of the final PRS can be assessed. We recommend standardizing the polygenic scores (i.e., converting the scores to zero mean and unit variance) in both validation and testing datasets before learning/applying the linear combination. Alternatively, the 'auto' and 'meta' version of the PRS-CSx algorithm can be used, which does not require a validation dataset to tune hyper-parameters. This approach may be less accurate compared to the linear combination approach (where the final PRS is optimized in a specific population) but can be useful when a validation dataset is not available (e.g., due to limited total sample size in the target dataset).

Computational Efficiency

PRS-CSx relies on scipy packages, which automatically use all available cores on a compute node. This can be problematic when running PRS-CSx on a compute cluster; PRS-CSx jobs may interfere with other jobs running on the same node, reducing computational efficiency. To resolve this issue, including the following code in the script to specify the number of threads in scipy:

export MKL_NUM_THREADS=$N_THREADS
export NUMEXPR_NUM_THREADS=$N_THREADS
export OMP_NUM_THREADS=$N_THREADS

For example, to use a single thread for the computation, set N_THREADS=1.

Test Data

The test data contains EUR and EAS GWAS summary statistics and a bim file for 1,000 SNPs on chromosome 22. An example to use the test data:

python PRScsx.py --ref_dir=path_to_ref --bim_prefix=path_to_bim/test --sst_file=path_to_sumstats/EUR_sumstats.txt,path_to_sumstats/EAS_sumstats.txt --n_gwas=200000,100000 --pop=EUR,EAS --chrom=22 --phi=1e-2 --out_dir=path_to_output --out_name=test

The test data analysis would be finished in approximately 1 min when using 8Gb of RAM.

Data release

Posterior SNP effect size estimates for traits examined in the PRS-CSx publication are available here.

Posterior SNP effect size estimates for the trans-ancestry type 2 diabetes PRS developed and validated in the Genome Medicine publication are available here.

Support

Please direct any problems or questions to Tian Ge (tge1@mgh.harvard.edu).