bvilhjal / ldpred

MIT License
96 stars 57 forks source link

Diseased samples got lower PRS than healthy samples #120

Closed Qinqiword closed 4 years ago

Qinqiword commented 4 years ago

Hi bvilhjal,

I am using LDpred to calculate PRS for a disease, recently. The ss_file is from one published GWAS study (containing 191764 individuals and 8441806 SNPs), and the LD-reference file and the validation file are the same genotype file, containing 2638 individuals (1542 case samples and 1096 control samples) and 18058448 SNPs, ss_file and gf have the same ancestry.

Using these files as input, I got the PRS using three models: ldpred_inf, ldpred and P+T. But when I checked the PRS of the three runs, I found that in the top 10% samples with the highest PRS, only 30% individuals have that disease, but in the bottom 10% samples with the lowest PRS, 78% individuals have that disease. The p-value of Chi-square test was significant. This is very strange; individuals with higher PRS are more likely to be healthy, but individuals with a lower PRS are more likely to have that disease.

In another try, I selected 88 SNPs from the ss_file using p-value < 5×10-8 to calculate PRS in the same population by the traditional formula: PRS = b1x1+b2x2+ ....+bkxk+bnxn. In the 88_PRS, among the top 10% samples with the highest PRS, 72% were patients, and 28% were healthy. That is expected.

It is weird that this two result were converse. I checked the input files and code but failed to find the reasons. One thing I suspected is the following lines in the log file. “PRS correlation” and “The slope for predictions with weighted effects” are both negative.

Current PRS correlation: -0.2480
Current PRS r2: 0.0615
PRS correlation: -0.2480
Variance explained (Pearson R2) by PRS: 0.0615
The slope for predictions with weighted effects is: -0.0030

Here are my questions:

  1. Is it normal that diseased samples have lower PRS? you know, the same disease and same ancestral group.
  2. Why the two results were opposite, the one calculated by LDpred and the one by manual calculation? Even though LDpred takes LD information into consideration, the two results were too different.
  3. What does it mean I got negative correlation and negative slopes for the prediction?

Here are my scripts and log.file:

date
echo '################################step 1: coordinate genotype and summary statistics datasets##########################'

pythonLDpred.py --debug coord --gf=chr_all_2638  --ssf=BBJ_chr_all_input.txt --ssf-format=STANDARD  --beta  --N=191764 --out=chr_all_2638.coord.hdf5

echo '###########################step 2: calculate the snp weight. ##########LDpred.py##############################'

echo '########################### ldpred_inf #####################'

pythonLDpred.py --debug inf --cf=chr_all_2638.coord.hdf5  --ldr=2600   --ldf=chr_all_2638  --N=191764  --out=chr_all_2638

echo '########################## ldpred_gibbs ####################'
pythonLDpred.py --debug gibbs --cf=chr_all_2638.coord.hdf5   --ldr=2600   --ldf=chr_all_2638 --N=191764 --out=chr_all_2638

echo '####################### P+T #######################'
echo '####P+T r2=0.2####'
pythonLDpred.py --debug p+t --cf=chr_all_2638.coord.hdf5  --ldr=2600  --r2=0.2  --out=chr_all_2638
echo '####P+T r2=0.4####'
pythonLDpred.py --debug p+t --cf=chr_all_2638.coord.hdf5  --ldr=2600  --r2=0.4  --out=chr_all_2638

echo '####P+T r2=0.6####'
pythonLDpred.py --debug p+t --cf=chr_all_2638.coord.hdf5  --ldr=2600  --r2=0.6  --out=chr_all_2638

echo '####P+T r2=0.8####'
pythonLDpred.py --debug p+t --cf=chr_all_2638.coord.hdf5  --ldr=2600  --r2=0.8  --out=chr_all_2638

echo '############################step 3: calculate polygenic risk score.#############################'

echo '##### PRS for Any score######'

pythonLDpred.py --debug score --gf=chr_all_2638  --rf=chr_all_2638  --out=chr_all_2638_PRS

echo '##### PRS for  P+T r2=0.4######'
pythonLDpred.py --debug score --gf=chr_all_2638  --rf=chr_all_2638 --rf-format=P+T --r2=0.4  --out=chr_all_2638_r20.4_PRS

echo '##### PRS for  P+T r2=0.6######'
pythonLDpred.py --debug score --gf=chr_all_2638  --rf=chr_all_2638 --rf-format=P+T --r2=0.6  --out=chr_all_2638_r20.6_PRS

echo '##### PRS for  P+T r2=0.8######'
pythonLDpred.py --debug score --gf=chr_all_2638  --rf=chr_all_2638 --rf-format=P+T --r2=0.8  --out=chr_all_2638_r20.8_PRS

date

And the log file.

Any comments or suggestions would be appreciated.

Best wishes, Jing

bvilhjal commented 4 years ago

I thought I had fixed this, but apparently it seems like it could be caused by a a bug (or weird feature) in plinkio. I committed a fix on the master branch (v. 1.0.9). Would you mind verifying that it works?

Best, Bjarni

bvilhjal commented 4 years ago

I'll close this now, as I believe this is fixed.

jackosullivanoxford commented 4 years ago

Hi Bjarni,

I am unfortunately still getting the same inverse of results (higher PRS with lower disease prevalence). Has this been fixed?

With these apparently inverted results, is it appropriate to take the inverse of the PRS (e.g. make the PRS negative to reverse the order), which strikes me as similar to switching the effect and the reference allele and consequently inversing the odds ratio?

Thanks

J

On Mon, Oct 21, 2019 at 2:43 AM Bjarni J. Vilhjalmsson < notifications@github.com> wrote:

I thought I had fixed this, but apparently it seems like it could be caused by a a bug (or weird feature) in plinkio. I committed a fix on the master branch. Would you mind verifying that it works?

Best, Bjarni

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/bvilhjal/ldpred/issues/120?email_source=notifications&email_token=AILQP6NOPQOTISUBCUD26ELQPV2VFA5CNFSM4JB3VCG2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEBZXDNA#issuecomment-544436660, or unsubscribe https://github.com/notifications/unsubscribe-auth/AILQP6MHJCNC3IVS53PDJD3QPV2VFANCNFSM4JB3VCGQ .

bbitarello commented 3 years ago

I think it hasn't been fixed. I am using v.10 and having the same issue