choishingwan / PRSice

A software package for calculating, applying, evaluating and plotting the results of polygenic risk scores
http://prsice.info
GNU General Public License v3.0
182 stars 86 forks source link

Analysis of the extreme PRS (top 1%, low 1%) #195

Closed giuliapontali closed 4 years ago

giuliapontali commented 4 years ago

I have a question regarding results on PRS using PRSice. I have a binary trait (9730 controls and 156 cases). For the analysis, I took the top 1% of the samples, who have a high risk to develop the disease. It is interesting to see that in the top 1% I have only one participant is affected by the disease but if I analyze the low 1% I see that I have 4 participants that have the disease. Dividing PRS into quantile the majority of affected samples are in the first quantile but I expected that those will be in the tenth quantile. Why does it happen?

Thanks for your time

choishingwan commented 4 years ago

Try this one

giuliapontali commented 4 years ago

Sorry but I can not see anything

choishingwan commented 4 years ago

Try this

https://www.dropbox.com/s/kklgepakvlz5m3n/PRSice.R?dl=0

giuliapontali commented 4 years ago

Using this PRSice.R and the last version of PRSice I have the following output:

Rscript PRSice_updated.R --prsice ./PRSice_linux --A1 A1 --A2 A2 --base /home/atrial_fibrillation/af.cleaning.txt --binary-target T --ignore-fid --or --pheno /home/EUR_9919orderFam.af.tab.txt --pheno-col AF --nonfounders --pvalue Pvalue --snp MarkerName --stat OR --target /home/HRC_imputed/filtered --cov /home/atrial_fibrillation/Cov_9919orderFam.af.txt --quantile 10 --out af.updated
PRSice 2.3.0.b (2020-05-19) 
https://github.com/choishingwan/PRSice
(C) 2016-2020 Shing Wan (Sam) Choi and Paul F. O'Reilly
GNU General Public License v3
If you use PRSice in any published work, please cite:
Choi SW, O'Reilly PF.
PRSice-2: Polygenic Risk Score Software for Biobank-Scale Data.
GigaScience 8, no. 7 (July 1, 2019)
2020-05-20 11:49:09
./PRSice_linux \
    --A1 A1 \
    --A2 A2 \
    --bar-levels 0.001,0.05,0.1,0.2,0.3,0.4,0.5,1 \
    --base /home/atrial_fibrillation/af.cleaning.txt \
    --binary-target T \
    --clump-kb 250kb \
    --clump-p 1.000000 \
    --clump-r2 0.100000 \
    --cov /home/atrial_fibrillation/Cov_9919orderFam.af.txt \
    --ignore-fid  \
    --interval 5e-05 \
    --lower 5e-08 \
    --nonfounders  \
    --num-auto 22 \
    --or  \
    --out af.updated \
    --pheno /home/EUR_9919orderFam.af.tab.txt \
    --pheno-col AF \
    --pvalue Pvalue \
    --seed 3948222806 \
    --snp MarkerName \
    --stat OR \
    --target /home/HRC_imputed/filtered \
    --thread 1 \
    --upper 0.5

Initializing Genotype file: 
/home/HRC_imputed/filtered (bed) 

Start processing af.cleaning 
================================================== 

Base file: 
/home/gpontali/atrial_fibrillation/af.cleaning.txt 
Header of file is: 

MarkerName  rs_dbSNP147 CHR POS_GRCh37  A1  A2  Freq_A2 Effect_A2   StdErr  Pvalue  OR 

Reading 100.00%
29475783 variant(s) observed in base file, with: 
713528 variant(s) located on haploid chromosome 
14136 NA stat/p-value observed 
5354 ambiguous variant(s) excluded 
28742765 total variant(s) included from base file 

Loading Genotype info from target 
================================================== 

9919 people (0 male(s), 0 female(s)) observed 
9919 founder(s) included 

3619959 variant(s) not found in previous data 
1862828 variant(s) included 

Phenotype file: /home/EUR_9919orderFam.af.tab.txt 
Column Name of Sample ID: IID 
Note: If the phenotype file does not contain a header, the 
column name will be displayed as the Sample ID which is 
expected. 

There are a total of 1 phenotype to process 

Start performing clumping 

Clumping Progress: 100.00%
Number of variant(s) after clumping : 527406 

Processing the 1 th phenotype 

AF is a binary phenotype 
9730 control(s) 
156 case(s) 

Processing the covariate file: 
/home/atrial_fibrillation/Cov_9919orderFam.af.txt 
============================== 

Include Covariates: 
Name    Missing Number of levels 
AGE 0   - 
SEX 0   - 

After reading the covariate file, 9854 sample(s) included 
in the analysis 

Start Processing
Processing 100.00%
There are 1 region(s) with p-value less than 1e-5. Please 
note that these results are inflated due to the overfitting 
inherent in finding the best-fit PRS (but it's still best 
to find the best-fit PRS!). 
You can use the --perm option (see manual) to calculate an 
empirical P-value. 

Begin plotting
Current Rscript version = 2.3.0
Plotting the quantile plot
WARNING: There are only 0 unique PRS but asked for 10 quantiles
Will not generate the quantile plot for  af.updated
Plotting Bar Plot
Plotting the high resolution plot
Warning messages:
1: In fread(paste0(prefix, ".best"), data.table = F, colClasses = c(FID = "character",  :
  Detected 4 column names but the data has 3 columns. Filling rows automatically. Set fill=TRUE explicitly to avoid this warning.
2: In max(pheno$Pheno) : no non-missing arguments to max; returning -Inf
choishingwan commented 4 years ago

I have rewrote the Rscript and the executable. Could you please try with 2.3.1 and then re-run it. If the same error occurs, can you check if your .best file uses space or tab as delimiter? PRSice should only uses space as delimiter for the best file due to some essential algorithms

giuliapontali commented 4 years ago

Using this version I have the same error. The stranger thing is that I don't have FID column but it is created and everything is swapped, in this way in the PRS column I have all NA (I have space as delimiter):

Rscript PRSice.R --prsice ./PRSice_linux --A1 A1 --A2 A2 --base /home/atrial_fibrillation/af.cleaning.txt --binary-target T --ignore-fid --or --pheno /home/atrial_fibrillation/EUR_9919orderFam.af.txt --pheno-col AF --nonfounders --pvalue Pvalue --snp MarkerName --stat OR --target /home/chris10k_HRC_imputed/filtered --cov /home/atrial_fibrillation/Cov_9919orderFam.af.txt --quantile 10 --out af.help

.......

[Begin plotting
Current Rscript version = 2.3.1
            FID IID In_Regression PRS
1    00121 Yes    0.07330624  NA
2    00126 Yes    0.07112688  NA
3    00124 Yes    0.06475048  NA

WARNING: There are only 0 unique PRS but asked for 10 quantiles
Will not generate the quantile plot for  af.a
Plotting Bar Plot
Plotting the high resolution plot
Warning messages:
1: In fread(paste0(prefix, ".best"), data.table = F, colClasses = c(FID = "character",  :
  Detected 4 column names but the data has 3 columns. Filling rows automatically. Set fill=TRUE explicitly to avoid this warning.
2: In max(pheno$Pheno) : no non-missing arguments to max; returning -Inf
choishingwan commented 4 years ago

Is your executable also 2.3.1?

The FID is obtained either from the fam file, or duplicates of the IID

Again, could you please include the full log if you are using 2.3.1 for both the Rscript and the executable?

On Mon, May 25, 2020 at 2:22 PM jPontix notifications@github.com wrote:

Using this version I have the same error. The stranger thing is that I don't have FID column but it is created and everything is swapped, in this way in the PRS column I have all NA (I have space as delimiter):

Rscript PRSice.R --prsice ./PRSice_linux --A1 A1 --A2 A2 --base /home/atrial_fibrillation/af.cleaning.txt --binary-target T --ignore-fid --or --pheno /home/atrial_fibrillation/EUR_9919orderFam.af.txt --pheno-col AF --nonfounders --pvalue Pvalue --snp MarkerName --stat OR --target /home/chris10k_HRC_imputed/filtered --cov /home/atrial_fibrillation/Cov_9919orderFam.af.txt --quantile 10 --out af.help

.......

[Begin plotting Current Rscript version = 2.3.1 FID IID In_Regression PRS 1 00121 Yes 0.07330624 NA 2 00126 Yes 0.07112688 NA 3 00124 Yes 0.06475048 NA

WARNING: There are only 0 unique PRS but asked for 10 quantiles Will not generate the quantile plot for af.a Plotting Bar Plot Plotting the high resolution plot Warning messages: 1: In fread(paste0(prefix, ".best"), data.table = F, colClasses = c(FID = "character", : Detected 4 column names but the data has 3 columns. Filling rows automatically. Set fill=TRUE explicitly to avoid this warning. 2: In max(pheno$Pheno) : no non-missing arguments to max; returning -Inf

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/choishingwan/PRSice/issues/195#issuecomment-633399400, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJTRYVHDGHKY4H7NQWNMRDRTIFAXANCNFSM4NAOBPFA .

giuliapontali commented 4 years ago
PRSice 2.3.1 (2020-05-23) 
https://github.com/choishingwan/PRSice
(C) 2016-2020 Shing Wan (Sam) Choi and Paul F. O'Reilly
GNU General Public License v3
If you use PRSice in any published work, please cite:
Choi SW, O'Reilly PF.
PRSice-2: Polygenic Risk Score Software for Biobank-Scale Data.
GigaScience 8, no. 7 (July 1, 2019)
2020-05-24 22:16:01
./PRSice_linux \
    --A1 A1 \
    --A2 A2 \
    --a1 A1 \
    --a2 A2 \
    --bar-levels 0.001,0.05,0.1,0.2,0.3,0.4,0.5,1 \
    --base /home/atrial_fibrillation/af.cleaning.txt \
    --binary-target T \
    --chr CHR \
    --clump-kb 250kb \
    --clump-p 1.000000 \
    --clump-r2 0.100000 \
    --cov /home/atrial_fibrillation/Cov_9919orderFam.af.txt \
    --ignore-fid  \
    --interval 5e-05 \
    --lower 5e-08 \
    --nonfounders  \
    --num-auto 22 \
    --or  \
    --out af.a \
    --pheno /home/atrial_fibrillation/EUR_9919orderFam.af.txt \
    --pheno-col AF \
    --pvalue Pvalue \
    --seed 2963111790 \
    --snp MarkerName \
    --stat OR \
    --target /home/HRC_imputed/filtered \
    --thread 1 \
    --upper 0.5

Initializing Genotype file: 
/home/HRC_imputed/filtered (bed) 

Start processing af.cleaning 
================================================== 

Base file: 
/home/atrial_fibrillation/af.cleaning.txt 
Header of file is: 

MarkerName  rs_dbSNP147 CHR POS_GRCh37  A1  A2  Freq_A2 Effect_A2   StdErr  Pvalue  OR 

29475783 variant(s) observed in base file, with: 
713528 variant(s) located on haploid chromosome 
14136 NA stat/p-value observed 
5354 ambiguous variant(s) excluded 
28742765 total variant(s) included from base file 

Loading Genotype info from target 
================================================== 

9919 people (0 male(s), 0 female(s)) observed 
9919 founder(s) included 

3619959 variant(s) not found in previous data 
1862828 variant(s) included 

Phenotype file: 
/home/atrial_fibrillation/EUR_9919orderFam.af.txt 
Column Name of Sample ID: IID 
Note: If the phenotype file does not contain a header, the 
column name will be displayed as the Sample ID which is 
expected. 

There are a total of 1 phenotype to process 

Start performing clumping 

Number of variant(s) after clumping : 527406 

Processing the 1 th phenotype 

AF is a binary phenotype 
9730 control(s) 
156 case(s) 

Processing the covariate file: 
/home/atrial_fibrillation/Cov_9919orderFam.af.txt 
============================== 

Include Covariates: 
Name    Missing Number of levels 
AGE 0   - 
SEX 0   - 

After reading the covariate file, 9854 sample(s) included 
in the analysis 

There are 1 region(s) with p-value less than 1e-5. Please 
note that these results are inflated due to the overfitting 
inherent in finding the best-fit PRS (but it's still best 
to find the best-fit PRS!). 
You can use the --perm option (see manual) to calculate an 
empirical P-value. 
choishingwan commented 4 years ago

Could you please send me the first ten lines of your fam and phenotype file (preferably including cases and control and have all samples matched) in an attachment to me? You can mask the same ID. Maybe there are some strange issue with your file.

giuliapontali commented 4 years ago

here .fam file:

00321 00321 0 0 0 -9
00296 00296 0 0 0 -9
00771 00771 0 0 0 -9
00134 00134 0 0 0 -9
00950 00950 0 0 0 -9
00006 00006 0 0 0 -9
00175 00175 0 0 0 -9
00403 00403 0 0 0 -9
00245 00245 0 0 0 -9
00123 00123 0 0 0 -9

here the pheno_file:

IID AF
00321 0
00296 0
00771 0
00134 0
00950 0
00006 0
00175 0
00403 0
00245 0
00123 1
choishingwan commented 4 years ago

Can you send it out as an attachment? That way I can make sure that any special characters or other abnormality can be kept

choishingwan commented 4 years ago

Thanks. Have now updated 2.3.1.b, which hopefully should solve your problem

On Tue, May 26, 2020 at 3:39 PM jPontix notifications@github.com wrote:

https://www.dropbox.com/sh/xs2ekt3o740klgm/AAA5jKasstTpo__k4abHuogqa?dl=0

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/choishingwan/PRSice/issues/195#issuecomment-633862800, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJTRYTIMUUTV6D6N5BWN23RTNW3JANCNFSM4NAOBPFA .

giuliapontali commented 4 years ago

I have the same error as before :|

choishingwan commented 4 years ago

If you are using the latest copy, then I am really stomp. Coz using the attached data you sent me, I was able to run the software without encountering any error message and as of now, I have no idea what else we can do to figure out what's going on.

So at the moment, you are still seeing:

Warning messages: 1: In fread(paste0(prefix, ".best"), data.table = F, colClasses = c(FID = "character", : Detected 4 column names but the data has 3 columns. Filling rows automatically. Set fill=TRUE explicitly to avoid this warning. 2: In max(pheno$Pheno) : no non-missing arguments to max; returning -Inf

Right?

giuliapontali commented 4 years ago

Yes exactly. Also with the version 2.3.1.b

Begin plotting
Current Rscript version = 2.3.1.b
Plotting the quantile plot
WARNING: There are only 0 unique PRS but asked for 10 quantiles
Will not generate the quantile plot for  af.af
Plotting Bar Plot
Plotting the high resolution plot
Warning messages:
1: In fread(paste0(prefix, ".best"), data.table = F, colClasses = c(FID = "character",  :
  Detected 4 column names but the data has 3 columns. Filling rows automatically. Set fill=TRUE explicitly to avoid this warning.
2: In max(pheno$Pheno) : no non-missing arguments to max; returning -Inf
choishingwan commented 4 years ago

I'm stomp. Mind sending me the best score file as an attachment? I will see how different form what I've got.

Are you also using 2.3.1.b executable? e.g. what's the output of ./PRSice_linux -v

On Tue, May 26, 2020 at 11:03 PM jPontix notifications@github.com wrote:

Yes exactly. Also with the version 2.3.1.b

Begin plotting Current Rscript version = 2.3.1.b Plotting the quantile plot WARNING: There are only 0 unique PRS but asked for 10 quantiles Will not generate the quantile plot for af.af Plotting Bar Plot Plotting the high resolution plot Warning messages: 1: In fread(paste0(prefix, ".best"), data.table = F, colClasses = c(FID = "character", : Detected 4 column names but the data has 3 columns. Filling rows automatically. Set fill=TRUE explicitly to avoid this warning. 2: In max(pheno$Pheno) : no non-missing arguments to max; returning -Inf

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/choishingwan/PRSice/issues/195#issuecomment-634082276, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJTRYS65FU4G5REF3K5J5LRTPK4HANCNFSM4NAOBPFA .

choishingwan commented 4 years ago

And the full log from both the executable and the Rscript?

choishingwan commented 4 years ago

PRSice 2.3.1.b should always produce best file containing the both the FID and IID. Using the mock data you've previous sent me, I couldn't reproduce the same output. Could you please try running 2.3.1.b again but using a different --out parameter and see if that still generate the same error, esp if the best file contains both FID and IID column?

giuliapontali commented 4 years ago

what does it mean "different --out parameter"?

choishingwan commented 4 years ago

Instead of --out af.af try something like --out testing

just to make sure that we are using the new output

choishingwan commented 4 years ago

Did you get the same error? The link you sent me is empty

choishingwan commented 4 years ago

Unfortunately, with this, I am truly out of option. Unless I have access to your data, I don't think I can figure out what's the problem

On Wed, May 27, 2020 at 6:24 PM jPontix notifications@github.com wrote:

Sorry, now should be work: https://www.dropbox.com/sh/xs2ekt3o740klgm/AAA5jKasstTpo__k4abHuogqa?dl=0

yes, I get the same error

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/choishingwan/PRSice/issues/195#issuecomment-634570305, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJTRYVUHWNSZGGG3ZDR3GDRTTS5JANCNFSM4NAOBPFA .

giuliapontali commented 4 years ago

ok, do you need also the genotyping data?

choishingwan commented 4 years ago

No, I'll only need your phenotype, covariate, fam file for now. It is most likely caused by either of those file (as you are having ID problem).

Do send the files to me to my personal email so that no one else can have access to it (github is public)

choishingwan commented 4 years ago

Think this was fixed in the new version. Will close this for now