choishingwan / PRSice

A software package for calculating, applying, evaluating and plotting the results of polygenic risk scores
http://prsice.info
GNU General Public License v3.0
187 stars 90 forks source link

Version 2.2.10_p-competitive values #158

Closed SaraBandres closed 5 years ago

SaraBandres commented 5 years ago

Hi Sam,

I have tried to run PRSet using the version 2.2.10 on my data and despite it gives me a summary, the p-competitive remains the same across > 2000 pathways. Also, the coefficient seems to be in the range of 400-1000 in a lot of pathways which I think is weird. When running it, it does not allow me to generate a plot. Do you know what could have happened? See below:

Processing 0.04%---- COMMAND EXECUTED: --------------------------------------------------------- ( Rscript /data/LNG/saraB/PRSice_linux_2.2.10/PRSice.R --cov PCA_all_data_concatenated.eigenvec --out /data/LNG/saraB/PATHWAYS/CURATED_2.2.10/CANONICAL/CURATED_SOFTCALLS_CLUMPING_r2_0.1_ALLVARIANTS_version.2.2.10 --target CORRECT_SEVENTY_SOFTCALLS_PD_unfiltered_no_duplicates_geno0.15_MAF0.05 -b nochr_METAANALYSIS_PdGeneAnd23AndMe1_Chang2017.tbl --beta --snp MarkerName --A1 Allele1 --A2 Allele2 --stat Effect --pvalue P-value --ld /data/LNG/pdMeta5v2/cojoRef/binaryForCojo --print-snp --score std --set-perm 1000 --prsice /data/LNG/saraB/PRSice_linux_2.2.10/PRSice_linux -n 48 --binary-target T --maf 1E-2 --quantile 4 --prevalence 0.005 --gtf /data/LNG/ALS_PATHWAYS/Homo_sapiens.GRCh37.75.gtf --msigdb /data/LNG/saraB/LISTS/c2.cp.v7.0.symbols.gmt --multi-plot 10 --full-back )

[bandrescigas@biowulf PATHWAYS]$ cat swarm_38421983_0* [+] Loading gcc 7.3.0 ... [+] Loading GSL 2.4 for GCC 7.2.0 ... [-] Unloading gcc 7.3.0 ... [+] Loading gcc 7.3.0 ... [+] Loading openmpi 3.0.2 for GCC 7.3.0 [+] Loading ImageMagick 7.0.8 on cn4250 [+] Loading HDF5 1.10.4 [+] Loading pandoc 2.1.1 on cn4250 [+] Loading R 3.6.0 [+] Loading plink 1.9.0-beta4.4 on cn4250 PRSice 2.2.10 (2019-10-08) https://github.com/choishingwan/PRSice (C) 2016-2019 Shing Wan (Sam) Choi and Paul F. O'Reilly GNU General Public License v3 If you use PRSice in any published work, please cite: Choi SW, O'Reilly PF. PRSice-2: Polygenic Risk Score Software for Biobank-Scale Data. GigaScience 8, no. 7 (July 1, 2019) 2019-10-08 18:56:02 /data/LNG/saraB/PRSice_linux_2.2.10/PRSice_linux \ --A1 Allele1 \ --A2 Allele2 \ --bar-levels 1 \ --base nochr_METAANALYSIS_PdGeneAnd23AndMe1_Chang2017.tbl \ --beta \ --binary-target T \ --clump-kb 1000 \ --clump-p 1.000000 \ --clump-r2 0.100000 \ --cov PCA_all_data_concatenated.eigenvec \ --fastscore \ --feature exon,gene,protein_coding,CDS \ --gtf /data/LNG/ALS_PATHWAYS/Homo_sapiens.GRCh37.75.gtf \ --ld /data/LNG/pdMeta5v2/cojoRef/binaryForCojo \ --maf 0.01 \ --msigdb /data/LNG/saraB/LISTS/c2.cp.v7.0.symbols.gmt \ --out /data/LNG/saraB/PATHWAYS/CURATED_2.2.10/CANONICAL/CURATED_SOFTCALLS_CLUMPING_r2_0.1_ALLVARIANTS_version.2.2.10 \ --prevalence 0.005 \ --print-snp \ --pvalue P-value \ --score std \ --seed 3317001963 \ --set-perm 1000 \ --snp MarkerName \ --stat Effect \ --target CORRECT_SEVENTY_SOFTCALLS_PD_unfiltered_no_duplicates_geno0.15_MAF0.05 \ --thread 48

Initializing Genotype file: CORRECT_SEVENTY_SOFTCALLS_PD_unfiltered_no_duplicates_geno0.15_MAF0.05 (bed)

Start processing nochr_METAANALYSIS_PdGeneAnd23AndMe1_Chang2017 ==================================================

Reading 100.00% Base file: nochr_METAANALYSIS_PdGeneAnd23AndMe1_Chang2017.tbl 7909452 variant(s) observed in base file, with: 1173297 ambiguous variant(s) excluded 6736155 total variant(s) included from base file

Loading Genotype info from target ==================================================

8512 people (4671 male(s), 3841 female(s)) observed 8512 founder(s) included

437580 variant(s) not found in previous data 7 ambiguous variant(s) excluded 1913884 variant(s) included

Start processing reference

Initializing Genotype file: /data/LNG/pdMeta5v2/cojoRef/binaryForCojo (bed)

Loading Genotype info from reference ==================================================

40063 people (22034 male(s), 18029 female(s)) observed 40063 founder(s) included

9613246 variant(s) not found in previous data 1834549 variant(s) included

Calculate MAF and perform filtering on target SNPs ==================================================

Calculating allele frequencies: 100.00% 1834549 variant(s) included

Start processing gene set information ==================================================

Loading /data/LNG/saraB/LISTS/c2.cp.v7.0.symbols.gmt (MSigDB)

Loading GTF file: /data/LNG/ALS_PATHWAYS/Homo_sapiens.GRCh37.75.gtf

A total of 666123 entries removed due to feature selection A total of 251346 entries removed as they are not on autosomal chromosome

A total of 2199 regions plus the base region are included

There are a total of 1 phenotype to process

Start performing clumping

1547809 MB RAM detected; reserving 82 MB for clumping

Allocated 82 MB successfully

Clumping Progress: 100.00%

Number of variant(s) after clumping : 61037

Processing the 1 th phenotype Phenotype is a binary phenotype 3595 control(s) 4917 case(s)

Processing the covariate file: PCA_all_data_concatenated.eigenvec ==============================

Include Covariates: Name Missing Number of levels -0.000960801 0 - -0.00551651 0 - -0.00268669 0 - 0.00567819 0 - 0.00229551 0 - 0.00597221 0 - 0.00446047 0 - -0.000665653 0 - 0.000953485 0 - 0.00213915 0 - 0.00296924 0 - 0.000708186 0 - 0.00174007 0 - 0.0035003 0 - -0.00349865 0 - -0.00281303 0 - -0.00321233 0 - -0.00340139 0 - -0.00415554 0 - -0.000502068 0 -

After reading the covariate file, 8512 sample(s) included in the analysis

Preparing Output Files

Start Processing Processing 0.09%Start competitive permutation

Warning: To speed up the permutation, linear regression instead of logistic regression were performed within the permutation, and constructs the null distribution using the absolute z-scores. This is based on the assumption that linear Regression & logistic regression should produce similar absolute z-scores. In addition, the regression equation changed from Phenotype~PRS+Covariates to PRS~Phenotype+Covariate. This two equations should generate simliar z-score for the independent variable and will allow us to perform some optimizations to speed up the permutation

Running permutation with 48 threads

Processing 100.00% There are 1355 region(s)/phenotype(s) with p-value > 0.1 (not significant);751 region(s) with p-value between 0.1 and 1e-5 (may not be significant); and 93 region(s) with p-value less than 1e-5.

Error in names(x) <- value : 'names' attribute [3] must be the same length as the vector [2] Calls: process_plot -> colnames<- Execution halted

SaraBandres commented 5 years ago

Here's a screeshot of the results. The top pathways with better self-associated p-value make totally sense to me and seem correct. I think the problem is in the p-competitive value and the coefficient. Thanks a lot for your help.

https://drive.google.com/file/d/1T0wBBsggU0iB7tNB3VY2DBwdhdFJkGjP/view?usp=sharing

choishingwan commented 5 years ago

Thank you for the info. Just found out that due to some mistakes, PRSet and PRSice thought that the number of permutation to perform = 0, therefore all the results are highly significant (and that might also be why you can get the result so quickly). I have now updated PRSice to 2.2.10.b, which should have the permutation fixed.

https://github.com/choishingwan/PRSice/releases/tag/2.2.10.b

As for the coefficient and standard error, it is actually rather normal to get a huge coefficient and standard error. The interpretation of the coefficient = how much will the phenotype increase when there is one unit increase in PRS. As the PRS are so tiny, we do expect a huge change in the phenotype if PRS are increased by one, so that’s not at all surprising to get a huge coefficient / standard error.

Do let me know if you found any other problems.

Sam

SaraBandres commented 5 years ago

Thanks a lot Sam. Yes it is now working fine. Just a last question. I have run the same analyses for age at onset with the 2.2.10.b version, and although it works fine, I do not get a competitive p-value but an empirical p-value instead. Is that what I expect to see for continuous variables? This is the code that I am running.

Rscript /data/LNG/saraB/PRSice_linux_2.2.10b/PRSice.R --cov-file PCA_AAO --out /data/LNG/saraB/PATHWAYS/CURATED_2.2.10b/covs_corrected_nobarlevels/CURATED_AGEATONSET_CLUMPING_r2_0.1_version.2.2.10 -t SEVENTY_CASES_AAO_SPAIN_ALL -b nochr_AAO_GWAS_META_NO_SPAIN31.tbl --beta --pheno-file PHENOforPaths.txt --snp MarkerName --A1 Allele1 --A2 Allele2 --stat Effect --pvalue P-value --ld /data/LNG/pdMeta5v2/cojoRef/binaryForCojo --print-snp --score std --perm 1000 --prsice /data/LNG/saraB/PRSice_linux_2.2.10b/PRSice_linux -n 48 --maf 1E-2 --fastscore --gtf /data/LNG/ALS_PATHWAYS/Homo_sapiens.GRCh37.75.gtf --msigdb /data/LNG/saraB/LISTS/c2.cp.v7.0.symbols.gmt --multi-plot 10 --enable-mmap --full-back

choishingwan commented 5 years ago

That’s because you used --perm instead of --set-perm

SaraBandres commented 5 years ago

great, thanks!