choishingwan / PRSice

A software package for calculating, applying, evaluating and plotting the results of polygenic risk scores
http://prsice.info
GNU General Public License v3.0
182 stars 86 forks source link

question about .snp #152

Closed sarasaezALS closed 4 years ago

sarasaezALS commented 5 years ago

from the file.snp generate I get something like this:

CHR SNP BP P Base Background hits
9 rs41272879 27524980 0.4057 N N Y
9 rs116899991 27524993 0.08619 N N Y
9 rs41272883 27525815 1.50E-01 N N Y
9 rs700782 27526047 1.33E-16 N N Y
9 rs3739527 27526480 0.4062 N N Y

what I can understand is that those snps are not presented in the base file nor in the background. Why have they being taking for the prset calculation (column hits)?

choishingwan commented 5 years ago

(Sorry, usually need to take care of my daughter during weekends)

That's because of clumping

Let's assume there's two gene, Gene A and Gene B, with only gene B located within our target pathway. Let's then assume each of these gene has 1 SNP, SNP A and SNP B respectively, with SNP A being more significant than SNP B and they are in LD.

In PRSet, the Base set is the whole genome set, construct using all SNPs without taking into account of their gene membership; the background set is the genic set by default (you can change that using --background), constructed using all genes within the genome.

During clumping, both the base and the background set will contain both SNP A and SNP B, thus SNP B will be clumped out. However, for the target pathway, as Gene A isn't included, SNP A will be missing. Assuming there's no other SNPs in the pathway that are more significant than SNP B, SNP B will remain. This can therefore lead to the current observation, where there are SNPs that are located within your target pathway (hits) but not Base and background

(Side note: 2.2.7 should have increased speed for --set-perm. You can have a try and see if that does help to increase the speed)

sarasaezALS commented 5 years ago

I see. when I perform the analysis using a list of about 5000 pathways, in both training and replication, I found significant competitive p-value. When I run just the pathways independently (instead of using the file containing 5000 pathways, I run one by one) the results are different. The number of SNPs in the base changes and the number of SNPs in the pathway too. How could this be explained? Why the results are not consistent and they vary depend on the number of pathways?

choishingwan commented 5 years ago

I’ve previously got a similar report, but I was unable to reproduce it. Could you please state the version number you are using and show me the log and summary file? -- Dr Shing Wan Choi Postdoctoral Fellow Genetics and Genomic Sciences Icahn School of Medicine, Mount Sinai, NYC

sarasaezALS commented 5 years ago

The version is 2.6.6 (I have tried the last updated one but the execution halted unexpected). these are the files: SARA_TRAINING_hard_mf001_15KB.log TRAINING_c9orfLD_righticolistgenefusion.log TRAINING_c9orfLD_righticolistgenefusionsummary.txt SARA_TRAINING_hard_mf001_15KBsummary.txt

I have run several list from MsigDB, each of them containing a different number of pathways. The number of SNPs in the base is always different too.

choishingwan commented 5 years ago

Let me try and see if I can reproduce this with the UK biobank data. Will let you know of the results.

On Mon, Sep 16, 2019 at 9:06 AM sarasaezALS notifications@github.com wrote:

The version is 2.6.6 (I have tried the last updated one but the execution halted unexpected). these are the files: SARA_TRAINING_hard_mf001_15KB.log https://github.com/choishingwan/PRSice/files/3616356/SARA_TRAINING_hard_mf001_15KB.log TRAINING_c9orfLD_righticolistgenefusion.log https://github.com/choishingwan/PRSice/files/3616359/TRAINING_c9orfLD_righticolistgenefusion.log TRAINING_c9orfLD_righticolistgenefusionsummary.txt https://github.com/choishingwan/PRSice/files/3616361/TRAINING_c9orfLD_righticolistgenefusionsummary.txt SARA_TRAINING_hard_mf001_15KBsummary.txt https://github.com/choishingwan/PRSice/files/3616365/SARA_TRAINING_hard_mf001_15KBsummary.txt

I have run several list from MsigDB, each of them containing a different number of pathways. The number of SNPs in the base is always different too.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/choishingwan/PRSice/issues/152?email_source=notifications&email_token=AAJTRYQVIDFYLGWZYCBPWZDQJ6AGBA5CNFSM4IWVS7NKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD6ZCIUQ#issuecomment-531768402, or mute the thread https://github.com/notifications/unsubscribe-auth/AAJTRYSTDWGT2NAXBGS2D2TQJ6AGBANCNFSM4IWVS7NA .

choishingwan commented 5 years ago

First thing first though. Your covariate file does need a header, otherwise the first sample will always be missing from your analysis (FID = 1, IID = 76)

choishingwan commented 5 years ago

Have now managed to reproduce the problem. Will now look into it.

sarasaezALS commented 5 years ago

OK, so there is a problem. Should I continue with the analysis or just stop until knowing something?

choishingwan commented 5 years ago

Preferably give me sometime to ensure that the current result is correct. I suspect that's something to do with the flag based clumping. Hopefully I can fix that within today or before end of tomorrow.

sarasaezALS commented 5 years ago

so I will wait until you figure out what is going on. Best, Sara

choishingwan commented 5 years ago

That's a relatively stupid error on my side and should've been fixed now in the latest release. Please check and see if PRSet now generate the same number of SNPs for base and background with different set input.

(Side note: This bug also affects the clumping of other sets, it's just that it's more difficult to detect. So please do re-run the PRSet analyses)

sarasaezALS commented 5 years ago

Great!!!! it is good to know! so, is there a new version available?

choishingwan commented 5 years ago

Yes, 2.2.8, which is available now. You can find it here

sarasaezALS commented 5 years ago

the new version 2.8.8 (as happened to me with version 2.2.7) is being killed. the .log does not say anything. From the job, it says: +] Loading gcc 7.3.0 ... [+] Loading GSL 2.4 for GCC 7.2.0 ... [-] Unloading gcc 7.3.0 ... [+] Loading gcc 7.3.0 ... [+] Loading openmpi 3.0.2 for GCC 7.3.0 [+] Loading ImageMagick 7.0.8 on cn2907 [+] Loading HDF5 1.10.4 [+] Loading pandoc 2.1.1 on cn2907 [+] Loading R 3.6.0 [-] Unloading plink 1.9.0-beta4.4 on cn2907 [+] Loading plink 1.9.0-beta4.4 on cn2907 PRSice 2.2.8 (2019-09-16) https://github.com/choishingwan/PRSice (C) 2016-2019 Shing Wan (Sam) Choi and Paul F. O'Reilly GNU General Public License v3 If you use PRSice in any published work, please cite: Choi SW, O'Reilly PF. PRSice-2: Polygenic Risk Score Software for Biobank-Scale Data. GigaScience 8, no. 7 (July 1, 2019) 2019-09-16 16:15:02 /data/LNG/ALS_PATHWAYS/PRSice2.2.8/PRSice_linux-7/PRSice_linux \ --A1 A1 \ --A2 A2 \ --bar-levels 1 \ --base /data/LNG/ALS_PATHWAYS/Summary_Statistics_GWAS_2016/vanRheenen_meta_summarystats_SaraSaez_TEST.tab \ --beta \ --binary-target T \ --clump-kb 1000 \ --clump-p 1.000000 \ --clump-r2 0.100000 \ --cov /data/LNG/ALS_PATHWAYS/ALL_COVS_ALS.txt \ --fastscore \ --feature exon,gene,protein_coding,CDS \ --gtf /data/LNG/ALS_PATHWAYS/Homo_sapiens.GRCh37.75.gtf \ --ld /data/LNG/ALS_PATHWAYS/workingData \ --maf 0.01 \ --missing MEAN_IMPUTE \ --model add \ --msigdb /data/LNG/ALS_PATHWAYS/c5.mf.v6.2.symbols.gmt.txt \ --out /data/LNG/ALS_PATHWAYS/PRSice2.2.8/TRAINING_mf-wholelist \ --prevalence 0.00005 \ --print-snp \ --pvalue p \ --score std \ --seed 2879789031 \ --set-perm 1000 \ --snp SNP \ --stat b \ --target /data/LNG/ALS_PATHWAYS/ALL.ALS.unrelated_PHENO_SEX_SEVENTY \ --thread 12

Initializing Genotype file: /data/LNG/ALS_PATHWAYS/ALL.ALS.unrelated_PHENO_SEX_SEVENTY (bed)

Start processing meta ==================================================

Reading 100.00% Base file: /data/LNG/ALS_PATHWAYS/Summary_Statistics_GWAS_2016/meta 8709452 variant(s) observed in base file, with: 20 duplicated variant(s) 1 NA stat/p-value observed 1241861 ambiguous variant(s) excluded 7467570 total variant(s) included from base file

Loading Genotype info from target ==================================================

30282 people (10754 male(s), 19528 female(s)) observed 30282 founder(s) included

6438333 variant(s) not found in previous data 867 ambiguous variant(s) excluded 5195875 variant(s) included

Start processing reference

Initializing Genotype file: /data/LNG/ALS_PATHWAYS/workingData (bed)

Loading Genotype info from reference ==================================================

43257 people (15332 male(s), 27925 female(s)) observed 43257 founder(s) included

6440385 variant(s) not found in previous data 5195875 variant(s) remained

Calculate MAF and perform filtering on target SNPs ==================================================

Calculating allele frequencies: 0.00%Error: Execution halted

is any command new in the new version?

choishingwan commented 5 years ago

No, this is strange. Though I did introduced something in 2.2.7, but I don't expect them to have any negative impact to the processing. I will have to see if I can reproduce this error. Will update you tomorrow. Thanks

On Mon, Sep 16, 2019 at 4:24 PM sarasaezALS notifications@github.com wrote:

the new version 2.8.8 (as happened to me with version 2.2.7) is being killed. the .log does not say anything. From the job, it says: +] Loading gcc 7.3.0 ... [+] Loading GSL 2.4 for GCC 7.2.0 ... [-] Unloading gcc 7.3.0 ... [+] Loading gcc 7.3.0 ... [+] Loading openmpi 3.0.2 for GCC 7.3.0 [+] Loading ImageMagick 7.0.8 on cn2907 [+] Loading HDF5 1.10.4 [+] Loading pandoc 2.1.1 on cn2907 [+] Loading R 3.6.0 [-] Unloading plink 1.9.0-beta4.4 on cn2907 [+] Loading plink 1.9.0-beta4.4 on cn2907 PRSice 2.2.8 (2019-09-16) https://github.com/choishingwan/PRSice (C) 2016-2019 Shing Wan (Sam) Choi and Paul F. O'Reilly GNU General Public License v3 If you use PRSice in any published work, please cite: Choi SW, O'Reilly PF. PRSice-2: Polygenic Risk Score Software for Biobank-Scale Data. GigaScience 8, no. 7 (July 1, 2019) 2019-09-16 16:15:02 /data/LNG/ALS_PATHWAYS/PRSice2.2.8/PRSice_linux-7/PRSice_linux --A1 A1 --A2 A2 --bar-levels 1 --base /data/LNG/ALS_PATHWAYS/Summary_Statistics_GWAS_2016/vanRheenen_meta_summarystats_SaraSaez_TEST.tab

--beta --binary-target T --clump-kb 1000 --clump-p 1.000000 --clump-r2 0.100000 --cov /data/LNG/ALS_PATHWAYS/ALL_COVS_ALS.txt --fastscore --feature exon,gene,protein_coding,CDS --gtf /data/LNG/ALS_PATHWAYS/Homo_sapiens.GRCh37.75.gtf --ld /data/LNG/ALS_PATHWAYS/workingData --maf 0.01 --missing MEAN_IMPUTE --model add --msigdb /data/LNG/ALS_PATHWAYS/c5.mf.v6.2.symbols.gmt.txt --out /data/LNG/ALS_PATHWAYS/PRSice2.2.8/TRAINING_mf-wholelist --prevalence 0.00005 --print-snp --pvalue p --score std --seed 2879789031 --set-perm 1000 --snp SNP --stat b --target /data/LNG/ALS_PATHWAYS/ALL.ALS.unrelated_PHENO_SEX_SEVENTY --thread 12

Initializing Genotype file: /data/LNG/ALS_PATHWAYS/ALL.ALS.unrelated_PHENO_SEX_SEVENTY (bed) Start processing meta

Reading 100.00% Base file: /data/LNG/ALS_PATHWAYS/Summary_Statistics_GWAS_2016/meta 8709452 variant(s) observed in base file, with: 20 duplicated variant(s) 1 NA stat/p-value observed 1241861 ambiguous variant(s) excluded 7467570 total variant(s) included from base file Loading Genotype info from target

30282 people (10754 male(s), 19528 female(s)) observed 30282 founder(s) included

6438333 variant(s) not found in previous data 867 ambiguous variant(s) excluded 5195875 variant(s) included

Start processing reference

Initializing Genotype file: /data/LNG/ALS_PATHWAYS/workingData (bed) Loading Genotype info from reference

43257 people (15332 male(s), 27925 female(s)) observed 43257 founder(s) included

6440385 variant(s) not found in previous data 5195875 variant(s) remained Calculate MAF and perform filtering on target SNPs

Calculating allele frequencies: 0.00%Error: Execution halted

is any command new in the new version?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/choishingwan/PRSice/issues/152?email_source=notifications&email_token=AAJTRYWHWNHYQ7FPSIWZ24DQJ7TODA5CNFSM4IWVS7NKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD62M5IA#issuecomment-531943072, or mute the thread https://github.com/notifications/unsubscribe-auth/AAJTRYQXMB4IOF3SZCLRSE3QJ7TODANCNFSM4IWVS7NA .

sarasaezALS commented 5 years ago

OK, thanks. it is the same that happen to me in 2.2.7...it didn't say so much....just stop at the same point

choishingwan commented 5 years ago

Done, 2.2.9 should have fixed that. I accidentally left some experimental code within the MAF calculation, which have caused the issue

sarasaezALS commented 5 years ago

great! I have just downloaded. Thanks. with --extract SNPs, it is possible to force them to prevailed after the clumping, I mean, it is possible to force any particular SNP to be in the post-clumped SNPs?

choishingwan commented 5 years ago

No, this isn't possible unless I implement such feature (though that's very unlikely). If you really want to force a SNP to be included in post clump scenario, just give it the smallest p-value as we only use the p-value for clumping and thresholding. For PRSet, as you will be using p-value threshold of 1, that shouldn't affect your result.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.