choishingwan / PRSice

A software package for calculating, applying, evaluating and plotting the results of polygenic risk scores
http://prsice.info
GNU General Public License v3.0

mismatched variants? #30 re-open issue #69

Closed ghost closed 5 years ago

ghost commented 6 years ago

Following previous issue #30

I am using BGEN and I am having the same issue:

6330995 variant(s) observed in base file, with:
6330995 variant(s) not found in target file
0 total variant(s) included from base file

Error: No valid variant remaining

I've checked internally and I have matching rs numbers and positions. I downloaded PRSice on 21 May, so I'm not sure whether you fixed this before that date.

Thanks!

choishingwan commented 6 years ago

Copying the comment I posted in issue #30: Could you include the full log of your process? If you are using --extract or --exclude, could you please make sure those files aren't empty? There is a similar discussion here and you can find a hot fix version of PRSice here. This hot fix dealt with a problem with the .valid file.

ghost commented 6 years ago

Hi Sam,

It works now. Thank you!!! One suggestion: it would be useful to add the run name to the PRSice.valid file. That way PRSice wouldn't overwrite it when running different or multiple jobs, which would make things a bit easier to track or debug.

Cheers, all the best.

Judit

choishingwan commented 6 years ago

You can use the --out parameter to specify the prefix of all PRSice output. The default, when --out isn't given, is PRSice.
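As a rough illustration of that suggestion, the Python sketch below gives each job its own --out prefix so that outputs such as the .valid file do not overwrite each other. The helper `build_prsice_command` is hypothetical, it only assembles an argument list, and it uses only flags that appear in the logs in this thread (--base, --target, --type, --out); a real run would also need the column flags for the base file.

```python
from typing import List


def build_prsice_command(binary: str, base: str, target: str,
                         run_name: str, target_type: str = "bgen") -> List[str]:
    """Assemble a PRSice argument list whose outputs are prefixed with run_name.

    Hypothetical helper for illustration only: it builds the argument list
    and does not execute anything.
    """
    return [
        binary,
        "--base", base,
        "--target", target,
        "--type", target_type,
        "--out", run_name,  # every output file of this job starts with run_name
    ]


# Two jobs with distinct prefixes, so their outputs land in chr21.* and chr22.*
for chrom in (21, 22):
    cmd = build_prsice_command("PRSice_linux", "pheno.txt",
                               f"chr{chrom}", f"chr{chrom}")
    print(" ".join(cmd))
```

The file names here are placeholders taken from the logs in this thread; any unique string per job works as the prefix.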

carbocation commented 6 years ago

Would it make sense to keep this issue open until the hotfix is merged into a release? Or alternatively, if that hotfix already made it into a release, then I'm still having the same issues with bgen files.

choishingwan commented 6 years ago

Could you please show me the log file? Which version are you using? This specific problem should be fixed in the latest updates.

carbocation commented 6 years ago

I'm running this now with the "hotfix" and it works fine. It does not work with the release that I downloaded from https://choishingwan.github.io/PRSice/

I am not sure how they differ, but running a binary diff, I can tell you that they do: Binary files /home/unix/jamesp/bin/PRSice_linux and /home/unix/jamesp/lib/PRSice/PRSice_linux differ

The hotfix has the following at the start of its log:

PRSice 2.1.2.beta (31 May 2018) 
https://github.com/choishingwan/PRSice
(C) 2016-2017 Shing Wan (Sam) Choi, Jack Euesden, Cathryn M. Lewis, Paul F. O'Reilly
GNU General Public License v3

If you use PRSice in any published work, please cite:
Jack Euesden Cathryn M. Lewis Paul F. O'Reilly (2015)
PRSice: Polygenic Risk Score software.
Bioinformatics 31 (9): 1466-1468

The non-hotfix has the following at the start of its log:

PRSice 2.1.3.beta (21 August 2018) 
https://github.com/choishingwan/PRSice
(C) 2016-2017 Shing Wan (Sam) Choi, Jack Euesden, Cathryn M. Lewis, Paul F. O'Reilly
GNU General Public License v3

If you use PRSice in any published work, please cite:
Jack Euesden Cathryn M. Lewis Paul F. O'Reilly (2015)
PRSice: Polygenic Risk Score software.
Bioinformatics 31 (9): 1466-1468

carbocation commented 6 years ago

And the full log from the 2.1.3beta that is not working for me with bgens. (Identifying paths and filenames have been modified; otherwise, this is intact.)

PRSice 2.1.3.beta (21 August 2018) 
https://github.com/choishingwan/PRSice
(C) 2016-2017 Shing Wan (Sam) Choi, Jack Euesden, Cathryn M. Lewis, Paul F. O'Reilly
GNU General Public License v3

If you use PRSice in any published work, please cite:
Jack Euesden Cathryn M. Lewis Paul F. O'Reilly (2015)
PRSice: Polygenic Risk Score software.
Bioinformatics 31 (9): 1466-1468

2018-08-21 19:08:53
PRSice_linux \
    --A1 effect_allele \
    --A2 noneffect_allele \
    --all-score  \
    --bar-levels 1 \
    --base pheno.txt \
    --beta  \
    --binary-target T \
    --bp bp_hg19 \
    --chr chr \
    --clump-kb 0 \
    --clump-p 1.000000 \
    --clump-r2 0.100000 \
    --extract chr22.valid \
    --hard-thres 0.900000 \
    --info-base median_info,0.9 \
    --interval 0.000050 \
    --lower 0.000100 \
    --model add \
    --no-default  \
    --no-regress  \
    --out chr22 \
    --pheno-file sample.sample \
    --pvalue p_dgc \
    --se se_dgc \
    --seed 569377328 \
    --snp markername \
    --stat beta \
    --target chr22 \
    --thread 2 \
    --type bgen \
    --upper 0.500000

Loading Genotype file: 
chr22 
(bgen) 

Detected bgen sample file format
487409 people (0 male(s), 0 female(s)) observed 
487409 founder(s) included 

SNP extraction/exclusion list contains 5 columns, will 
assume first column contains the SNP ID 

1255K SNPs processed in chr22.bgen
1576 variant(s) included 

1 region included 

Check Phenotype file: 
sample.sample 
Column Name of Sample ID: ID_1+ID_2 
Note: If the phenotype file does not contain a header, the 
column name will be displayed as the Sample ID which is ok. 
Phenotype Name: missing 
There are a total of 1 phenotype to process 

Start processing pheno 
============================== 

Reading 100.00%
Base file: pheno.txt 
9455778 variant(s) observed in base file, with: 
9455778 variant(s) not found in target file 
0 total variant(s) included from base file 

Error: No valid variant remaining

choishingwan commented 6 years ago

Can you check if your base file actually contains those SNP IDs?

carbocation commented 6 years ago

Yes - the May hotfix is running fine on the ~60,000 variants that overlap this GWAS data from chromosome 22 in my bgens. The August 21 version does not seem to work for me.

choishingwan commented 6 years ago

Do you still have the log file from May? Here, the log suggests that only around 1500 SNPs are left after filtering. The main difference between the May version and the August version is the INFO filtering and MAF filtering, so SNPs should be filtered out correctly.

carbocation commented 6 years ago

The May version (i.e., the hotfix which I just downloaded from your Dropbox) is still in the process of running:

PRSice 2.1.2.beta (31 May 2018) 
https://github.com/choishingwan/PRSice
(C) 2016-2017 Shing Wan (Sam) Choi, Jack Euesden, Cathryn M. Lewis, Paul F. O'Reilly
GNU General Public License v3

If you use PRSice in any published work, please cite:
Jack Euesden Cathryn M. Lewis Paul F. O'Reilly (2015)
PRSice: Polygenic Risk Score software.
Bioinformatics 31 (9): 1466-1468

2018-08-21 19:30:50
PRSice_linux \
    --A1 effect_allele \
    --A2 noneffect_allele \
    --all-score  \
    --bar-levels 1 \
    --base pheno.txt \
    --beta  \
    --binary-target T \
    --bp bp_hg19 \
    --chr chr \
    --info-base median_info,0.9 \
    --interval 0.000050 \
    --lower 0.000100 \
    --model add \
    --no-clump  \
    --no-default  \
    --no-regress  \
    --out chr22 \
    --pheno-file sample.sample \
    --pvalue p_dgc \
    --se se_dgc \
    --seed 3898643461 \
    --snp markername \
    --stat beta \
    --target chr22 \
    --thread 2 \
    --type bgen \
    --upper 0.500000

Loading Genotype file: 
chr22 
(bgen) 

Detected bgen sample file format
487409 people (0 male(s), 0 female(s)) observed 
487409 founder(s) included 

1255K SNPs processed in chr22.bgen
Error: A total of 4263 duplicated SNP ID detected out of 
       1082409 input SNPs!. Valid SNP ID stored at chr22.valid. 
       You can avoid this error by using --extract chr22.valid 

PRSice 2.1.2.beta (31 May 2018) 
https://github.com/choishingwan/PRSice
(C) 2016-2017 Shing Wan (Sam) Choi, Jack Euesden, Cathryn M. Lewis, Paul F. O'Reilly
GNU General Public License v3

If you use PRSice in any published work, please cite:
Jack Euesden Cathryn M. Lewis Paul F. O'Reilly (2015)
PRSice: Polygenic Risk Score software.
Bioinformatics 31 (9): 1466-1468

2018-08-21 19:31:12
PRSice_linux \
    --A1 effect_allele \
    --A2 noneffect_allele \
    --all-score  \
    --bar-levels 1 \
    --base pheno.txt \
    --beta  \
    --binary-target T \
    --bp bp_hg19 \
    --chr chr \
    --extract chr22.valid \
    --info-base median_info,0.9 \
    --interval 0.000050 \
    --lower 0.000100 \
    --model add \
    --no-clump  \
    --no-default  \
    --no-regress  \
    --out chr22 \
    --pheno-file sample.sample \
    --pvalue p_dgc \
    --se se_dgc \
    --seed 3104147554 \
    --snp markername \
    --stat beta \
    --target chr22 \
    --thread 2 \
    --type bgen \
    --upper 0.500000

Loading Genotype file: 
chr22 
(bgen) 

Detected bgen sample file format
487409 people (0 male(s), 0 female(s)) observed 
487409 founder(s) included 

1255K SNPs processed in chr22.bgen
1074860 variant(s) included 

1 region included 

Start processing pheno 
============================== 

Reading 100.00%
Base file: pheno.txt 
9455778 variant(s) observed in base file, with: 
3 ambiguous variant(s) excluded 
9358122 variant(s) not found in target file 
1115 mismatched variant(s) excluded 
30641 variant(s) with INFO score less than 0.900000 
66256 total variant(s) included from base file 

Warning: Mismatched SNPs detected between base and 
         target!You should check the files are based on the same 
         genome build, or that can just be InDels 

Check Phenotype file: 
sample.sample 
Column Name of Sample ID: ID_1+ID_2 
Note: If the phenotype file does not contain a header, the 
column name will be displayed as the Sample ID which is ok. 
Phenotype Name: missing 
There are a total of 1 phenotype to process 

Processing the 1 th phenotype
Processing 0.23%

choishingwan commented 6 years ago

Strange. Can you check that the valid file is the same for both versions? For the August version it seems to contain only about 1500 SNPs, whereas for the May version it contains many more.

(I won't be able to reply after this email until tomorrow morning.)

carbocation commented 6 years ago

I can confirm that it is the same file, in particular because I am running this with a bash script and I just swapped out the PRSice binary. Otherwise, exact same files.

Edit: I might have pasted a second-run, after the first run did screening. So the files are the same, but the filter might be different. Will need to get back to you, as I’m no longer at the computer tonight.

choishingwan commented 6 years ago

If you don't mind, could you please post the number of lines in the valid file generated by each build? I am now working on writing unit tests for PRSice, and hopefully if there are any problems, I can capture them. Thanks
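A quick way to gather those numbers could look like the Python sketch below. It assumes, as the log earlier in this thread suggests, that the first whitespace-separated column of a .valid file holds the SNP ID. The function names are hypothetical, and the counts are of distinct IDs rather than raw lines, so a header row (if the file has one) would be counted as an ID in both files.

```python
from typing import Dict, Set


def read_first_column(path: str) -> Set[str]:
    """Return the set of first-column entries from a whitespace-delimited file."""
    ids = set()
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if fields:  # skip blank lines
                ids.add(fields[0])
    return ids


def compare_valid_files(path_a: str, path_b: str) -> Dict[str, int]:
    """Summarise how two .valid files differ by SNP ID."""
    a = read_first_column(path_a)
    b = read_first_column(path_b)
    return {
        "ids_in_a": len(a),
        "ids_in_b": len(b),
        "shared": len(a & b),
        "only_in_a": len(a - b),
        "only_in_b": len(b - a),
    }
```

Calling `compare_valid_files` on the chr22.valid written by each build (saved under two different names) would show whether the two builds kept the same set of variants.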

carbocation commented 6 years ago

Unfortunately, I didn't end up needing the tool to run to completion, so I don't have a full answer.

However, on a complete run, I got 0 lines of output from the August version. In contrast, the May hotfix version seemed to be producing valid output at every expected site. (I only let it get ~0.5% of the way complete because I ended up deferring this analysis for something else that came up.)

dp1170 commented 5 years ago

Hi Sam, Please see the PRSice error below:

Error: No valid variant remaining

I checked that:

LOG FILE:

2019-03-13 22:11:11
Rscript PRSice.R
    --prsice /usr/local/bin/PRSice \
    --A1 Effect_allele \
    --A2 Non_Effect_allele \
    --bar-levels 0.001,0.05,0.1,0.2,0.3,0.4,0.5,1 \
    --base GWAS_summary_1.txt \
    --beta \
    --binary-target T \
    --bp Position \
    --chr Chromosome \
    --clump-kb 250 \
    --clump-p 1.000000 \
    --clump-r2 0.100000 \
    --interval 5e-05 \
    --lower 0.0001 \
    --model add \
    --out my_results \
    --pvalue Pvalue \
    --se SE \
    --seed 835715086 \
    --snp MarkerName \
    --stat Beta \
    --target my_plink_input \
    --thread 6 \
    --upper 0.5

Loading Genotype file:
my_plink_input
(bed)

1201 people (693 male(s), 508 female(s)) observed
1201 founder(s) included

2835792 ambiguous variant(s) excluded
16115311 variant(s) included

1 region included

There are a total of 1 phenotype to process

Start processing GWAS_summary_1
==============================

Reading 100.00%
Base file: GWAS_summary_1.txt
7055881 variant(s) observed in base file, with:
7055881 variant(s) not found in target file
0 total variant(s) included from base file

Error: No valid variant remaining

Error: Execution halted

Thank you.

choishingwan commented 5 years ago

While the SNPs might have the same position, as long as their variant IDs don't match, they will be counted as missing. It is possible that your base and target use a different naming system for their SNPs.
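One way to check this before running PRSice is to count how many base-file IDs also occur in the target. The Python sketch below assumes a whitespace-delimited base file with a header row and a PLINK .bim file, whose second column is the variant ID. The function names are hypothetical, and the column name is whatever is passed to --snp.

```python
from typing import Set


def base_snp_ids(path: str, snp_column: str) -> Set[str]:
    """Collect the IDs in the named column of a base file that has a header row."""
    with open(path) as handle:
        header = handle.readline().split()
        index = header.index(snp_column)  # raises ValueError if the column is absent
        return {fields[index]
                for fields in (line.split() for line in handle)
                if len(fields) > index}


def bim_snp_ids(path: str) -> Set[str]:
    """Collect variant IDs from a PLINK .bim file (second column)."""
    with open(path) as handle:
        return {fields[1]
                for fields in (line.split() for line in handle)
                if len(fields) >= 2}


def id_overlap(base_path: str, snp_column: str, bim_path: str) -> int:
    """Number of distinct base-file IDs that also appear in the target .bim."""
    return len(base_snp_ids(base_path, snp_column) & bim_snp_ids(bim_path))
```

With the files from the log above, a call such as `id_overlap("GWAS_summary_1.txt", "MarkerName", "my_plink_input.bim")` returning 0 would indicate that the two files name their variants differently, for example rsIDs in one and chromosome:position strings in the other.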

mehul4frnds commented 5 years ago

Hi Sam, I am facing the same issue of no variants remaining. Please find the log below:

PRSice 2.1.2.beta (31 May 2018)
https://github.com/choishingwan/PRSice
(C) 2016-2017 Shing Wan (Sam) Choi, Jack Euesden, Cathryn M. Lewis, Paul F. O'Reilly
GNU General Public License v3

If you use PRSice in any published work, please cite:
Jack Euesden Cathryn M. Lewis Paul F. O'Reilly (2015)
PRSice: Polygenic Risk Score software.
Bioinformatics 31 (9): 1466-1468

2019-03-25 16:26:40
./PRSice_linux \
    --A1 A1 \
    --A2 A2 \
    --all-score \
    --bar-levels 0.001,0.05,0.1,0.2,0.3,0.4,0.5,1 \
    --base /home/cnap_lab/Mehul_prs_25032019/PRSice_linux/glgc_25032019.assoc \
    --beta \
    --binary-target F \
    --bp BP \
    --chr CHR \
    --extract /home/cnap_lab/Mehul_prs_25032019/prs_illumina_25032019.valid \
    --info-base INFO,0.9 \
    --interval 0.000050 \
    --keep-ambig \
    --lower 0.000100 \
    --model add \
    --no-clump \
    --no-regress \
    --out /home/cnap_lab/Mehul_prs_25032019/prs_illumina_25032019 \
    --perm 10000 \
    --print-snp \
    --pvalue P \
    --seed 2628881950 \
    --snp SNP \
    --stat BETA \
    --target /mnt/Data/Genotype_DATA_cnap_lab/hrc_GODARTS/affy_hrc/affy6b37_GD20062016forimp \
    --thread 1 \
    --type bed \
    --upper 0.500000

Loading Genotype file:
/mnt/Data/Genotype_DATA_cnap_lab/hrc_GODARTS/affy_hrc/affy6b37_GD20062016forimp
(bed)

3884 people (0 male(s), 0 female(s)) observed
3884 founder(s) included

8212 ambiguous variant(s) kept
64845 variant(s) included

1 region included

Start processing glgc_25032019
==============================

Base file: /home/cnap_lab/Mehul_prs_25032019/PRSice_linux/glgc_25032019.assoc
44 variant(s) observed in base file, with:
44 variant(s) not found in target file
0 total variant(s) included from base file

Error: No valid variant remaining

I look forward to your reply.

Thanks

choishingwan commented 5 years ago

Could you please try using 2.1.9? Also, with only 44 variants in the base file, it is quite possible that none of the SNPs are found in the target dataset.

mehul4frnds commented 5 years ago

Hi Sam, Thanks for the reply. I have tried 2.1.9 and it works if I use individual files: I am able to get scores chromosome by chromosome. But when I ran the command on all files using #, it said 'Killed' and 'Error: Execution halted'. Please find the log file attached and advise. Thanks

prs_illumina_26032019.log https://github.com/choishingwan/PRSice/files/3009358/prs_illumina_26032019.log

choishingwan commented 5 years ago

That's most likely due to lack of memory. For example, with UKBB data, you will need around 40 GB of memory to process the files.