choishingwan / PRSice

A software package for calculating, applying, evaluating and plotting the results of polygenic risk scores
http://prsice.info
GNU General Public License v3.0

Too many variants with mismatch information #329

Closed: sakuramodokich closed this issue 1 year ago

sakuramodokich commented 1 year ago

Hi, I found that almost all SNPs in the target genotype files are filtered out when the target is loaded:

1837501 variant(s) not found in previous data 
4482871 variant(s) with mismatch information 
37522 variant(s) included 

My output:

PRSice 2.3.5 (2021-09-20) 
https://github.com/choishingwan/PRSice
(C) 2016-2020 Shing Wan (Sam) Choi and Paul F. O'Reilly
GNU General Public License v3
If you use PRSice in any published work, please cite:
Choi SW, O'Reilly PF.
PRSice-2: Polygenic Risk Score Software for Biobank-Scale Data.
GigaScience 8, no. 7 (July 1, 2019)
2023-07-06 04:36:09
./PRSice_linux \
    --a1 Allele1 \
    --a2 Allele2 \
    --bar-levels 0.001,0.05,0.1,0.2,0.3,0.4,0.5,1 \
    --base Roselli_2018_AF_HRC_GWAS_EURv11.txt \
    --beta  \
    --binary-target T \
    --bp pos \
    --chr chr \
    --clump-kb 250kb \
    --clump-p 1.000000 \
    --clump-r2 0.100000 \
    --extract UKB_imputed.valid \
    --ignore-fid  \
    --interval 5e-05 \
    --keep af_df_sample_ID.txt \
    --lower 5e-08 \
    --num-auto 22 \
    --out UKB_imputed \
    --pheno af_df.phe \
    --pheno-col af_cc \
    --pvalue P-value \
    --seed 928429407 \
    --snp MarkerName \
    --stat Effect \
    --target ukb21008_c#_qc_pass \
    --thread 36 \
    --upper 0.5

Initializing Genotype file: ukb21008_c#_qc_pass (bed) 

Start processing Roselli_2018_AF_HRC_GWAS_EURv11 
================================================== 

SNP extraction/exclusion list contains 5 columns, will 
assume first column contains the SNP ID 

Base file: Roselli_2018_AF_HRC_GWAS_EURv11.txt 
Header of file is: 
MarkerName  Allele1 Allele2 chr pos Effect  StdErr  P-value 

9362422 variant(s) observed in base file, with: 
1424010 variant(s) excluded based on user input 
7938412 total variant(s) included from base file 

Loading Genotype info from target 
================================================== 

488315 people (223502 male(s), 264624 female(s)) observed 
337053 founder(s) included 

1837501 variant(s) not found in previous data 
4482871 variant(s) with mismatch information 
37522 variant(s) included 

Phenotype file: af_df.phe 
Column Name of Sample ID: FID 
Note: If the phenotype file does not contain a header, the 
column name will be displayed as the Sample ID which is 
expected. 

There are a total of 1 phenotype to process 

Start performing clumping 

Number of variant(s) after clumping : 3813 

Processing the 1 th phenotype 

af_cc is a binary phenotype 
28063 control(s) 
308990 case(s) 

There are 1 region(s) with p-value less than 1e-5. Please 
note that these results are inflated due to the overfitting 
inherent in finding the best-fit PRS (but it's still best 
to find the best-fit PRS!). 
You can use the --perm option (see manual) to calculate an 
empirical P-value.
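
One way to narrow down where the 4482871 "mismatch information" variants come from is to compare the base file with the target's .bim file directly. The sketch below is not PRSice's own logic. It assumes that a mismatch means the chromosome, position, or allele pair of a shared SNP ID disagrees between base and target, which should be checked against the PRSice documentation. The function names, category labels, and demo rows are made up for illustration; only the column names come from the base file header in the log above.

```python
import io
from collections import Counter

# Complement table used to allow for strand flips (A<->T, C<->G).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def norm_chr(value):
    """Drop an optional 'chr' prefix so '1' and 'chr1' compare equal."""
    value = value.strip()
    return value[3:] if value.lower().startswith("chr") else value


def load_base(handle, snp="MarkerName", chrom="chr", pos="pos",
              a1="Allele1", a2="Allele2"):
    """Map SNP ID -> (chr, pos, allele set) from a whitespace-delimited file."""
    header = handle.readline().split()
    idx = [header.index(name) for name in (snp, chrom, pos, a1, a2)]
    base = {}
    for line in handle:
        fields = line.split()
        if len(fields) < len(header):
            continue  # skip blank or truncated lines
        s, c, p, x, y = (fields[i] for i in idx)
        base[s] = (norm_chr(c), p, frozenset((x.upper(), y.upper())))
    return base


def classify(base, bim_handle):
    """Count, per .bim variant, the first field that disagrees with the base."""
    counts = Counter()
    for line in bim_handle:
        fields = line.split()
        if len(fields) < 6:
            continue
        chrom, snp, _cm, pos, a1, a2 = fields[:6]
        if snp not in base:
            counts["not_in_base"] += 1
            continue
        b_chr, b_pos, b_alleles = base[snp]
        alleles = frozenset((a1.upper(), a2.upper()))
        flipped = frozenset(a.translate(COMPLEMENT) for a in alleles)
        if norm_chr(chrom) != b_chr:
            counts["chr_differs"] += 1
        elif pos != b_pos:
            counts["pos_differs"] += 1
        elif b_alleles not in (alleles, flipped):
            counts["allele_differs"] += 1
        else:
            counts["consistent"] += 1
    return counts


# Made-up rows in the layout of the base file header and a PLINK .bim file.
BASE_DEMO = """MarkerName Allele1 Allele2 chr pos Effect StdErr P-value
rs1 a g 1 1000 0.10 0.01 0.5
rs2 t c 1 2000 0.10 0.01 0.5
rs3 a c 2 3000 0.10 0.01 0.5
rs4 a g 3 4000 0.10 0.01 0.5
"""
BIM_DEMO = """1 rs1 0 1000 A G
1 rs2 0 2500 T C
2 rs3 0 3000 A T
3 rs5 0 5000 A G
chr3 rs4 0 4000 C T
"""

if __name__ == "__main__":
    base = load_base(io.StringIO(BASE_DEMO))
    print(dict(classify(base, io.StringIO(BIM_DEMO))))
```

To run it on real data, replace the two `io.StringIO` objects with open file handles for the base file and for one chromosome's .bim file from the target. If nearly every shared SNP lands in `pos_differs`, one possibility worth checking is that the base and target coordinates are on different genome builds; if `allele_differs` dominates, the allele coding is the more likely suspect.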