choishingwan / PRS-Tutorial

A tutorial on how to run basic polygenic risk score analysis
MIT License

PRSice would calculate allele frequencies even if --maf is not there #33

Closed: ptn24 closed this issue 2 years ago

ptn24 commented 2 years ago

It seems that PRSice calculates allele frequencies even when --maf is not specified. Is that intended? If the calculation is not necessary, is there a way to turn it off?

PRSice 2.3.5 (2021-09-20) 
https://github.com/choishingwan/PRSice
(C) 2016-2020 Shing Wan (Sam) Choi and Paul F. O'Reilly
GNU General Public License v3
If you use PRSice in any published work, please cite:
Choi SW, O'Reilly PF.
PRSice-2: Polygenic Risk Score Software for Biobank-Scale Data.
GigaScience 8, no. 7 (July 1, 2019)
2021-12-01 08:17:57
./PRSice_linux \
    --a1 A1 \
    --a2 A2 \
    --allow-inter  \
    --bar-levels 0.001,0.05,0.1,0.2,0.3,0.4,0.5,1 \
    --base ... \
    --binary-target T \
    --bp BP \
    --chr CHR \
    --clump-kb 250kb \
    --clump-p 1.000000 \
    --clump-r2 0.100000 \
    --extract ... \
    --interval 5e-05 \
    --keep ... \
    --lower 5e-08 \
    --num-auto 22 \
    --out PRSice \
    --pheno ... \
    --pheno-col trait_prsice \
    --pvalue P \
    --seed 1822096333 \
    --snp SNP \
    --stat BETA \
    --target-list target-list,ukb22828_c21_b0_v3.sample \
    --thread 16 \
    --type bgen \
    --upper 0.5

Initializing Genotype info from file: target-list (bgen) 
With external fam file: ukb22828_c21_b0_v3.sample 

Start processing 
... 
================================================== 

Only one column detected, will assume only SNP ID is 
provided 

Base file: 
... 
GZ file detected. Header of file is: 
CHR     BP      A1      A2      SNP     P       OR 

16543335 variant(s) observed in base file, with: 
7463751 variant(s) excluded based on user input 
1267696 ambiguous variant(s) excluded 
7811888 total variant(s) included from base file 

Loading Genotype info from target 
================================================== 

487409 people (222969 male(s), 264266 female(s)) observed 
41291 founder(s) included 

1152901 variant(s) not found in previous data 
108257 variant(s) included 

Calculate MAF and perform filtering on target SNPs 
================================================== 

108257 variant(s) included
...
ptn24 commented 2 years ago

https://github.com/choishingwan/PRSice/issues/284

choishingwan commented 2 years ago

That's more or less a bug in the log message. What actually happens is that the MAF filtering function is also responsible for generating the intermediate files. As a result, whenever the intermediate files are generated, the log states that PRSice is calculating the MAF and filtering, even though no filtering is actually done.

ptn24 commented 2 years ago

I see. Thank you for responding.

In that case, it seems that PRSice takes a long time to generate the intermediate files. I also noticed that it uses only 1 CPU (out of the 16 requested in the log I pasted above). Is there any way to make it faster and/or parallelize it?

choishingwan commented 2 years ago

Yes. Unfortunately, because of the workload involved and the limits of my coding ability, PRSice can take a rather long time to generate the intermediate files. You can speed this up by first converting the bgen files to bed format with plink2.0, which performs much better than PRSice. Note, however, that the conversion discards the dosage uncertainty information.

Sam
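
For readers who want to try the conversion suggested above, here is a minimal sketch of what the plink2 call might look like. It is not taken from the thread: the BGEN file name `chr21.bgen`, the output prefix `chr21_hardcall`, and the helper function `plink2_convert_cmd` are hypothetical placeholders, and the `ref-first` modifier assumes the first allele in the BGEN file is the reference allele, which you should verify for your own data. The helper only prints the command, so it can be inspected before anything is run.

```shell
# Sketch only: print a plink2 command that converts one BGEN file to bed/bim/fam.
# All file names used here are hypothetical placeholders.
plink2_convert_cmd() {
    # $1 = BGEN file, $2 = .sample file, $3 = output prefix, $4 = threads (default 16)
    # ref-first is an assumption about allele order in the BGEN file; check your data.
    # --make-bed writes hard-called genotypes, so dosage uncertainty is lost.
    # Note: paths containing spaces are not handled by this simple printf.
    printf 'plink2 --bgen %s ref-first --sample %s --threads %s --make-bed --out %s\n' \
        "$1" "$2" "${4:-16}" "$3"
}

# Example: show the command for one chromosome (nothing is executed).
plink2_convert_cmd chr21.bgen ukb22828_c21_b0_v3.sample chr21_hardcall
```

Once the printed command looks right, it can be piped to `sh` to run it. The resulting bed/bim/fam prefix could then presumably be passed to PRSice through `--target` with the bed type instead of `--target-list` with `--type bgen`. Options such as `--keep` and `--extract` would need to be re-checked against the converted files.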

ptn24 commented 2 years ago

Thank you!