Closed gaow closed 9 years ago
Here is my reply on Sept 3, 2015
I tried to check on this by actually performing power analysis via SKAT for just 100 replicates. The input data I used is under the "test" folder of the SEQPower source code (https://github.com/gaow/SEQPower/tree/master/test/data). I added the "-l 1" option to focus on just the first gene, for a quick demo.
# analysis 1
spower PAR data/EA.txt --sample_size 5000 --alpha 0.0000025 -v2 -j 8 --p1 0.5 --baseline_effect 0.12 --PAR_rare_detrimental 0.05 --PAR_rare_protective 0 --PAR_variable -P 0.4 -Q 0 -m "SKAT disease" -r 100 -l 1
spower show EA.csv power*
+-------+--------------+-----------------+
| power | power_median | power_std |
+-------+--------------+-----------------+
| 0.97 | 1.0 | 0.0170587221092 |
+-------+--------------+-----------------+
# analysis 2
spower PAR data/EA.txt --sample_size 5000 --alpha 0.0000025 -v2 -j 8 --p1 0.5 --baseline_effect 0.12 --PAR_rare_detrimental 0.05 --PAR_rare_protective 0 --PAR_variable -P 0.2 -Q 0 -m "SKAT disease" -r 100 -l 1
spower show EA.csv power*
+-------+--------------+-----------------+
| power | power_median | power_std |
+-------+--------------+-----------------+
| 0.88 | 1.0 | 0.0324961536185 |
+-------+--------------+-----------------+
You can see that the output of analysis 1 is 0.97 (statistical power), which is greater than the 0.88 of analysis 2 -- this agrees with our expectation.
For starters I'd suggest you run the same commands on your data and see if there is anything out of the ordinary. We can then decide what to do next.
Here is the reply I got on Sept 23, 2015
I replicated the commands using the input data that you provided to me. Indeed, I obtained similar results to yours.
However, when I applied the same commands to different input data, I obtained inconsistent results.
The input data we used contains the lines corresponding to the genes A1BG and A1CF (these genes are also included in your input data) extracted from the data set EuropeanAmericanEVS6500.sfs.gz available at the SEQPower web site.
# analysis 1 -P 0.4
spower PAR testpower.txt --sample_size 5000 --alpha 0.0000025 -v2 -j 8 --p1 0.5 --baseline_effect 0.12 --PAR_rare_detrimental 0.05 --PAR_rare_protective 0 --PAR_variable -P 0.4 -Q 0 -m "SKAT disease" -r 100
spower show testpower.csv power*
+-------+--------------+------------------+
| power | power_median | power_std |
+-------+--------------+------------------+
| 0.1 | 0.0 | 0.03 |
| 0.01 | 0.0 | 0.00994987437107 |
+-------+--------------+------------------+
# analysis 2 -P 0.2
spower PAR testpower.txt --sample_size 5000 --alpha 0.0000025 -v2 -j 8 --p1 0.5 --baseline_effect 0.12 --PAR_rare_detrimental 0.05 --PAR_rare_protective 0 --PAR_variable -P 0.2 -Q 0 -m "SKAT disease" -r 100
spower show testpower.csv power*
+-------+--------------+------------------+
| power | power_median | power_std |
+-------+--------------+------------------+
| 0.76 | 1.0 | 0.0427083130081 |
| 0.48 | 0.0 | 0.048774993593 |
+-------+--------------+------------------+
Any idea about the reason for obtaining such inconsistent results?
To figure out whether the issue is SKAT-specific, I added the CMC Fisher test in parallel with SKAT and ran the analysis.
Analysis 1, -P 0.2
spower PAR testpower.txt --sample_size 5000 --alpha 0.0000025 -v2 -j 8 --p1 0.5 --PAR_rare_detrimental 0.05 -P 0.2 -m "SKAT disease" "CFisher" -r 50 -o analysis1 -l 1
spower show analysis1.csv name method power case_cmaf ctrl_cmaf cmaf cmaf_detrimental
+------+---------+-------+----------------+----------------+------------------+----------------+
| name | method | power | case_cmaf | cmaf | cmaf_detrimental | ctrl_cmaf |
+------+---------+-------+----------------+----------------+------------------+----------------+
| A1BG | CFisher | 0.82 | 0.626811084211 | 0.617447232706 | 0.167775701076 | 0.617447232706 |
| A1BG | SKAT | 0.5 | 0.626811084211 | 0.617447232706 | 0.167775701076 | 0.617447232706 |
+------+---------+-------+----------------+----------------+------------------+----------------+
Analysis 2, -P 0.4
spower PAR testpower.txt --sample_size 5000 --alpha 0.0000025 -v2 -j 8 --p1 0.5 --PAR_rare_detrimental 0.05 -P 0.4 -m "SKAT disease" "CFisher" -r 50 -o analysis2 -l 1
spower show analysis2.csv name method power case_cmaf ctrl_cmaf cmaf cmaf_detrimental
+------+---------+-------+----------------+----------------+------------------+----------------+
| name | method | power | case_cmaf | cmaf | cmaf_detrimental | ctrl_cmaf |
+------+---------+-------+----------------+----------------+------------------+----------------+
| A1BG | CFisher | 0.64 | 0.626853377448 | 0.617447232706 | 0.319546364579 | 0.617447232706 |
| A1BG | SKAT | 0.04 | 0.626853377448 | 0.617447232706 | 0.319546364579 | 0.617447232706 |
+------+---------+-------+----------------+----------------+------------------+----------------+
It seems that although 1) the cumulative MAF for detrimental variants increased as expected, and 2) case_cmaf is about the same for -P 0.4 as for -P 0.2, the power nonetheless decreased, which is unexpected. And this is not just an issue with SKAT.
Looking into the difference in MAF between cases and controls under the two models, I realized that the problem lies in the definition of the PAR model. The PAR defined here follows Madsen & Browning (2009), where it is assigned to "a disease-associated gene with an overall population attributable risk" (quote from the paper), as opposed to the LOGIT model where effects are assigned on a per-variant basis. So in this case the total PAR remains constant regardless of changes in -P. Consequently, when -P is small, fewer sites are involved, but each has a greater effect size, and thus a greater difference in MAF between cases and controls.
For example, for the A1BG gene in the test data, when -P 0.2 the difference in MAF between cases and controls on the gene, excluding zero differences, is:
[0.0022509703850494608, 0.0022509703850494608, 0.0022509703850494608, 0.0022509703850494608, 0.0022509594545195083, 0.002249705432400256, 0.002249705432400256, 0.002249705432400256, 0.002247280133852163, 0.0022416806337502626, 0.0022228761626387535]
when -P 0.4 it is:
[0.0008891789515674869, 0.0008891789515674869, 0.0008891789515674869, 0.0008891789515674869, 0.0008891789515674869, 0.0008891789515674869, 0.0008891789515674869, 0.0008891789515674869, 0.0008891789515674869, 0.0008891789515674869, 0.0008891789515674869, 0.0008891789515674869, 0.0008891789515674869, 0.0008891789515674869, 0.0008891789515674869, 0.0008891789515674869, 0.000889157978048487, 0.000889157978048487, 0.000888701401762043, 0.000888701401762043, 0.0008878361594087878, 0.0008878326648415699, 0.0008853628456715443, 0.0008850813976651857, 0.0008853188701297065, 0.0008858888673815113, 0.0008868128113774772, 0.0009046790365030196]
The cumulative difference with -P 0.2 (diff = 0.0247157942222) is about the same as with -P 0.4 (diff = 0.0248913946377). But with -P 0.2 there are fewer sites, each with a larger MAF difference, versus more sites with smaller differences for -P 0.4. Ultimately the former case yields more power.
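The two cumulative differences quoted above can be checked directly from the per-site lists (plain Python, no SEQPower needed):

```python
# Per-site case/control MAF differences for A1BG, copied from above.
diff_p02 = [0.0022509703850494608, 0.0022509703850494608, 0.0022509703850494608,
            0.0022509703850494608, 0.0022509594545195083, 0.002249705432400256,
            0.002249705432400256, 0.002249705432400256, 0.002247280133852163,
            0.0022416806337502626, 0.0022228761626387535]

diff_p04 = [0.0008891789515674869] * 16 + [
            0.000889157978048487, 0.000889157978048487,
            0.000888701401762043, 0.000888701401762043,
            0.0008878361594087878, 0.0008878326648415699,
            0.0008853628456715443, 0.0008850813976651857,
            0.0008853188701297065, 0.0008858888673815113,
            0.0008868128113774772, 0.0009046790365030196]

# Few sites with large differences vs. many sites with small ones,
# but the totals are nearly identical:
print(len(diff_p02), sum(diff_p02))  # 11 sites, ~0.0247157942222
print(len(diff_p04), sum(diff_p04))  # 28 sites, ~0.0248913946377
```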
I am not yet sure what to do. I do not want to change the definition of PAR to per-variant, because it would then differ from Madsen & Browning and other papers that use their setup. Perhaps I will make a note, or disable the -P option, for that model. I suggest using the LOGIT model anyway, because it is a more natural way of modeling phenotypes.
Below is an issue reported by a user via email