gaow / SEQPower

Power Analysis and Sample Size Estimation Software for Sequence Based Association Studies
GNU General Public License v2.0
2 stars 2 forks source link

PAR model increased power for SKAT with less causal variants #2

Closed gaow closed 9 years ago

gaow commented 9 years ago

Below is an issue reported by a user via email

I'm running a fixed PAR model (--PAR_rare_detrimental 0.05) and then wondered what would be the effect when I decrease the proportion_detrimental parameter. When I move from -P 0.40 to -P 0.20 I get more significant SKAT results in general (P = 1x10^-10 to 6x10^-8).

This goes in the other direction that I was expecting, wouldn't you expect to have bigger P-values when you decrease the proportion of detrimental variants in the same gene?

We ran 10 independent runs, and in all cases using -P 0.2 yield more significant P-values.

These are the two lines I'm using:

spower 'PAR' 'split/EuropeanAmericanEVS6500.causal.17.6945054.sfs' --sample_size 5000 --alpha 0.0000025 -v2 -j 8 --p1 0.5 --baseline_effect 0.12 --PAR_rare_detrimental 0.05 --PAR_rare_protective 0 --PAR_variable -P 0.4 -Q 0 -r 1 -m 'GroupWrite SimDataEVS_PAR.0.05.FixedLoci' --recode -r 10
spower 'PAR' 'split/EuropeanAmericanEVS6500.causal.17.6945054.sfs' --sample_size 5000 --alpha 0.0000025 -v2 -j 8 --p1 0.5 --baseline_effect 0.12 --PAR_rare_detrimental 0.05 --PAR_rare_protective 0 --PAR_variable -P 0.2 -Q 0 -r 1 -m 'GroupWrite SimDataEVS_PAR.0.05.FixedLoci' --recode -r 10
gaow commented 9 years ago

Here is my reply on Sept 3, 2015

I tried to check on this by actually performing power analysis via SKAT for just 100 replicates. The input data I used is under the "test" folder of SEQPower source code (https://github.com/gaow/SEQPower/tree/master/test/data). I added "-l 1" option to just focus on the first gene, for a quick demo.

# analysis 1
spower PAR data/EA.txt  --sample_size 5000 --alpha 0.0000025 -v2 -j 8 --p1 0.5 --baseline_effect 0.12 --PAR_rare_detrimental 0.05 --PAR_rare_protective 0 --PAR_variable -P 0.4 -Q 0 -m "SKAT disease" -r 100 -l 1
spower show EA.csv power*

 +-------+--------------+-----------------+
| power | power_median |    power_std    |
 +-------+--------------+-----------------+
 |  0.97 |     1.0      | 0.0170587221092 |
 +-------+--------------+-----------------+

 # analysis 2
spower PAR data/EA.txt  --sample_size 5000 --alpha 0.0000025 -v2 -j 8 --p1 0.5 --baseline_effect 0.12 --PAR_rare_detrimental 0.05 --PAR_rare_protective 0 --PAR_variable -P 0.2 -Q 0 -m "SKAT disease" -r 100 -l 1
spower show EA.csv power*

 +-------+--------------+-----------------+
 | power | power_median |    power_std    |
 +-------+--------------+-----------------+
 |  0.88 |     1.0      | 0.0324961536185 |
 +-------+--------------+-----------------+

You see output of analysis 1 is 0.97 (statistical power), which is greater than output of analysis 2 of 0.88 -- this agrees with our expectation.

For starters I'd suggest you do the same to your data and see if there is anything out of ordinary. We can then decide what to do next.

gaow commented 9 years ago

Here is reply I get on Sept 23, 2015

I replicated the commands using the input data that you provided to me. Indeed, I obtained similar results to yours.

However, when I tried to apply the same command to different input data, I obtained inconsistent results.

The input data we used contains the lines corresponding to the genes A1BG and A1CF (these genes are also included in your input data) extracted from the data set EuropeanAmericanEVS6500.sfs.gz available at the SEQPower web site.

# analysis 1 -P 0.4

spower PAR testpower.txt  --sample_size 5000 --alpha 0.0000025 -v2 -j 8 --p1 0.5 --baseline_effect 0.12 --PAR_rare_detrimental 0.05 --PAR_rare_protective 0 --PAR_variable -P 0.4 -Q 0 -m "SKAT disease" -r 100
spower show testpower.csv power*

+-------+--------------+------------------+
| power | power_median |    power_std     |
+-------+--------------+------------------+
|  0.1  |     0.0      |       0.03       |
|  0.01 |     0.0      | 0.00994987437107 |
+-------+--------------+------------------+

# analysis 2 -P 0.2

spower PAR testpower.txt  --sample_size 5000 --alpha 0.0000025 -v2 -j 8 --p1 0.5 --baseline_effect 0.12 --PAR_rare_detrimental 0.05 --PAR_rare_protective 0 --PAR_variable -P 0.2 -Q 0 -m "SKAT disease" -r 100 
spower show testpower.csv power*

+-------+--------------+------------------+
| power | power_median |    power_std     |
+-------+--------------+------------------+
|   0.76 |     1.0      | 0.0427083130081   |
|   0.48 |      0.0     | 0.048774993593  |
+-------+--------------+------------------+

Any idea about the reason for obtaining such inconsistent results ?

testpower.txt

gaow commented 9 years ago

To figure out if the issue is SKAT specific I added CMC Fisher test in parallel with SKAT and run the analysis.

Analysis 1, -P 0.2

spower PAR testpower.txt  --sample_size 5000 --alpha 0.0000025 -v2 -j 8 --p1 0.5 --PAR_rare_detrimental 0.05 -P 0.2 -m "SKAT disease" "CFisher"  -r 50 -o analysis1 -l 1

spower show analysis1.csv name method power case_cmaf ctrl_cmaf cmaf cmaf_detrimental
+------+---------+-------+----------------+----------------+------------------+----------------+
| name |  method | power |   case_cmaf    |      cmaf      | cmaf_detrimental |   ctrl_cmaf    |
+------+---------+-------+----------------+----------------+------------------+----------------+
| A1BG | CFisher |  0.82 | 0.626811084211 | 0.617447232706 |  0.167775701076  | 0.617447232706 |
| A1BG |   SKAT  |  0.5  | 0.626811084211 | 0.617447232706 |  0.167775701076  | 0.617447232706 |
+------+---------+-------+----------------+----------------+------------------+----------------+

Analysis 2, -P 0.4

spower PAR testpower.txt  --sample_size 5000 --alpha 0.0000025 -v2 -j 8 --p1 0.5 --PAR_rare_detrimental 0.05 -P 0.4 -m "SKAT disease" "CFisher"  -r 50 -o analysis2 -l 1

spower show analysis2.csv name method power case_cmaf ctrl_cmaf cmaf cmaf_detrimental
+------+---------+-------+----------------+----------------+------------------+----------------+
| name |  method | power |   case_cmaf    |      cmaf      | cmaf_detrimental |   ctrl_cmaf    |
+------+---------+-------+----------------+----------------+------------------+----------------+
| A1BG | CFisher |  0.64 | 0.626853377448 | 0.617447232706 |  0.319546364579  | 0.617447232706 |
| A1BG |   SKAT  |  0.04 | 0.626853377448 | 0.617447232706 |  0.319546364579  | 0.617447232706 |
+------+---------+-------+----------------+----------------+------------------+----------------+

It seems though 1) cumulative MAF for detrimental variants increased as expected 2) case_cmaf is about the same for -P 0.4 than for -P 0.2, yet the power decreased which is unexpected. Also And this is not just issue with SKAT.

gaow commented 9 years ago

Looking into the difference in MAF between case/ctrl under two models I realized that the problem lies in the definition of PAR model. The PAR defined here follows from Madsen & Browning (2009), which is assigned to "a disease-associated gene with an overall population attributable risk" (quote from the paper), as opposed to in the LOGIT model where effects are assigned per variant basis. So in this case the total PAR remains constant regardless of change in -P. Consequently when -P is small there are less sites involved, but each will have a greater effect size, thus greater difference in MAF between case / ctrl.

For example for A1BG gene in the test data, when -P 0.2 the difference in MAF between case/ctrl on the gene, excluding zero difference, is

[0.0022509703850494608, 0.0022509703850494608, 0.0022509703850494608, 0.0022509703850494608, 0.0022509594545195083, 0.002249705432400256, 0.002249705432400256, 0.002249705432400256, 0.002247280133852163, 0.0022416806337502626, 0.0022228761626387535]

when -P 0.4 it is:

[0.0008891789515674869, 0.0008891789515674869, 0.0008891789515674869, 0.0008891789515674869, 0.0008891789515674869, 0.0008891789515674869, 0.0008891789515674869, 0.0008891789515674869, 0.0008891789515674869, 0.0008891789515674869, 0.0008891789515674869, 0.0008891789515674869, 0.0008891789515674869, 0.0008891789515674869, 0.0008891789515674869, 0.0008891789515674869, 0.000889157978048487, 0.000889157978048487, 0.000888701401762043, 0.000888701401762043, 0.0008878361594087878, 0.0008878326648415699, 0.0008853628456715443, 0.0008850813976651857, 0.0008853188701297065, 0.0008858888673815113, 0.0008868128113774772, 0.0009046790365030196]

The cumulative difference with -P 0.2 (diff = 0.0247157942222) is about the same as when -P 0.4 (diff = 0.0248913946377). But there are less sites with higher diff in MAF for -P 0.2 vs. more sites with lower diff in MAF for -P 0.4. Ultimately there appears more power for the former case.

I am not yet sure what to do. I do not want to change the definition of PAR to per variant because it will differ from Madsen & Browning and some other paper that uses their setup. Perhaps will make a note or disable -P option for that model. I suggest using LOGIT model anyways because that's a more natural way of modeling phenotypes.