AllenWLynch / lisa

MIT License

Is it appropriate to use the Wilcoxon test? #3

Open shangguandong1996 opened 3 years ago

shangguandong1996 commented 3 years ago

Hi Allen, sorry to bother you again. I just found an interesting thing about the Wilcoxon test when using percentage data.

Here is some simulated data. TF A is more "notable" in test_input_1 than in test_input_2 (its nonzero values are much larger), but the Wilcoxon p-values rank the two inputs the opposite way:

> set.seed(19960203)
> (test_input_1 <- c(rep(0, 10), runif(30, 0.4, 1)))
 [1] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
 [9] 0.0000000 0.0000000 0.8665977 0.5077563 0.5533173 0.4572644 0.6162255 0.5682280
[17] 0.8628538 0.4286266 0.7475414 0.4524465 0.5702689 0.6562953 0.5538238 0.7818898
[25] 0.4664375 0.4523438 0.4650133 0.9592675 0.6558759 0.8989666 0.7105717 0.5185779
[33] 0.7115430 0.4853153 0.7069913 0.4881373 0.5447021 0.5455549 0.4653186 0.4751753
> (test_input_2 <- c(rep(0, 6), runif(34, 0.1, 0.2)))
 [1] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.1707186 0.1439140
 [9] 0.1893698 0.1454282 0.1110170 0.1213776 0.1659214 0.1045522 0.1818943 0.1754438
[17] 0.1192141 0.1858865 0.1498234 0.1101247 0.1011694 0.1885994 0.1734860 0.1131505
[25] 0.1006139 0.1953781 0.1328520 0.1849950 0.1367957 0.1202028 0.1485642 0.1163460
[33] 0.1302627 0.1538211 0.1079418 0.1368364 0.1383502 0.1724555 0.1857568 0.1412299

> test_background <- runif(200, 0, 0.1)
> wilcox.test(test_input_1, test_background, alternative = "greater")$p.value
[1] 3.041756e-07
> wilcox.test(test_input_2, test_background, alternative = "greater")$p.value
[1] 1.431728e-12

> set.seed(19960203)
> (test_input_1 <- c(rep(0, 10), runif(30, 0.4, 1)))
 [1] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
 [9] 0.0000000 0.0000000 0.8665977 0.5077563 0.5533173 0.4572644 0.6162255 0.5682280
[17] 0.8628538 0.4286266 0.7475414 0.4524465 0.5702689 0.6562953 0.5538238 0.7818898
[25] 0.4664375 0.4523438 0.4650133 0.9592675 0.6558759 0.8989666 0.7105717 0.5185779
[33] 0.7115430 0.4853153 0.7069913 0.4881373 0.5447021 0.5455549 0.4653186 0.4751753
> (test_input_2 <- c(rep(0, 10), runif(30, 0.1, 0.2)))
 [1] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
 [9] 0.0000000 0.0000000 0.1707186 0.1439140 0.1893698 0.1454282 0.1110170 0.1213776
[17] 0.1659214 0.1045522 0.1818943 0.1754438 0.1192141 0.1858865 0.1498234 0.1101247
[25] 0.1011694 0.1885994 0.1734860 0.1131505 0.1006139 0.1953781 0.1328520 0.1849950
[33] 0.1367957 0.1202028 0.1485642 0.1163460 0.1302627 0.1538211 0.1079418 0.1368364
> test_background <- runif(200, 0, 0.1)
> wilcox.test(test_input_1, test_background, alternative = "greater")$p.value
[1] 3.041756e-07
> wilcox.test(test_input_2, test_background, alternative = "greater")$p.value
[1] 3.041756e-07

You can see that a small change in the percentages for a few genes (test_input_2 has 6 zeros in the first run and 10 in the second) produces a different result. So I am wondering whether we should use a more powerful test?

Best wishes, Guandong Shang

AllenWLynch commented 3 years ago

The Wilcoxon test is a natural choice here because the distribution of ISD scores is unknown in all conditions, so any alternative would also need to be non-parametric. The issue you may be encountering is that the test only asks how likely it is that a sample X from list 1 is greater than a sample Y from list 2; the relative differences in magnitude do not directly affect the p-value. I have thought about this before and briefly explored some options. I will think on it and get back to you.
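
To make the rank-only behaviour concrete, here is a minimal sketch in R that mirrors the simulation above. It is an illustration, not LISA's actual code: the seed, the variable names, and the helper `u_stat` are all made up for this example. It counts the (input, background) pairs in which the input value is larger, which is the statistic `wilcox.test` reports as `W`, and shows that the count does not change when the nonzero values shrink from the 0.4 to 1 range to the 0.1 to 0.2 range.

```r
# Illustrative sketch only; seed and names are arbitrary assumptions.
set.seed(1)

background <- runif(200, 0, 0.1)

# Same rank structure (10 zeros, 30 values above every background value),
# very different magnitudes.
input_high <- c(rep(0, 10), runif(30, 0.4, 1.0))
input_low  <- c(rep(0, 10), runif(30, 0.1, 0.2))

# Fewer zeros (6 instead of 10), small magnitudes.
input_fewer_zeros <- c(rep(0, 6), runif(34, 0.1, 0.2))

# Hypothetical helper: number of (x, y) pairs with x > y, ties counted as 1/2.
u_stat <- function(x, y) {
  sum(outer(x, y, ">")) + 0.5 * sum(outer(x, y, "=="))
}

u_high        <- u_stat(input_high, background)
u_low         <- u_stat(input_low, background)
u_fewer_zeros <- u_stat(input_fewer_zeros, background)

p_value <- function(x) {
  wilcox.test(x, background, alternative = "greater")$p.value
}

p_high        <- p_value(input_high)
p_low         <- p_value(input_low)
p_fewer_zeros <- p_value(input_fewer_zeros)

# u_high and u_low are identical, and so are p_high and p_low;
# only the number of zeros moves the statistic and the p-value.
print(c(u_high = u_high, u_low = u_low, u_fewer_zeros = u_fewer_zeros))
print(c(p_high = p_high, p_low = p_low, p_fewer_zeros = p_fewer_zeros))
```

Because every nonzero input value exceeds every background value in all three vectors, the pair count is just the number of nonzero inputs times 200. This is why changing the number of zeros shifts the p-value while changing the magnitudes does not.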