IMPALA-Consortium / ctas

Time Series Outliers and Anomalies
https://impala-consortium.github.io/ctas/

ks.test pvalues are not reliable for large differences #13

Closed by erblast 11 months ago

erblast commented 12 months ago
x <- c(9,13,11,8,9,4,7,7,9,7,8)
y <- c(31,38,32,31,35,31,34,32,26,29,31,34,36,34,40,28,24,29,30,28,29,26,29,27,31,32,39,28,35,27,22,32,27,27,23,28,23,35,27,24,31,26,24,22,22,20,23,21,21,22,22,21,34,31,30,28,36,30,32,32,25,30,36,32,21,30,32,29,34,29,25,33,31,34,35,31,22,31,22,29,35,23,31,25,28,35,33,36,24,26,26,30,23,26,24,32,33,35,43,30,33,46,33,29,25,27,23,22,27,31,30,27,21,33,32,37,20,29,39,31,37,33,31,34,21,28,33,35,35,31,36,30,24,30,29,31,28,34,42,22,33,26,27,22,29,32,23,40,26,18,30,40,38,36,35,33,28,22,19,23,20,25,33,27,22,24,22,34,24,29,33,36,39,21,35,26,29,37,29,33,30,27,37,29,30,34,34,26,36,27,26)

ks.test(x, y)
#> 
#>  Exact two-sample Kolmogorov-Smirnov test
#> 
#> data:  x and y
#> D = 1, p-value = NA
#> alternative hypothesis: two-sided

Created on 2023-10-26 with reprex v2.0.2

set.seed(1)
ks.test(rnorm(1000, 5, 0.1), rnorm(1000, 50, 0.1))
#> 
#>  Asymptotic two-sample Kolmogorov-Smirnov test
#> 
#> data:  rnorm(1000, 5, 0.1) and rnorm(1000, 50, 0.1)
#> D = 1, p-value = NA
#> alternative hypothesis: two-sided

Created on 2023-10-26 with reprex v2.0.2

erblast commented 12 months ago
ks.test(rnorm(3, 10, 0.1), rnorm(3, 20, 0.1))
#> 
#>  Exact two-sample Kolmogorov-Smirnov test
#> 
#> data:  rnorm(3, 10, 0.1) and rnorm(3, 20, 0.1)
#> D = 1, p-value = 0.1
#> alternative hypothesis: two-sided

Created on 2023-10-26 with reprex v2.0.2

erblast commented 12 months ago

I came across this issue when simulating outliers.

My best guess is that the p-value gets too low to be represented in R's numeric (double) vector.

D == 1 means that the two samples do not overlap. That alone does not guarantee a low p-value when the sample sizes are small (see the example above with three observations per sample, where D = 1 and the p-value is 0.1).
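
To make the small-sample point concrete: for tie-free samples of sizes n and m, D = 1 occurs only for the two rank arrangements in which one sample lies entirely below the other, so the exact two-sided p-value is 2 / choose(n + m, n). The sketch below assumes no ties; the helper name `p_exact_d1` is hypothetical and is not part of R or ctas.

```r
# Hypothetical helper: exact two-sided KS p-value when D == 1,
# assuming continuous distributions under H0 and no ties.
p_exact_d1 <- function(n, m) {
  # Only 2 of the choose(n + m, n) equally likely rank
  # arrangements separate the two samples completely.
  2 / choose(n + m, n)
}

p_exact_d1(3, 3)    # 0.1, matching the ks.test output for n = m = 3
p_exact_d1(10, 10)  # about 1.1e-05
```

With three observations per sample the p-value cannot go below 0.1, however far apart the samples are.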

So we could add a check that substitutes a low p-value, maybe 1e-6, when D == 1 and is.na(p-value).
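
A minimal sketch of such a check, assuming a thin wrapper around `stats::ks.test`. The function name `ks_pvalue_safe` and the `floor_p` argument are hypothetical, and this is not necessarily how the linked pull request implements it.

```r
# Hypothetical wrapper: returns the ks.test p-value, but substitutes
# a small floor value when D == 1 and the p-value is NA.
ks_pvalue_safe <- function(x, y, floor_p = 1e-6) {
  # ks.test may warn about ties; the warning is not relevant here.
  res <- suppressWarnings(stats::ks.test(x, y))
  p <- res$p.value
  d <- unname(res$statistic)

  # Replace only the specific failure case described in this issue.
  if (is.na(p) && isTRUE(all.equal(d, 1))) {
    p <- floor_p
  }
  p
}

set.seed(1)
ks_pvalue_safe(rnorm(1000, 5, 0.1), rnorm(1000, 50, 0.1))
```

Because the replacement only triggers when the p-value is NA, small non-overlapping samples keep their exact p-value, such as 0.1 for three observations per sample.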

erblast commented 11 months ago

https://github.com/IMPALA-Consortium/tsoa/pull/20