Closed tbbarr closed 2 days ago
@mattansb do you know how missing values should be handled in winsorize()? I think they should be ignored.
Edit: Sorry! I'm using the wrong argument; it should be threshold, not percentile!
~While we're here I mentioned in my edit that I don't think percentiles are being calculated correctly at all. To illustrate:~
library(datawizard)
set.seed(1)
values <- rnorm(1000)
print(max(values, na.rm=TRUE))
[1] 3.810277
values_wins <- winsorize(values, percentile=0.01)
print(max(values_wins, na.rm=TRUE))
[1] 0.852815
# No missing values here but this max value is far smaller than what we'd
# expect the winsorised max value to be.
# I don't recall if percentile=0.01 is 0.5th percentile or 1st percentile
# but it doesn't make a difference for this example
print(quantile(values, probs = 0.01))
1%
-2.424401
print(quantile(values, probs = 0.99))
99%
2.308112
# 0.8 is closer to winsorising at the 20th percentile
print(quantile(values, probs = 0.2))
20%
-0.8815065
print(quantile(values, probs = 0.8))
80%
0.853734
Sorry for the confusion, this stems from me not using the correct function argument: percentile=0.01 instead of threshold=0.01. The issue goes away once you fix that (NA values are not an issue). It might be worth adding a warning in the future for anyone making this mistake, but feel free to close the issue.
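The kind of warning suggested here could look roughly like the sketch below. It is a base-R illustration only, not datawizard's code: `winsorize_checked()` is a hypothetical name, and the default of 0.2 is an assumption inferred from the 80th-percentile result shown above. The idea is that a misspelled argument such as `percentile` is silently absorbed by `...`, so the function can inspect `...` and warn.

```r
# Hypothetical wrapper (not part of datawizard): warn when named arguments
# that the function does not recognise end up in `...`.
winsorize_checked <- function(x, threshold = 0.2, ...) {
  extra <- names(list(...))
  extra <- extra[nzchar(extra)]
  if (length(extra) > 0) {
    warning(
      "Unused argument(s) ignored: ", paste(extra, collapse = ", "),
      "; did you mean `threshold`?",
      call. = FALSE
    )
  }
  # Simple quantile-based clamp; an assumption, not datawizard's exact method
  bounds <- quantile(x, probs = c(threshold, 1 - threshold),
                     na.rm = TRUE, names = FALSE)
  pmax(pmin(x, bounds[2]), bounds[1])
}

set.seed(1)
values <- rnorm(1000)
wrong <- winsorize_checked(values, percentile = 0.01) # warns, falls back to 0.2
right <- winsorize_checked(values, threshold = 0.01)  # clamps at 1% / 99%
```

With the warning in place, the first call would flag the typo instead of quietly winsorising at the default threshold.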
Hi there,
It seems the winsorize function does not have the common na.rm option that most R stats functions have. As such, when winsorising, an NA value messes up the winsorisation. Consider the example below:
I generate 1,000 random values and add in 1,000 missing values. Winsorising at the 1% level should set the max lower than the unwinsorised max (i.e. to the first percentile value in the data), but here winsorising does nothing. I would have thought that adding in, say, just 50 or so missing values would be enough to completely negate the winsorisation if they're simply treated as being in the tails (since that's more than 1% of the values on either tail). However, you need to add about 300 NA values before the winsorisation does nothing, so I'm not really sure how NA values are being treated.
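For reference, the expected behaviour with 1,000 random values plus 1,000 missing values can be sketched in base R. This is a minimal illustration under stated assumptions, not the code from the original report and not datawizard's implementation: `winsorize_base()` is a hypothetical helper that computes the cut-offs with `quantile(..., na.rm = TRUE)` and leaves missing values as NA.

```r
# Hypothetical helper (not part of datawizard): clamp values to the
# [threshold, 1 - threshold] quantiles, ignoring NAs when computing cut-offs.
winsorize_base <- function(x, threshold = 0.01) {
  stopifnot(is.numeric(x), threshold >= 0, threshold < 0.5)
  bounds <- quantile(x, probs = c(threshold, 1 - threshold),
                     na.rm = TRUE, names = FALSE)
  # pmin()/pmax() propagate NA, so missing values stay missing in the result
  pmax(pmin(x, bounds[2]), bounds[1])
}

set.seed(1)
values <- c(rnorm(1000), rep(NA_real_, 1000))
values_wins <- winsorize_base(values, threshold = 0.01)

print(max(values, na.rm = TRUE))      # unwinsorised max
print(max(values_wins, na.rm = TRUE)) # equals the 99th percentile of the non-missing values
print(sum(is.na(values_wins)))        # the missing values are preserved
```

Under this approach the number of NA values has no effect on the cut-offs, which is the behaviour the report expected.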
Edit: Some further investigation makes me think winsorize is not calculating percentiles correctly at all when I compare it to my own calculations using quantile.