easystats / datawizard

Magic potions to clean and transform your data 🧙
https://easystats.github.io/datawizard/

Winsorize does not ignore missing values #541

Closed tbbarr closed 2 days ago

tbbarr commented 2 weeks ago

Hi there,

It seems the winsorize function does not have the common na.rm option that most R stats functions have. As a result, NA values appear to mess up the winsorisation.

Consider the example below:

values <- rnorm(1000)
values <- c(values, rep(NA, 1000))
print(max(values, na.rm=TRUE))
[1] 3.639574
values_wins <- winsorize(values, percentile=0.01)
print(max(values_wins, na.rm=TRUE))
[1] 3.639574  # Should be approx 2.3 based on z-scores

I generate 1,000 random values and add 1,000 missing values. Winsorising at the 1% level should set the max lower than the unwinsorised max (i.e. to the 99th percentile of the data), but here winsorising does nothing. I would have thought that adding just 50 or so missing values would be enough to completely negate the winsorisation if they were simply treated as being in the tails (since that's more than 1% of the values on either tail). In practice you need to add about 300 NA values before the winsorisation does nothing, so I'm not really sure how NA values are being treated.

Edit: Some further investigation makes me think winsorize is not calculating percentiles correctly at all when I compare it to my own calculations using quantile.
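
For reference, the behaviour expected above (cut-offs taken from the non-missing values, NA values left in place) can be sketched in base R. This is a minimal sketch only: `winsorize_sketch` is a hypothetical helper built on `quantile()`, not datawizard's implementation, and its cut-offs may differ slightly from those of `winsorize()`.

```r
# Hypothetical helper for illustration; NOT datawizard's implementation.
# Percentile winsorisation that ignores missing values.
winsorize_sketch <- function(x, threshold = 0.01) {
  # Cut-offs are computed from the non-missing values only
  bounds <- quantile(x, probs = c(threshold, 1 - threshold), na.rm = TRUE)
  # pmin()/pmax() propagate NA, so missing values stay missing
  pmax(pmin(x, bounds[[2]]), bounds[[1]])
}

set.seed(1)
values <- c(rnorm(1000), rep(NA, 1000))
values_wins <- winsorize_sketch(values, threshold = 0.01)

max(values, na.rm = TRUE)       # unwinsorised maximum
max(values_wins, na.rm = TRUE)  # equals the 99th percentile of the non-missing values
sum(is.na(values_wins))         # the NA values are left untouched
```

With this approach the number of NA values has no effect on the cut-offs, which is the behaviour the example above was testing for.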

etiennebacher commented 3 days ago

@mattansb do you know how handling of missing values should be done in winsorize()?

mattansb commented 2 days ago

I think they should be ignored.

tbbarr commented 2 days ago

Edit: Sorry! I'm using the wrong argument input; it should be threshold not percentile!

~While we're here I mentioned in my edit that I don't think percentiles are being calculated correctly at all. To illustrate:~

library(datawizard)

set.seed(1)

values <- rnorm(1000)
print(max(values, na.rm=TRUE))
[1] 3.810277
values_wins <- winsorize(values, percentile=0.01)
print(max(values_wins, na.rm=TRUE))
[1] 0.852815

# No missing values here but this max value is far smaller than what we'd
# expect the winsorised max value to be.

# I don't recall if percentile=0.01 is 0.5th percentile or 1st percentile
# but it doesn't make a difference for this example
print(quantile(values, probs = 0.01))
       1% 
-2.424401 
print(quantile(values, probs = 0.99))
     99% 
2.308112 

# 0.8 is closer to winsorising at the 20th percentile
print(quantile(values, probs = 0.2))
       20% 
-0.8815065 
print(quantile(values, probs = 0.8))
     80% 
0.853734 

tbbarr commented 2 days ago

Sorry for the confusion: this stems from my using the wrong function argument (percentile=0.01 instead of threshold=0.01). The issue goes away once that is fixed, and NA values are not a problem. It might be worth adding a warning for anyone who makes this mistake, but feel free to close the issue.
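
One plausible reason the misspelled argument went unnoticed is that an R function accepting `...` silently absorbs unknown named arguments and falls back to its defaults. The sketch below illustrates that mechanism and one way a warning could be raised. Both functions are hypothetical and written in base R; the default of 0.2 is an assumption chosen to match the 20th-percentile result seen in the thread, and nothing here is taken from datawizard's source.

```r
# Hypothetical functions for illustration; NOT datawizard code.
# The default threshold of 0.2 is an assumption, not a documented value.
clip_sketch <- function(x, threshold = 0.2, ...) {
  bounds <- quantile(x, probs = c(threshold, 1 - threshold), na.rm = TRUE)
  pmax(pmin(x, bounds[[2]]), bounds[[1]])
}

# Same clipping, but warns when extra arguments land in `...`
clip_checked <- function(x, threshold = 0.2, ...) {
  extra <- list(...)
  if (length(extra) > 0) {
    warning("Unused argument(s) ignored: ",
            paste(names(extra), collapse = ", "))
  }
  clip_sketch(x, threshold = threshold)
}

set.seed(1)
values <- rnorm(1000)

# The misspelled argument is absorbed by `...`, so the default threshold is used
identical(clip_sketch(values, percentile = 0.01), clip_sketch(values))

# Capture the warning from the checked variant to show its text
msg <- tryCatch(
  clip_checked(values, percentile = 0.01),
  warning = function(w) conditionMessage(w)
)
print(msg)
```

A check of this kind makes the mistake visible at the call site instead of returning a result computed with an unintended threshold.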