easystats / performance

:muscle: Models' quality and performance metrics (R2, ICC, LOO, AIC, BF, ...)
https://easystats.github.io/performance/

`check_outliers`: Silent malfunction with missing values and `method = mahalanobis` #466

Closed · rempsyc closed this 2 years ago

rempsyc commented 2 years ago

[Note: This is just a placeholder and PR #443 fixes this.]


Colleagues I have convinced to use check_outliers have wondered why the Mahalanobis method never flags any outliers in their data. It turns out that in the presence of NA values something goes wrong, even with the strictest of thresholds:

# Load package
devtools::load_all()
#> ℹ Loading performance

check_outliers(airquality, "mahalanobis")
#> OK: No outliers detected.
#> - Based on the following method and threshold: mahalanobis (14.45).
#> - For variables: Ozone, Solar.R, Wind, Temp, Month, Day
check_outliers(na.omit(airquality), "mahalanobis", threshold = 15)
#> 3 outliers detected: cases 7, 34, 77.
#> - Based on the following method and threshold: mahalanobis (15).
#> - For variables: Ozone, Solar.R, Wind, Temp, Month, Day.

check_outliers(airquality, "mahalanobis", threshold = 1)
#> OK: No outliers detected.
#> - Based on the following method and threshold: mahalanobis (1).
#> - For variables: Ozone, Solar.R, Wind, Temp, Month, Day

x <- rbind(mtcars, c(NA, rep(1, 10)))

check_outliers(x, "mahalanobis", threshold = 1)
#> OK: No outliers detected.
#> - Based on the following method and threshold: mahalanobis (1).
#> - For variables: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb

So it turns out that colleagues may have erroneously reported no outliers simply because they had a single missing value, since there is no warning to that effect. This behaviour is not new; the cause seems to lie within the base R stats::mahalanobis() function:

stats::mahalanobis(x, center = colMeans(x), cov = stats::cov(x))
#>           Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive 
#>                  NA                  NA                  NA                  NA 
#>   Hornet Sportabout             Valiant          Duster 360           Merc 240D 
#>                  NA                  NA                  NA                  NA 
#>            Merc 230            Merc 280           Merc 280C          Merc 450SE 
#>                  NA                  NA                  NA                  NA 
#>          Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental 
#>                  NA                  NA                  NA                  NA 
#>   Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla 
#>                  NA                  NA                  NA                  NA 
#>       Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28 
#>                  NA                  NA                  NA                  NA 
#>    Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
#>                  NA                  NA                  NA                  NA 
#>      Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
#>                  NA                  NA                  NA                  NA 
#>                  33 
#>                  NA
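(As an aside, the NAs come from the summary statistics rather than from the distance computation itself: with their defaults, colMeans() and stats::cov() return NA as soon as a single value is missing, so every distance is NA. A minimal illustration, not part of the original reprex:)

# The centre and covariance already contain NA, so mahalanobis() can only return NA
colMeans(x)[["mpg"]]
#> [1] NA
stats::cov(x)["mpg", "mpg"]
#> [1] NA

# Dropping the incomplete row first gives finite distances again
x_complete <- na.omit(x)
head(stats::mahalanobis(x_complete, center = colMeans(x_complete), cov = stats::cov(x_complete)))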

How would you suggest addressing this? Should we throw a warning and use na.omit() (or a variant thereof), or just throw an error and ask people to deal with it beforehand? The latter seems to be how the other multivariate methods “deal” with it, so that is what I did for now:

check_outliers(x, "mahalanobis_robust")
#> Error in svd(scale(x)): infinite or missing values in 'x'
check_outliers(x, "mcd")
#> Error in MASS::cov.rob(x, quantile.used = percentage_central * nrow(x), : missing or infinite values are not allowed
check_outliers(x, "ics")
#> 'check_outliers()' does not support models of class 'data.frame'.
#> Error in .check_outliers_ics(x, threshold = thresholds$ics, ID.names = ID.names): trying to get slot "ics.dist.cutoff" from an object of a basic class ("NULL") with no slots
check_outliers(x, "optics")
#> Error in dbscan::kNN(x, k, sort = TRUE, ...): data/distances cannot contain NAs for kNN (with kd-tree)!
check_outliers(x, "lof")
#> Error in dbscan::lof(x, minPts = ncol(x)): NAs not allowed for LOF using kdtree!
check_outliers(x, "mahalanobis", error = TRUE)
#> Error in .check_outliers_mahalanobis(x, threshold = thresholds$mahalanobis, : NA values are not allowed for the Mahalanobis method.

(Note: the error argument was just for the reprex and I have changed it to be the default behaviour.)
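(For illustration, the guard that is now the default amounts to something like the following; the exact code is in PR #443, so treat this as a sketch rather than the actual implementation:)

# Sketch only: fail early when the input contains missing or infinite values,
# mirroring the error shown in the reprex above
if (anyNA(x) || any(is.infinite(as.matrix(x)))) {
  stop("NA values are not allowed for the Mahalanobis method.", call. = FALSE)
}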

bwiernik commented 2 years ago

Yeah, we should na.omit(). I'm not sure that a message is needed in every case, but if multiple messages are called for, we should give a message if someone asks for the omnibus outlier detection.

rempsyc commented 2 years ago

Current behaviour is that the function fails as soon as an error is encountered by one of the multivariate functions, so there are no warnings, just an error message (and just one).

check_outliers(x, c("mahalanobis_robust", "mcd", "ics", "optics", "lof", "mahalanobis"))
Error in svd(scale(x)) : infinite or missing values in 'x'

Are you saying that not only for Mahalanobis but for all methods we should default to na.omit() + a warning, so that no errors are thrown at all? That way, we could print the warning once at the very start of the pipeline, when the data is filtered for NAs, resulting in a single message and in methods that no longer fail.
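(As a rough sketch of that pipeline idea; the function and message below are illustrative only, not existing performance code:)

# Sketch: filter once at the top of the pipeline, message once,
# then hand the complete cases to every requested method
run_methods <- function(data, methods) {
  complete <- stats::na.omit(data)
  if (nrow(complete) < nrow(data)) {
    message(nrow(data) - nrow(complete),
            " row(s) with missing values removed before outlier detection.")
  }
  # ... dispatch each method in `methods` on `complete` ...
  invisible(complete)
}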

bwiernik commented 2 years ago

I'm not exactly sure of the behavior for various components. For mahalanobis, na.omit sounds fine.

Mostly I think folks should not check data pre-hoc, but should focus on things like ELPD or Cook's D for post-model influence diagnostics.
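(For reference, check_outliers() already supports that post-model workflow: it accepts fitted models, with "cook" and "pareto" among the available methods. A small example, assuming a plain lm() fit:)

# Influence diagnostics on a fitted model rather than on the raw data
m <- lm(mpg ~ wt + hp, data = mtcars)
check_outliers(m, method = "cook")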

rempsyc commented 2 years ago

I feel like best practice would be to be consistent, so people generally know what to expect: either na.omit() for all methods, or an error for all methods. Or at least consistency within the multivariate methods, since they all fail in a similar way (the univariate methods do not, because they go column by column).

Perhaps a compromise would be to default to an error with a message about the missing values, and then also add an optional na.rm argument (as for mean(), etc.) that the error message could suggest using. Thoughts?
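(A sketch of that compromise, with a hypothetical na.rm argument that check_outliers() does not currently have:)

# Sketch only: error by default, with an explicit opt-in to drop incomplete rows
check_multivariate_input <- function(x, na.rm = FALSE) {
  if (anyNA(x)) {
    if (!na.rm) {
      stop("Missing values detected. Remove or impute them first, or set `na.rm = TRUE`.",
           call. = FALSE)
    }
    x <- stats::na.omit(x)
  }
  x
}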

strengejacke commented 2 years ago

Is there any method that allows NA values to be included? If not, we should silently drop them and state this in the docs.

rempsyc commented 2 years ago

It seems that the univariate methods do support NA values in the current iteration:

devtools::load_all()
#> ℹ Loading performance
x <- rbind(mtcars, c(NA, rep(1, 10)))
performance::check_outliers(x$mpg, method = "zscore")
#> Warning in check_outliers.data.frame(x, method = method, threshold = threshold, : 
#>   Some values are missing or infinite. Please handle them beforehand, such
#>   as with the `na.omit()` function or through imputation (in particular
#>   for multivariate methods, which will produce an error otherwise).
#> 2 outliers detected: cases 18, 20.
#> - Based on the following method and threshold: zscore (1.96).
#> - For variable: x$mpg.
#> 
#> ------------------------------------------------------------------------
#> Outliers per variable (zscore): 
#> 
#> $`x$mpg`
#>    Row Distance_Zscore
#> 18  18        2.042389
#> 20  20        2.291272

Created on 2022-08-14 by the reprex package (v2.0.1)

So drop them silently, but for the multivariate methods only, and mention it in the docs?

rempsyc commented 2 years ago

For now, I have decided to favour consistency with the other multivariate methods, so I have simply added an error in the presence of missing (or infinite) values for Mahalanobis. We can always change this later if/when we reach a consensus.

Note: I had initially added a general warning when missing values are present, suggesting to handle them beforehand. However, I removed it, since I sense a preference above for avoiding warnings where possible.

bwiernik commented 2 years ago

My preference is messages for general information, and errors if something actually is a problem or makes the result suspect. I usually find warnings ambiguous as to which of those two categories they fall under.

strengejacke commented 2 years ago

message: the weather is nice today
warning: the weather is nice today, but it could start raining
error: you wanted sunshine to go out, but it's raining like hell, stay inside

DominiqueMakowski commented 2 years ago

I don't like overly verbose stuff ~~like the invasive SUGGESTION in parameters that my coefficients are in log odds and that I might want to transform them YES I KNOW thank you but no thank you, there's no need to spam me with that every time I want to look at the parameters /rantfinished~~

strengejacke commented 2 years ago

That's why I added an option to turn off those messages just for you ❤️, and we should probably document the available options and make them more visible.

strengejacke commented 2 years ago

(This commit changes the behaviour so that the log-odds warning is only shown once per session. I also saw that the "global options" are documented in ?model_parameters and ?print.parameters_model, hopefully clearly visible?)
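(For the curious, one common pattern for a once-per-session message is a session flag along these lines; the option name here is made up, the real, documented options are those in ?model_parameters / ?print.parameters_model:)

# Sketch: remember via an option that the message has already been shown
message_once <- function(msg, flag = "mypkg_message_shown") {
  if (!isTRUE(getOption(flag))) {
    message(msg)
    opt <- list(TRUE)
    names(opt) <- flag
    options(opt)
  }
  invisible(NULL)
}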

DominiqueMakowski commented 2 years ago

awesome!

bwiernik commented 2 years ago

> message: the weather is nice today
> warning: the weather is nice today, but it could start raining
> error: you wanted sunshine to go out, but it's raining like hell, stay inside

Strong disagree. "The weather is nice today" should just be printed text, not an exception at all.

strengejacke commented 2 years ago

But a message is not an exception?

bwiernik commented 2 years ago

message(), warning(), and stop() are the three kinds of exceptions

strengejacke commented 2 years ago

But we switched from print() to message() to allow suppressMessages(), e.g. for use in other packages?
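(Quick illustration of that point: a message() can be silenced by the caller, whereas plain print() output cannot.)

say_it_print   <- function() print("The weather is nice today")
say_it_message <- function() message("The weather is nice today")

suppressMessages(say_it_print())    # still prints
suppressMessages(say_it_message())  # silent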

strengejacke commented 2 years ago

> Strong disagree. "The weather is nice today" should just be printed text, not an exception at all.

And I thought your main concern was about warning(), not message(). So would you prefer to just use stop()?