Closed rempsyc closed 2 years ago
Yeah we should na.omit(). Not sure that a message is needed in every case, but if multiple messages are called, we should give a message if someone asks for the omnibus outlier detection.
Current behaviour is that the function fails as soon as an error is encountered by one of the multivariate functions, so there are no warnings, just an error message (and just one).
check_outliers(x, c("mahalanobis_robust", "mcd", "ics", "optics", "lof", "mahalanobis"))
Error in svd(scale(x)) : infinite or missing values in 'x'
Are you saying that not only for mahalanobis, but for all methods, we should default to na.omit
+ warning so that no errors are thrown at all? In this way, we could print the warning at the very start of the pipeline when the data is filtered for NAs, thus resulting in a single message but also methods that are not failing (anymore).
I'm not exactly sure of the behavior for various components. For mahalanobis, na.omit sounds fine.
Mostly I think folks should not check data pre-hoc but focus on things like elpd or Cook's D for post-model influence diagnostics
I feel like best practice would be to be consistent so people can generally know what to expect: either na.omit
for all methods, or throw an error for all methods. Or at least consistent within the multivariate methods, since they all fail similarly (but the univariate methods do not because they go column by column).
Perhaps a compromise would be to default to an error with a message about the missing values. And then, we would also add an optional na.rm
argument (like for mean
, etc.), that the warning could suggest using to get rid of the warning. Thoughts?
Is there any method that allows NA to be included? If not, we should silently drop them and state in the docs.
Seems like univariate methods do support NA values in the current iteration:
devtools::load_all()
#> ℹ Loading performance
x <- rbind(mtcars, c(NA, rep(1, 10)))
performance::check_outliers(x$mpg, method = "zscore")
#> Warning in check_outliers.data.frame(x, method = method, threshold = threshold, :
#> Some values are missing or infinite. Please handle them beforehand, such
#> as with the `na.omit()` function or through imputation (in particular
#> for multivariate methods, which will produce an error otherwise).
#> 2 outliers detected: cases 18, 20.
#> - Based on the following method and threshold: zscore (1.96).
#> - For variable: x$mpg.
#>
#> ------------------------------------------------------------------------
#> Outliers per variable (zscore):
#>
#> $`x$mpg`
#> Row Distance_Zscore
#> 18 18 2.042389
#> 20 20 2.291272
Created on 2022-08-14 by the reprex package (v2.0.1)
So drop silently, but for multivariate methods only, and mention in the docs?
For now, I have decided to favour consistency with other multivariate methods, so I have simply added an error in the presence of missing (or infinite) values for Mahalanobis. We can always change it later if needed if/when we reach a consensus.
Note: I initially added a general warning when missing values are present suggesting to handle them beforehand. However, I removed it since I feel a preference above to avoid warnings if possible.
My preference is messages for general information and errors if it is something that actually is a problem or makes the result suspect. Usually I find warnings ambiguous as to which of those categories they fall under
message: the weather is nice today warning: the weather is nice today, but it could start raining error: you wanted sunshine to go out, but it's raining like hell, stay inside.
I don't like overly verbose stuff ~like the invasive SUGGESTION in parameters that my coefficients are in log odds and that I might to want to transform them YES I KNOW thank you but no thank you there's no need to spam me that everytime I want to look at the parameters /rantfinished~
That's why I added an option to turn off those message just for you ❤️ , and we probably should document the available option and make them more visible.
(this commit now changes the behaviour that the log-warning is only shown once per session - and I saw that the "global options" are documented in ?model_parameters
and ?print.parameters_model
, hopefully clearly visible?
awesome!
message: the weather is nice today warning: the weather is nice today, but it could start raining error: you wanted sunshine to go out, but it's raining like hell, stay inside.
Strong disagree. "The weather is nice today" should just be printed text, not an exception at all.
But a message is no exception?
message()
, warning()
, and stop()
are the three kinds of exceptions
But we switched from print()
to message()
to allow suppressMessages()
, e.g. for use in other packages?
Strong disagree. "The weather is nice today" should just be printed text, not an exception at all.
and I thought your main concern was about warning()
, not message()
. So you prefer just to use stop()
?
[Note: This is just a placeholder and PR #443 fixes this.]
Colleagues I have convinced of using
check_outliers
have wondered why mahalanobis never finds any missing data. It seems that in the presence of NA values, something goes wrong, even with the most strict of thresholds:So turns out that colleagues might have erroneously reported no outliers just because they had a single missing value because there is no appropriate warning to this effect. This outcome is not new; the cause seems to lie within the base R mahalanobis function:
How would you suggest addressing this? Should we throw a warning and use
na.omit
or a variant thereof, or just throw an error and ask people to deal with it beforehand? Seems like the way other multivariate methods “deal” with it, so that’s what I did for now.(Note: the
error
argument was just for the reprex and I have changed it to be the default behaviour.)