Open bkamins opened 1 year ago
Would make sense for mean
as well. Currently, empty mean
returns NaN
for numbers and throws for non-numbers.
Btw, sum
lets you specify the default instead of zero, but the argument is called init
there.
Currently, empty mean returns NaN for numbers and throws for non-numbers.
Yes, but at least it does not throw an error in standard case (but indeed we could allow for providing the other value of default as NaN
, while correct is highly non-standard for Julia data processing)
called
init
there
The reason is that sum
is a reduction, while e.g. median
is not a reduction, so that is why init
is not a best name in this case. But indeed - I was not sure what would be the best name of the kwarg.
Yes, but at least it does not throw an error in standard case
Would be weird to have default
in median
, but not in mean
. And I also wanted this functionality a couple of times!
NaN, while correct is highly non-standard for Julia data processing
Dunno, I like NaN
s the majority of time, it's the only way to signal "there's no number here" in a type-stable way. I mostly wanted to use default
for values like 0 or something similar, when I need an actual number.
that is why init is not a best name in this case
Yeah, init
doesn't seem to fit well for mean/median, I agree. Just pointed out what exists elsewhere.
FWIW, R returns NULL
which is roughly equivalent to nothing
.
For mean
, NaN
seems the only right default choice, considering that mean(x) = sum(x) / length(x)
and 0.0 / 0
is NaN
.
However, if these functions all returned nothing
for empty inputs, setting a default would be a straightforward something(median(x), 0.0)
. It's already common for Julia functions to return nothing
indicating "no result" in some sense, eg findfirst
-like functions.
nothing
is also great in that functions deliberately don't propagate it further. So when a user gets nothing
and forgets to handle it, he gets an exception very soon.
These are all good ideas. But then the question is how to convert it to API in a non-breaking way? We probably need a kwarg that would allow to switch behavior.
Maybe, median
/quantile
can simply return nothing
instead of throwing? Seems nonbreaking, and allows for easy default selection through something(...)
.
Another idea is a general wrapper function so that the interface looks like onempty(median, xs, default=0.0)
.
Seems nonbreaking, and allows for easy default selection through something(...).
It would be non-breaking, but I feared proposing it for such a common function 😄.
a general wrapper function so that the interface looks like
onempty(median, xs, default=0.0)
.
This general wrapper function would require special casing for each function passed as its first argument to be efficient.
The point is that isempty(skipmissing(x))
is O(n) operation in general:
julia> x = missings(Int, 10^8);
julia> @time isempty(skipmissing(x))
0.077613 seconds (1 allocation: 16 bytes)
true
But maybe it is good enough?
The point is that isempty(skipmissing(x)) is O(n) operation in general
Good point! I mostly considered applying functions to an actual materialized array, not to skipmissing
.
An advantage of onempty
vs either median(; default)
or something(median(), default)
is that it makes clear when the default value is used - exactly when the input is empty. The other two variants read more ambiguous regarding when the default gets returned.
OTOH, advantages of something(median(), ...)
is that it doesn't involve creating any extra functions, and is trivially efficient wrt isempty(...)
checks.
@nalimilan - what do you think?
I don't think there's a precedent of a function that returns nothing
when the input is empty, right? So adding a default
keyword argument seems simpler. It's too bad we can't call it init
for consistency with sum
.
Some Julia functions throw an exception when no result is available (eg first
), some do return nothing
(eg findfirst
).
Btw, first
-or-default and only
-or-default functions would also useful to have...
In general where possible I personally prefer special value-based behavior rather than error-based. For median
and quantile
returning nothing
is safe.
However, I understand the concern of @nalimilan that it requires a careful design so that we have one common solution rather than many special cases. E.g. nothing
as a special value works for mean
, median
, quantile
, and e.g. std
, but it is not a good value for sum
as:
julia> isnothing(sum([nothing]))
true
It's unfortunate, but is a fact, that functions in Julia tend not to really consider that nothing
can be contained within a collection. For example, findfirst
returns nothing
when it didn't find anything, or when it found the result, and its index is nothing
:
# can distinguish these two:
julia> findfirst(x -> true, Dict(1 => 2))
1
julia> findfirst(x -> false, Dict(1 => 2))
# but not these two:
julia> findfirst(x -> true, Dict(nothing => 2))
julia> findfirst(x -> false, Dict(nothing => 2))
Potentially, median
could reasonably be extended to non-numbers in the future, and return the only element for length-1 collections. But I personally don't see it as a major issue...
Following this https://discourse.julialang.org/t/summarizing-grouped-dataframe-where-a-group-is-entirely-missing/90920/20?u=mreichmpi-bgc my 5 cents from a user perspective (and with a Tidyverse background as the most recent):
I like the general wrapper on_empty
. And I think the model of purrr
in Tidyverse of having wrappers for capturing side effects is quite convenient in data analysis, e.g. safely
, possibly
and quietly
etc. Cf. Capture side effects. — safely • purrr.
In general I believe the user base will expand if such function are added, even at the expense of performance or other design aspects.
I was thinking what is best and came to the conclusion that we could define somewhere:
withempty(f, default) = x -> isempty(x) ? default : f(x)
(name is tentative)
With skipmissing
the performance will not be optimal, but I think most of the time users will not complain (you would need to have a vector of all missing
values with millions of such elements to notice the problem.
In this post I have shown an example usage.
I also like this solution, but wonder if it's possible to generalize in the following sense, while keeping simple and intuitive.
Often, one wants to compute mean
/median
/std
/whatever only when at least several values are available. If you are interested in the distribution location, mean
of a single sample isn't of much use - even more so for scale measurements such as std
.
From time to time, I find myself doing things like length(x) >= 4 ? mean(x) : NaN
. Conceptually, this seems very similar to the ifempty
check discussed here.
Maybe, there's some clean solution to both?.. Anything to borrow from R/others?
We could define:
withcondition(f; default=missing, condition=isempty) = x -> condition(x) ? default : f(x)
(I proposed adding defaults, but we could skip them).
Then, by default we would have "return missing
if empty, but one could write withcondition(median, default=NaN, condition=x->length(x) < 5)
to get what you want.
That sounds overkilll to me. The point of such a function is to be simple and convenient, as users can easily define other functions if needed. Actually we could almost have emptymissing(f) = x -> isempty(x) ? missing : f(x)
to make things even simpler for the common case (and add that to Missings.jl).
I would be also OK with such a solution as it likely covers most of the use cases.
After reading @bkamins blog post I came here to suggest
whenever(predicate, f, default) = x -> predicate(x) ? default : f(x)
It works for isempty
, isnothing
, ismissing
, isnan
. I like the positional version more than the kwargs version suggested above. I'm not committed to the name at the moment.
whenever(predicate, f, default)
is only 3 characters shorter than x -> predicate(x) ? default : f(x)
so I don't really see the point of adding such a general function.
I don't really see the point of adding such a general function.
There are two points that can be considered:
(x -> predicate(x) ? default : f(x))
in practice, and these extra (
and )
create visual noise and are easy to forget.x -> predicate(x) ? default : f(x)
will be re-compiled every time it is written, while whenever(predicate, f, default)
is compiled onceBTW, I'm not totally sold on the argument order being f, default
. whenever(predicate, value, function)
seems closer to the familiar if-then-else
ordering. This ambiguity is the downside of positionals.
Tangential, but I don't really like that the predicate functions we have are positive in the degenerate case rather than the good case. If there were better words for nonmissing
/nonempty
/nonnan
that could be nicer than the current system.
We could define:
withcondition(f; default=missing, condition=isempty) = x -> condition(x) ? default : f(x)
(I proposed adding defaults, but we could skip them).
I really like that most. I like, when there are reasonable, intuitive defaults that cover most of the cases, but then allow the user to decide to make things as complicated as she/he wants.
If would probably like to see these three functions (at least):
catchCondition(f; condition=isempty, default=missing) = x -> condition(x) ? default : f(x)
catchAny(f; condition=isempty, default=missing) = x -> condition(x) ? default : try f(x) catch e default end
catchMissing(f; catchFun=catchCondition) = x -> catchFun(f)(skipmissing(x))
The last one would just allow avoiding typing skipmissing
everytime. Just for convenience, but I think important for easy usage.
I would like to be able to pass the kwargs
from catchMissing
also to the other functions, but somehow catchMissing(f; catchFun=catchCondition, kwargs...)
did not work (sorry too new to Julia...).
Tangential, but I don't really like that the predicate functions we have are positive in the degenerate case rather than the good case. If there were better words for nonmissing/nonempty/nonnan that could be nicer than the current system.
Agree - that's why I'd suggest having "catch" in the name, which makes it more intuitive logically IMO.
whenever(predicate, f, default) = x -> predicate(x) ? default : f(x)
It works for isempty, isnothing, ismissing, isnan. I like the positional version more than the kwargs version suggested above. I'm not committed to the name at the moment.
I really like the keyword version with defaults much more. Also I think for the user it is very important to have a consistent design, e.g. here always have f
at the beginning. This is what makes tidyverse
so attractive I believe.
Another idea for a general and convenient interface, that encompasses both isempty and length checks:
iflength(median, 0 => NaN)(x) # NaN if x is empty
iflength(median, 0:3 => NaN)(x) # NaN if length(x) is <= 3
Also, no matter what predicates are incorporated, I believe
default=missing
shouldn't be included by default. If anything, the default value should be nothing
as common in Julia elsewhere: at least, nothing
immediately throws on almost any operation instead of silently propagating, and cannot be forgotten.
Or (even better, I think) the default
argument shouldn't have a default value ( :) ) at all, so that users always choose explicitly. Two common propagating not-a-value placeholders are NaN and missing.
Apart from the general solution discussed here, in https://github.com/JuliaData/Missings.jl/pull/143 I propose adding emptymissing
as @nalimilan since it covers the most common problematic case.
Thanks @bkamins. Regarding the discussion about more general suggestions, note that this issue is currently in Statistics.jl, but it wouldn't be super logical to add such a general function in this package IMO. It's important to think about where it would go if that function was to be added.
It's important to think about where it would go if that function was to be added.
I agree. That is why I thought adding emptymissing
in Missings.jl was a clean choice.
I thought, that once a consensus is reached about the API, a decision can be made where such a function should live (maybe even Base Julia if it is general enough)
In generic code the fact that the following operations error:
is inconvenient (as the only work-around is try-catch). Maybe we could allow passing
default
kwarg that would be returned when the passed collection is empty?