Open nalimilan opened 5 years ago
From https://discourse.julialang.org/t/how-to-calculate-a-weighted-mean-with-missing-observations/19281/12?u=tkoolen, another option is to simply nix the dims
keyword arguments and rely on eachslice
instead. For example,
[mean(skipmissing(x)) for x in eachslice(A, dims=1)]
instead of mean(skipmissing(A), dims=1
with special methods for SkipMissing
.
Quoting myself from https://discourse.julialang.org/t/how-to-calculate-a-weighted-mean-with-missing-observations/19281/14?u=tkoolen:
https://github.com/JuliaLang/julia/pull/28027 fixes
mapreduce
and the functions that use it likesum
andprod
, but you’d have to do the same forreduce
. Andmean
doesn’t actually usemapreduce
as far as I can tell, so you’d have to fix that as well. How many others are there? On top of that, StaticArrays has to reimplement these functions with dims arguments as well, so there’s even more implementation cost and opportunity for bugs. All of this could be avoided by just not having the dims keyword arguments and using theeachslice
API to separate these concerns.
Please discuss this at https://github.com/JuliaLang/julia/pull/28027 instead. To also cross-post myself:
reduce
callsmapreduce(identity, ...)
, andmean
callssum
, which callsreduce
. I'm not sure why StaticArrays would have to reimplement all of this, since that code should work for anyAbstractArray
.The
eachslice
approach is also useful, but I don't see it as a good alternative since it cannot be efficient when reducing over a non-contiguous slice (e.g. rows for a column-major matrix), contrary to themapreduce
implementation which processes entries in the most efficient order, accumulating results in parallel for all dimensions if needed. Alsoeachslice
is significantly more verbose and the syntax is quite different from thesum(x, dims=i)
one that people are encouraged to use. I think both could coexist:eachslice
is more flexible, but sometimes slower.
So essentially this makes SkipMissing
into a general wrapper which behaves like the wrapped collection but indicates to any functions operating on it that they should ignore missings rather than treating them as poisonous, which is the default.
It would behave as much as possible like the wrapped collection, but the behaviours would be identical only in the absence of missing values. When there are missing values, keys
would skip some indices, and getindex
would throw an error accessing them. There would still be other differences, e.g. collect
cannot return a multidimensional array since skipping missing values would introduce "empty" entries.
Maybe it's time to revisit indexing iterators in general? A lot of algorithms that work on AbstractArrays also theoretically work on iterators over AbstractArrays (zips, products, generators). This is hard to express semantically in type system. Maybe it makes more sense to rely on an Indexable
trait in most situations.
Please try to stay on topic and not hijack the issue into another discussion.
This mini Julep aims to address issues which currently block progress regarding two essential use cases of
skipmissing
. The first one is how to compute reductions over dimensions of an array, skipping missing values (https://github.com/JuliaLang/julia/pull/28027). The other is how to find the index of the maximum/minimum value in an array, skipping missing values (https://github.com/JuliaLang/julia/issues/29305).The solution I suggest is to make
SkipMissing
an "enhanced iterator" which would also support indexing. More specifically, it would use the same indices as the original array:keys(itr::SkipMissing)
would return(i for i in keys(itr.x) if !ismissing(itr.x[i]))
, or an equivalent iterator.getindex
using indices of the original array would either return the corresponding value, or throw an error if the value is missing:getindex(itr::SkipMissing, i...) = (v = itr.x[i...]; ismissing(v) ? throw(...) : v)
Consistently with this proposal,
argmax(skipmissing(x))
would return the indexi
so thatx[i]
is the highest non-missing value inx
(fixing https://github.com/JuliaLang/julia/issues/29305). Andreduce(skipmissing(x), +, dims=i)
would compute the sum of non-missing values over dimensioni
ofx
, with the same shape asreduce(x, +, dims=i)
(fixing https://github.com/JuliaLang/julia/pull/28027). In both cases, ifx
contains no missing values, the result would be indistinguishable from applying the operation directly tox
.IteratorSize(SkipMissing)
would still returnSizeUnknown
, since computing the length is an O(N) operation.