JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License
45.78k stars 5.49k forks source link

Mini Julep: skipmissing indexing #30606

Open nalimilan opened 5 years ago

nalimilan commented 5 years ago

This mini Julep aims to address issues which currently block progress regarding two essential use cases of skipmissing. The first one is how to compute reductions over dimensions of an array, skipping missing values (https://github.com/JuliaLang/julia/pull/28027). The other is how to find the index of the maximum/minimum value in an array, skipping missing values (https://github.com/JuliaLang/julia/issues/29305).

The solution I suggest is to make SkipMissing an "enhanced iterator" which would also support indexing. More specifically, it would use the same indices as the original array:

Consistently with this proposal, argmax(skipmissing(x)) would return the index i so that x[i] is the highest non-missing value in x (fixing https://github.com/JuliaLang/julia/issues/29305). And reduce(skipmissing(x), +, dims=i) would compute the sum of non-missing values over dimension i of x, with the same shape as reduce(x, +, dims=i) (fixing https://github.com/JuliaLang/julia/pull/28027). In both cases, if x contains no missing values, the result would be indistinguishable from applying the operation directly to x.

IteratorSize(SkipMissing) would still return SizeUnknown, since computing the length is an O(N) operation.

tkoolen commented 5 years ago

From https://discourse.julialang.org/t/how-to-calculate-a-weighted-mean-with-missing-observations/19281/12?u=tkoolen, another option is to simply nix the dims keyword arguments and rely on eachslice instead. For example,

[mean(skipmissing(x)) for x in eachslice(A, dims=1)]

instead of mean(skipmissing(A), dims=1 with special methods for SkipMissing.

Quoting myself from https://discourse.julialang.org/t/how-to-calculate-a-weighted-mean-with-missing-observations/19281/14?u=tkoolen:

https://github.com/JuliaLang/julia/pull/28027 fixes mapreduce and the functions that use it like sum and prod, but you’d have to do the same for reduce. Andmean doesn’t actually use mapreduce as far as I can tell, so you’d have to fix that as well. How many others are there? On top of that, StaticArrays has to reimplement these functions with dims arguments as well, so there’s even more implementation cost and opportunity for bugs. All of this could be avoided by just not having the dims keyword arguments and using the eachslice API to separate these concerns.

nalimilan commented 5 years ago

Please discuss this at https://github.com/JuliaLang/julia/pull/28027 instead. To also cross-post myself:

reduce calls mapreduce(identity, ...), and mean calls sum, which calls reduce. I'm not sure why StaticArrays would have to reimplement all of this, since that code should work for any AbstractArray.

The eachslice approach is also useful, but I don't see it as a good alternative since it cannot be efficient when reducing over a non-contiguous slice (e.g. rows for a column-major matrix), contrary to the mapreduce implementation which processes entries in the most efficient order, accumulating results in parallel for all dimensions if needed. Also eachslice is significantly more verbose and the syntax is quite different from the sum(x, dims=i) one that people are encouraged to use. I think both could coexist: eachslice is more flexible, but sometimes slower.

StefanKarpinski commented 5 years ago

So essentially this makes SkipMissing into a general wrapper which behaves like the wrapped collection but indicates to any functions operating on it that they should ignore missings rather than treating them as poisonous, which is the default.

nalimilan commented 5 years ago

It would behave as much as possible like the wrapped collection, but the behaviours would be identical only in the absence of missing values. When there are missing values, keys would skip some indices, and getindex would throw an error accessing them. There would still be other differences, e.g. collect cannot return a multidimensional array since skipping missing values would introduce "empty" entries.

bramtayl commented 5 years ago

Maybe it's time to revisit indexing iterators in general? A lot of algorithms that work on AbstractArrays also theoretically work on iterators over AbstractArrays (zips, products, generators). This is hard to express semantically in type system. Maybe it makes more sense to rely on an Indexable trait in most situations.

KristofferC commented 5 years ago

Please try to stay on topic and not hijack the issue into another discussion.