JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License
45.77k stars 5.49k forks source link

Allow `Base.filter` to work with all iterators #31188

Open pdeffebach opened 5 years ago

pdeffebach commented 5 years ago

This issue is motivated by this discussion on discourse.

My example of when this is a problem is when you are working with a subset of data that, by chance, does not contain missing values. When you work with the full data, you now have missing values and have to use skipmissing at various places in your code. Now all your filter commands are broken.

A heavy and breaking change would be to make Base.filter work on any iterator, and not just arrays. Other alternatives include pointing more people towards Iterators.filter for general use, or special-casing skipmissing so a collect call is included somewhere.

ararslan commented 5 years ago

One issue here is that filter is eager and, generally speaking, typeof(filter(f, x)) == typeof(x). That isn't really possible to implement in a general case, but I think it would be okay to define filter for SkipMissing to materialize and filter.

nalimilan commented 5 years ago

So I guess in that proposal filter(::SkipMissing{<:AbstractArray}) would work, but not with other wrapped iterators?

ararslan commented 5 years ago

That would make sense to me. SkipMissing is an odd enough case itself that I think special treatment of it would be fine. We could define it as

filter(f, s::SkipMissing) = filter(x->!ismissing(x) && f(x), s.x)

then let filter fail if the underlying iterator doesn't have filter defined on it.

pdeffebach commented 5 years ago

An alternative is that people could just change their filter commands to have ismissing, which is a bit more explicit anyways.

This could just be a documentation / tutorial problem.

ararslan commented 5 years ago

I was looking at implementing what I said above verbatim, but I realized that that definition leaves you with a container without any missing values but with an element type that allows missing. That seems suboptimal and is inconsistent with the result of filter(f, collect(s)).

This is difficult to do generically, since ideally you'd want to be able to use filter on any SkipMissing{T} where T supports filter and get a T back, but to provide useful and consistent behavior, I'm not sure how we'd swing it for anything but arrays...

nalimilan commented 5 years ago

Supporting filter only for AbstractArray sounds OK to me.

ararslan commented 5 years ago

Implemented in #31235.

nalimilan commented 5 years ago

One possible rationale for supporting filter(::SkipMissing{<:AbstractArray}) and not other methods is that SkipMissing{<:AbstractArray} is indexable. If an Indexable trait was added as discussed at https://github.com/JuliaLang/julia/pull/31020#discussion_r255292414, it would make sense for filter to support all types with that trait. That would include in particular Broadcasted objects.

ararslan commented 5 years ago

We can't support filter for arbitary iterables and we now have filter for SkipMissing-wrapped arrays, so is there anything actionable left or can this be closed now that #31235 is merged?

nalimilan commented 5 years ago

As I noted it could make sense to also support Broadcasted. Not sure whether that warrants keeping this open.

ararslan commented 5 years ago

Ah right, sorry. It might make sense to have that as a separate issue, since it's very specific (and a difficult situation to find yourself in, since presumably a broadcast will materialize before filter is called).

pdeffebach commented 5 years ago

Broadcasted would materialize an array while lifting missing elements? Would be great!

nalimilan commented 5 years ago

Yes that's in the context of https://github.com/JuliaLang/julia/pull/31088.

ssfrr commented 4 years ago

I just ran into this trying to filter a zip, which also doesn't work.

filter(x->x[2] > 0.5, zip(rand(10), rand(10)))

Is the main objection to having filter work for arbitrary iterators that we can't then guarantee that typeof(filter(f, x)) == typeof(x)? It doesn't seem that map makes that guarantee:

map(sum, zip(rand(10), rand(10)))

This returns a Vector{Float64}. It seems like returning a Vector{eltype(x)} in the case that x is an iterator seems like a pretty reasonable behavior.

ararslan commented 4 years ago

It doesn't seem that map makes that guarantee

Sure, but that's not the same operation. map is explicitly a transformation of the data into something different, while filter gives you back a subset of the existing data.

It seems like returning a Vector{eltype(x)} in the case that x is an iterator seems like a pretty reasonable behavior

The problem is defining what an "iterator" is. For example, one can iterate a tuple, but I'd be surprised if e.g. filter(isodd, (1,2,3)) gave me back an array.

nalimilan commented 4 years ago

The problem is defining what an "iterator" is. For example, one can iterate a tuple, but I'd be surprised if e.g. filter(isodd, (1,2,3)) gave me back an array.

But why would it return an array? Just like map returns a tuple for tuples, filter could fall back to an array for general iterators, but return another type for some iterable types.

ssfrr commented 4 years ago

Sure, but that's not the same operation. map is explicitly a transformation of the data into something different, while filter gives you back a subset of the existing data.

map transforms the individual elements, so I'd expect that the eltype would change, but that seems separate from whether if affects the container type.