JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

Base.unzip() #13942

Open ZacCranko opened 8 years ago

ZacCranko commented 8 years ago

Hi there,

apologies if this has already been addressed somewhere, but is there a reason that there is no unzip() function in Base?

Ideally this would be a function that would take a Vector{Tuple{ ... }} and return a Tuple{Vector, ..., Vector} for output. E.g.

julia> v = [(1,"a",:meow), (2,"b",:woof), (3,"c",:doh!)]; unzip(v)
([1,2,3],ASCIIString["a","b","c"],[:meow,:woof,:doh!])

A naive implementation might be something like

function unzip(input::Vector)
    n = length(input)
    types  = map(typeof, first(input))
    output = map(T->Vector{T}(n), types)

    for i = 1:n
       @inbounds for (j, x) in enumerate(input[i])
           (output[j])[i] = x
       end
    end
    return (output...)
end
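
(For readers on current Julia: a hedged sketch of the same idea in post-1.0 syntax, with a placeholder name `unzip_naive`; it assumes a non-empty vector whose elements are all tuples of the same layout.)

# Hedged sketch only, not the eventual Base proposal.
function unzip_naive(input::Vector)
    n = length(input)
    types = map(typeof, first(input))               # e.g. (Int64, String, Symbol)
    output = map(T -> Vector{T}(undef, n), types)   # one preallocated vector per tuple slot
    for i in 1:n
        @inbounds for (j, x) in enumerate(input[i])
            output[j][i] = x
        end
    end
    return output                                   # a tuple of vectors
end

unzip_naive([(1, "a", :meow), (2, "b", :woof), (3, "c", :doh!)])
# -> ([1, 2, 3], ["a", "b", "c"], [:meow, :woof, :doh!])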
StefanKarpinski commented 8 years ago

We definitely need an unzip function in Base.

ivarne commented 8 years ago

Or even simpler

unzip(a) = zip(a...) 
yuyichao commented 8 years ago

No

StefanKarpinski commented 8 years ago

Trying the zip(...) thing on any non-toy problem shows pretty quickly that it's not so good, unfortunately.

tlnagy commented 8 years ago

I just ran into this issue when I absentmindedly used the Python-esque zip(arr...) where length(arr)=6000. It was not pretty. A dedicated unzip would be nice. Maybe even throw a warning that you shouldn't splat anything nontrivial.

yuyichao commented 8 years ago

Maybe even throw a warning that you shouldn't splat anything nontrivial.

As with almost all other performance tips in Julia, doing this is perfectly fine and sometimes wanted for various reasons. (OK, splatting 6000 elements is a bit crazy, but splatting an unknown number of elements can be useful....) The only thing that matters is not to put it in a performance-critical path.

It looks like this is not covered in the performance tips?

tlnagy commented 8 years ago

As with almost all other performance tips in Julia, doing this is perfectly fine and sometimes wanted for various reasons. (OK, splatting 6000 elements is a bit crazy, but splatting an unknown number of elements can be useful....) The only thing that matters is not to put it in a performance-critical path.

Yeah, a warning is a bad idea.

It looks like this is not covered in the performance tips?

I think this should also be marked under differences from other languages. People coming from Python are probably used to the zip(*arr) syntax, so zip(arr...) might be a tempting direct translation.

hayd commented 8 years ago

You can dispatch only on arrays of tuples:

function unzip{T<:Tuple}(A::Array{T})
    res = map(x -> x[], T.parameters)
    res_len = length(res)
    for t in A
        for i in 1:res_len
            push!(res[i], t[i])
        end
    end
    res
end
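
(A rough post-1.0 translation of the same dispatch-on-arrays-of-tuples idea, hedged: `unzip_tuples` is a placeholder name, it relies on the internal `T.parameters` field, and it assumes the tuple type has no Vararg component.)

# Hedged sketch: one empty, correctly typed vector per tuple slot, then push into them.
function unzip_tuples(A::AbstractArray{T}) where {T<:Tuple}
    res = map(S -> S[], (T.parameters...,))
    for t in A
        for i in 1:length(res)
            push!(res[i], t[i])
        end
    end
    return res
end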
StefanKarpinski commented 7 years ago

Bump, we still need an unzip function.

StefanKarpinski commented 7 years ago

BUMP if anyone is bored...

JeffreySarnoff commented 7 years ago

function unzip(zipped::Base.Iterators.Zip2{Array{T1,1}, Array{T2,1}}) where {T1,T2}
    n = length(zipped)
    vec1 = Vector{T1}(n)
    vec2 = Vector{T2}(n)
    i = 1
    for pair in enumerate(zipped)
        vec1[i], vec2[i] = pair[2]
        i = i+1
    end
    return vec1, vec2
end
JeffreySarnoff commented 7 years ago

Maybe <= 1.5x [slower than the above] is this simpler version:

function unzip(zipped::Base.Iterators.Zip2{Array{T1,1}, Array{T2,1}}) where {T1,T2}
    return map(first,zipped), map(last,zipped)
 end
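
(Zip2 was an internal type of older Julia; on 1.x the same map-based idea works for any iterator of pairs. A hedged sketch with a placeholder name, which traverses the input twice:)

# Hedged sketch: two-column unzip via map; iterates `zipped` twice.
unzip2(zipped) = (map(first, zipped), map(last, zipped))

unzip2(zip([1, 2, 3], ["a", "b", "c"]))   # -> ([1, 2, 3], ["a", "b", "c"])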
martinholters commented 7 years ago

unzip(zipped::Base.Iterators.Zip2) = (zipped.a, zipped.b)

But all those differ in how they treat length(a) != length(b). I think I'd favor the behavior of the first.

This trivial case apart, the desired semantics are not absolutely clear to me: Given an iterator itr, should unzip(itr) return an iterator unzipped, that generates iterators unzipped[1] to unzipped[N], where N is the length of the iterators generated by itr? What if the first 10000 values generated by itr are Tuples of length N==3, but then comes a Tuple of length 2? Throw when iterating at least one unzipped[n] up to that point? Or eagerly collect itr the moment unzip is called?

Think about

unzip(f(n) for n in Base.Iterators.countfrom(0))

for different fs to get an idea of my open questions.
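
(As a hedged illustration of the "iterator of iterators" option, with hypothetical names `lazy_unzip` and `f`: each unzipped component could itself just be a generator over itr, which means consuming several components re-runs itr from the start.)

# Hedged sketch: a "lazy unzip" as N generators over the same iterator.
lazy_unzip(itr, N) = ntuple(i -> (t[i] for t in itr), N)

f(n) = (n, n^2)
a, b = lazy_unzip((f(n) for n in Base.Iterators.countfrom(0)), 2)
collect(Iterators.take(b, 3))   # [0, 1, 4]; each component iterates itr independently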

spalato commented 6 years ago

There seem to be multiple use cases for an unzip, with different requirements.

A common case involves Arrays of Tuples, in which the size is known and the array shape must be preserved. Some suggestions here do not obey this constraint.

I have come up with a solution for this specific case using Base.Cartesian. It's up to twice as fast as the commonly suggested alternative for NTuples, and more than 10x faster for heterogeneous tuples. See: https://github.com/spalato/Destruct.jl Feel free to use and steal.

bramtayl commented 5 years ago

Can we get this for generalized iterators and not just arrays? Basically just a collect_many modeled after collect?

bramtayl commented 5 years ago

I could give it a stab? The trickiest bit (I think) will be getting an equivalent of default_eltype for zero-length, eltype-unknown iterators.

StefanKarpinski commented 5 years ago

Go for it, @bramtayl!

bramtayl commented 5 years ago

Tada, this seems to work. Great excuse not to finish my grading.


# Note: this sketch extends Base functions and leans on a few Base internals,
# so roughly the following imports are assumed:
import Base: iterate, eltype, axes, size, length, getindex, setindex!, push!,
    similar, collect, IteratorEltype, IteratorSize, IndexStyle,
    _collect, _similar_for
using Base: @propagate_inbounds, tail, to_shape, OneTo, DimOrInd,
    SizeUnknown, HasLength, HasShape, Generator
using Base.Iterators: Filter

struct Unzip{Iterator}
    iterator::Iterator
end

iterate(iterator::Unzip) = iterate(iterator.iterator)
iterate(iterator::Unzip, state) = iterate(iterator.iterator, state)

IteratorEltype(iterator::Unzip) = IteratorEltype(iterator.iterator)
eltype(iterator::Unzip) = eltype(iterator.iterator)

IteratorSize(iterator::Unzip) = IteratorSize(iterator.iterator)
axes(iterator::Unzip) = axes(iterator.iterator)
size(iterator::Unzip) = size(iterator.iterator)
length(iterator::Unzip) = length(iterator.iterator)

struct ModelArray{ElementType, NumberOfDimensions, Model, Rest <: Tuple} <:
    AbstractArray{ElementType, NumberOfDimensions}
    model::Model
    rest::Rest
end

model_array(model, rest...) =
    ModelArray{
        Tuple{eltype(model), eltype.(rest)...},
        ndims(model),
        typeof(model),
        typeof(rest)
    }(model, rest)

axes(array::ModelArray) = axes(array.model)
size(array::ModelArray) = size(array.model)

IndexStyle(array::ModelArray) = IndexLinear()

@propagate_inbounds getindex(array::ModelArray, index::Int) = (
    getindex(array.model, index),
    getindex.(array.rest, index)...
)

@propagate_inbounds setindex!(array::ModelArray, value::Tuple, index::Int) = (
    setindex!(array.model, value[1], index),
    setindex!.(array.rest, tail(value), index)...
)

push!(array::ModelArray, value::Tuple) = (
    push!(array.model, value[1]),
    push!.(array.rest, tail(value))
)

# TODO: use fieldtypes instead of value_field_types in a future with more constant propagation
value_field_types(a_type) =
    ntuple(index -> Val(fieldtype(a_type, index)), fieldcount(a_type))

function similar(
    array::Unzip,
    ::Type{ElementType},
    dims::Dims
) where {ElementType, NumberOfDimensions}

    inner_array(::Val{InnerElementType}) where InnerElementType =
        Array{InnerElementType}(undef, dims)
    model_array(inner_array.(value_field_types(ElementType))...)
end

# Unzip needs to replicate AbstractArray similar machinery
similar(a::Unzip, ::Type{T}) where {T}                     = similar(a, T, to_shape(axes(a)))
similar(a::Unzip{T}, dims::Tuple) where {T}                = similar(a, T, to_shape(dims))
similar(a::Unzip{T}, dims::DimOrInd...) where {T}          = similar(a, T, to_shape(dims))
similar(a::Unzip, ::Type{T}, dims::DimOrInd...) where {T}  = similar(a, T, to_shape(dims))
similar(a::Unzip, ::Type{T}, dims::Tuple{Union{Integer, OneTo}, Vararg{Union{Integer, OneTo}}}) where {T} = similar(a, T, to_shape(dims))

similar(array::ModelArray, ::Type{ElementType}, dims::Dims) where ElementType =
    similar(Unzip(nothing), ElementType, dims)

_similar_for(c::Unzip{Nothing}, T, itr, ::SizeUnknown) =
    similar(c, T, 0)
_similar_for(c::Unzip{Nothing}, T, itr, ::HasLength) =
    similar(c, T, Int(length(itr)::Integer))
_similar_for(c::Unzip{Nothing}, T, itr, ::HasShape) =
    similar(c, T, axes(itr))

collect(iterator::Unzip) = _collect(
    Unzip(nothing),
    iterator,
    IteratorEltype(iterator),
    IteratorSize(iterator)
)

# Examples

Generator(x -> (x, x + 1.0), [1]) |> Unzip |> collect
Generator(x -> (x, x + 1.0), [1, missing]) |> Unzip |> collect
zip([1], [1.0]) |> Unzip |> collect
[(1, 1.0)] |> Unzip |> collect
Filter(x -> x[1] == x[2], [(1, 1.0)]) |> Unzip |> collect
bramtayl commented 5 years ago

The above strategy turns out to be extremely sub-optimal for EltypeUnknown iterators, but it still works. I'm not sure I have the programming wherewithal to design a better system.

bramtayl commented 5 years ago

I'm thinking it would probably just need specialized ::ModelArray methods for grow_to! and collect_to! to be efficient; I'll see what I can cook up.

bramtayl commented 5 years ago

Ok, now that #30076 has gone through, here are the last two methods to make this efficient (I think):

import Base: setindex_widen_up_to

maybe_setindex_widen_up_to(dest::AbstractArray{T}, el, i) where T = 
    if isa(el, T)
        @inbounds dest[i] = el
        dest
    else 
        setindex_widen_up_to(dest, el, i)
    end

setindex_widen_up_to(dest::ModelArray, el, i) = 
    model_array(maybe_setindex_widen_up_to.((dest.model, dest.rest...), el, i)...)

Should this go into Base? Live as a package? If so, any comments before I cook up a PR?

bramtayl commented 5 years ago

I'm leaving the code here for now https://github.com/bramtayl/Unzip.jl

JeffreySarnoff commented 5 years ago

Is there a reason that it does not build?

bramtayl commented 5 years ago

It should only work on master

oxinabox commented 5 years ago

Can this become a PR, and then get merged? Some docs would also need to be updated

bramtayl commented 5 years ago

I'm not sure I have the programming chops to make a PR. Or rather, I have some questions: what file should the code go into? What modifications do I need to make to take into account bootstrap order? Do I need boundschecking in the ModelArray constructor to make sure all the inputs are the same size? How can I adjust the code to take into account offset axes? What would be a good way to effectively benchmark this? How can I tell if the widening optimizations are working as intended?

bramtayl commented 5 years ago

Also, I've run into a bit of a problem. Not sure what to do in the case of EltypeUnknown, SizeUnknown iterators where @default_eltype yields Union{}. We need to know at least the length of the tuple to start unzipping... I suppose, for an inference-free implementation, we'd have to require the tuple length explicitly... ugh...

mschauer commented 5 years ago

I think it is okay to make some assumptions about the iterator that is to be unzipped (e.g. that it does not return tuples of different lengths), and it would be okay to give an error message in case this contract is not fulfilled.

o314 commented 5 years ago

Well, well. Take that, Python; write this in Julia:

s1,s2 = (1,2,3),('a','b','c')
s = tuple(zip(s1,s2)...)

@test tuple(zip(s...)...) == (s1,s2)
Test Passed

so

unzip(z) = tuple(zip(z...)...)

@test (unzip ∘ zip)(s1,s2) == (s1,s2)
Test Passed

One more impl, quickly usable. Let's call it the haiku version of a Julia unzip.

PS: edited; v1 with an array was removed, as it was not type stable. v2 with a tuple seems ok.

KristofferC commented 5 years ago

https://github.com/JuliaLang/julia/issues/13942#issuecomment-155668226 and the comment below.

StefanKarpinski commented 5 years ago

We have an implementation; all that's required is for someone to make a PR to add it to Base. The issue is already tagged as "good first issue" and "help wanted" so hopefully someone feels like trying it.

o314 commented 5 years ago

@KristofferC

Sorry, the code was so short it passed under my radar.

using Test
unzip(z) = tuple(zip(z...)...)

s,t = [1:1_000_000...],[1:1_000_000...];
z = zip(s,t) |> collect;
@test_throws StackOverflowError unzip(z)

s,t = [1:100...],[1:100...];
z = zip(s,t) |> collect;
@time unzip(z);
# 4.105195 seconds (26.41 M allocations: 1.021 GiB, 17.28% gc time)

It's a useless toy.
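
(For contrast, even a naive broadcast over the collected pairs avoids the splatting blow-up entirely; hedged sketch, exact timings omitted since they vary by machine:)

s, t = [1:1_000_000...], [1:1_000_000...];
z = zip(s, t) |> collect;
@time (first.(z), last.(z));   # no splatting involved; completes quickly with few allocations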

StefanKarpinski commented 5 years ago

Yes, that's been noted:

Trying the zip(...) thing on any non-toy problem shows pretty quickly that it's not so good, unfortunately.

I was referring to @bramtayl's Unzip package, which has a real implementation... but also appears to have been deleted from GitHub. So I guess we do not actually have an implementation. @bramtayl, is there any chance you can resurrect your Unzip code? Was there some reason you deleted it?

o314 commented 5 years ago

Of course, that was clearly understandable. I was just answering Kristoffer about the splat option. Didn't figure the numbers were so bad.

Thanks for pointing out the right track. It is still a bit hard for me at this point to contribute to Base. Good luck to other people...

StefanKarpinski commented 5 years ago

No worries. I didn't bother measuring the splatting option because I knew it was going to be terrible 😬

o314 commented 5 years ago

OK. It is, however, surprisingly good in Python:

def unzip(z): return zip(*z)

def bench_zip_unzip():
    z = [*zip(range(1_000_000), range(1_000_000))]
    u = [*unzip(z)]

import timeit
timeit.timeit(bench_zip_unzip, number=1)
# 0.4816088059997128 second
simonbyrne commented 5 years ago

One other thing that would be useful would be to allow it to take a function as a first argument, so that you could do, e.g.,

s,c = unzip(sincos, x)

for a vector (or array) x.
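
(One hedged way to support that form, assuming a single-argument unzip(itr) exists at all, is simply to route the function through a generator:)

# Hedged sketch: the two-argument form forwards to the (future) single-argument unzip.
unzip(f, itr) = unzip(f(x) for x in itr)

# so that s, c = unzip(sincos, x) materializes the sines and cosines of x in one pass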

StefanKarpinski commented 5 years ago

Copying some thoughts from a thread on Slack, the utility of unzip is that it works like collect does on a single iterable, materializing it into a concrete vector but producing as many vectors as there are elements in each tuple that the iterable yields. The iterable must yield the same sized tuple each time; it need not, however, yield the same element types each time; unzip should handle this the same way that collect or comprehensions do (they're slightly different, so I guess that's a design choice), for each vector that it outputs. So the result of unzip(itr) should be the same as

ntuple(length(first(itr))) do i
    [t[i] for t in itr]
end

except that it only consumes itr a single time and it should error if any of the tuples has a different length than the first one. That will be a bit tricky to write 🤔. Mocked up example:

julia> itr = [(1,2,3),(3,1.5,'x')]
2-element Array{Tuple{Int64,Real,Any},1}:
 (1, 2, 3)
 (3, 1.5, 'x')

julia> unzip(itr)
([1, 3], Real[2, 1.5], Any[3, 'x'])
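
(A deliberately naive eager sketch of those semantics, with a hypothetical name and plain typejoin widening so that it reproduces the mocked output above; Base's collect machinery widens slightly differently, and this makes no attempt at being fast:)

function unzip_eager(itr)
    y = iterate(itr)
    y === nothing && throw(ArgumentError("cannot unzip an empty iterator"))
    t, state = y
    outs = ntuple(i -> [t[i]], length(t))             # one 1-element vector per slot
    while (y = iterate(itr, state)) !== nothing
        t, state = y
        length(t) == length(outs) ||
            throw(ArgumentError("tuples have unequal lengths"))
        outs = map(outs, t) do o, x                   # widen a column only when needed
            x isa eltype(o) ? push!(o, x) :
                push!(convert(Vector{typejoin(eltype(o), typeof(x))}, o), x)
        end
    end
    return outs
end

unzip_eager([(1, 2, 3), (3, 1.5, 'x')])   # -> ([1, 3], Real[2, 1.5], Any[3, 'x'])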
bramtayl commented 5 years ago

See #30987

StefanKarpinski commented 5 years ago

Thanks, @bramtayl!

o314 commented 5 years ago

It would be good if we could unzip Array{Pair} and Base.Iterators.Pairs too, e.g.

unzip([k=>v for (k,v) in zip('a':'c',1:3)]) == unzip(zip('a':'c',1:3))
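
(One hedged way to get that, on top of whatever tuple-based unzip eventually lands, would be a pair of small forwarding methods; `unzip` here is the hypothetical single-argument version discussed above.)

# Hedged sketch: treat each Pair as a 2-tuple and reuse the tuple path.
unzip(v::AbstractArray{<:Pair}) = unzip(Tuple(p) for p in v)
unzip(ps::Base.Iterators.Pairs) = unzip(Tuple(p) for p in ps)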
JeffreySarnoff commented 5 years ago

bump

bramtayl commented 5 years ago

https://gist.github.com/bramtayl/2d6d323582f7c3158f9a1b5299997d68

Not skilled enough to make a PR, but I think this should be a pretty robust implementation

ghost commented 5 years ago

Would this be considered an acceptable solution?

using Base.Cartesian: @nexprs, @ntuple   # the generated method below uses these macros

unzip(z::Iterators.Zip) = z.is
unzip(itr) = unzip(collect(itr))
unzip(::AbstractArray) = error("Can only unzip array of tuples.")

@generated function unzip(a::AbstractArray{T}) where {T<:Tuple}
    types = T.types
    isempty(types) && return :(())
    types[end] <: Vararg && error("Tuples in the array should have the same length.")
    n = length(types)
    quote
        @nexprs $n i -> out_i = similar(a,$types[i])
        @inbounds for j in eachindex(a)
            @nexprs $n i -> out_i[j] = a[j][i]
        end
        return @ntuple $n out
    end
end

90% of this implementation is taken from the package Destruct.jl by @spalato, who commented earlier in this thread.

This implementation is maximally efficient for dense arrays of tuples (which was the motivating example for this issue). It is slightly less efficient for arrays with non-standard indexing, due to reliance on eachindex.

For the Zip iterator, it just returns the underlying iterators without truncating them to a common size. I.e. a,b,c = unzip(zip(a,b,c)) is a no-op.

For other iterators and generators, it first collects the output into an array and unzips that. This is not ideal, since it requires twice as much memory, but I simply don't have enough experience with iterator internals to implement it better. (This is also why I haven't tried to resurrect @bramtayl's implementation: I simply couldn't wrap my head around it yet :smile:).

AFAICT, the above code fulfils all the functional requirements set out by @StefanKarpinski above. If this implementation is deemed acceptable, despite using generated functions and sacrificing efficiency in some cases, I can add tests and documentation and open a PR.

bramtayl commented 5 years ago

The code in the gist was written relatively recently, so I think it should be copy-pastable? It sounds like it's a lot more fully featured than what you're proposing (works with arbitrary iterators, returning a variable number of items). Just need someone to figure out the PR details (something I never seem to be able to do successfully)

bramtayl commented 5 years ago

I'm also happy to explain whatever is helpful. Lispy-tuple-recursion and iterative widening are both very useful to know how to do regardless.

ghost commented 5 years ago

It sounds like it's a lot more fully featured than what you're proposing (works with arbitrary iterators, returning a variable number of items).

Iterators are supported, but with some overhead. I think the main difference between the two variants is that you produce arrays with missing values if the tuples have different lengths. Stefan's comment implies that an error should be thrown in this case, but I can see how using missing values could sometimes be more useful. If this behavior is preferred, I think it can be implemented in the generated-function-based approach as well.

I'm also happy to explain whatever is helpful. Lispy-tuple-recursion and iterative widening are both very useful to know how to do regardless.

Thanks, I might have to take you up on that offer yet :smile:. But I really don't feel qualified to finish what you started there. I just spent several hours trying to understand your implementation and still only have a general idea of how it works. And that's without even touching on the missing and widening stuff.

I also have to ask, why is this style preferred in Julia (in array library code, at least)? As far as I can tell, the end result is that a specialized method is produced for every combination of tuple and container type that is invoked in user code. Which is more or less the same as what a generated function would do. On the other hand, a generated function does exactly what it advertises, while the other solution goes back and forth through twenty layers of abstraction (this is not a metaphor; I think the call chain really is this deep, even though it all gets inlined in the end). I'm sorry if this sounds argumentative, and I'm also starting to think this may be the wrong place to ask such questions, but I'm genuinely curious why layered abstractions relying on vararg splatting on every other step are preferred to gen-funcs in general.

bramtayl commented 5 years ago

Two things:

1. Collecting into an array first is unworkable when some of the columns are small unions, because the eltype will be Any.
2. Because Julia is a dynamic language, there is no return type. So you can't look at an iterator and guess what types each of the columns will end up being, or, for that matter, how many columns there are going to be in the first place.

ghost commented 5 years ago
  1. Collecting into an array first is unworkable when some of the columns are small unions, because the eltype will be Any.

You're right about that, of course. Though I think you are holding yourself to a higher standard than the standard library (no pun intended). Maybe it would be preferable to use your proposed changes to improve collect, so that it doesn't widen types more than necessary? Then less skilled people, such as myself, could get away with falling back on it without too high a cost :smile:.

Also, if the "iterator" passed to unzip is an actual array, there is the same problem: [(1,"a"), (2,3)] infers as Array{Tuple{Int64,Any},1}. Does your implementation tighten the output type to Tuple{Array{Int64,1}, Array{Union{Int64,String},1}} in this case as well?

bramtayl commented 5 years ago

Yes