JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

Base.unzip() #13942

Open ZacCranko opened 8 years ago

ZacCranko commented 8 years ago

Hi there,

apologies if this has already been addressed somewhere, but is there a reason that there is no unzip() function in Base?

Ideally this would be a function that would take a Vector{Tuple{ ... }} and return a Tuple{Vector, ..., Vector} for output. E.g.

julia> v = [(1,"a",:meow), (2,"b",:woof), (3,"c",:doh!)]; unzip(v)
([1,2,3],ASCIIString["a","b","c"],[:meow,:woof,:doh!])

A naive implementation might be something like

function unzip(input::Vector)
    n = length(input)
    types  = map(typeof, first(input))
    output = map(T->Vector{T}(n), types)

    for i = 1:n
       @inbounds for (j, x) in enumerate(input[i])
           (output[j])[i] = x
       end
    end
    return (output...)
end
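
The snippet above uses pre-1.0 syntax (`Vector{T}(n)` and `(output...)`). A minimal sketch of the same loop for Julia 1.x follows. The name `unzip_naive` is hypothetical, and the sketch assumes that every tuple has the same length and the same field types as the first one.

```julia
# Sketch only: a Julia 1.x port of the naive loop above, not a Base API.
function unzip_naive(input::Vector{<:Tuple})
    isempty(input) && throw(ArgumentError("cannot unzip an empty vector"))
    n = length(input)
    # One output column per field, typed after the fields of the first tuple.
    types = map(typeof, first(input))
    output = map(T -> Vector{T}(undef, n), types)
    for i in 1:n
        for (j, x) in enumerate(input[i])
            # Converts x to the column type, or throws if it does not fit.
            output[j][i] = x
        end
    end
    return output  # already a tuple, because `types` is a tuple
end

v = [(1, "a", :meow), (2, "b", :woof), (3, "c", :doh!)]
columns = unzip_naive(v)
```

Typing the columns from the first tuple keeps the loop simple, at the cost of failing on input whose later tuples differ from the first.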

bramtayl commented 5 years ago

Sorry, I didn't mean for that to sound short. I really appreciate all the work you're doing, and happy to collaborate on any solutions.

ghost commented 5 years ago

That's impressive. Can you explain, in general terms, how you do it? In my understanding that would require backtracking and converting some of the columns when you encounter a tuple with an unexpected length or type. And the result type could not be determined until the operation is complete, no?

bramtayl commented 5 years ago

Yes and yes. push_widen and setindex_widen_up_to already do this in collect for Base (and I outsource to them). Whether or not the result type can be determined depends on how smart inference is being that day, but the short answer is that because Julia is a dynamic language, it doesn't matter anyway.
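
The widening idea can be illustrated with a hand-written helper. `push_widen_sketch` is a hypothetical name and only mimics the general approach; it is not the code of Base's internal functions, and using `Base.typejoin` as the widening rule is an assumption made for this sketch.

```julia
# Sketch only: push an element, reallocating a wider column when it does not fit.
function push_widen_sketch(dest::Vector, el)
    if el isa eltype(dest)
        push!(dest, el)          # fast path: the element fits the current column
        return dest
    end
    # Assumed widening rule: closest common supertype of old and new element type.
    T = Base.typejoin(eltype(dest), typeof(el))
    widened = Vector{T}(dest)    # copy the existing elements into the wider column
    push!(widened, el)
    return widened
end

col = [1, 2]
col = push_widen_sketch(col, 3)      # fits, same vector is reused
col = push_widen_sketch(col, 4.5)    # does not fit, a wider vector is returned
```

Because the returned vector may be a different object, callers have to rebind the column after every push, which is why the result type is only known once the whole input has been consumed.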

bramtayl commented 5 years ago

But I do have a doctest for inference:

julia> stable(x) = (x, x + 0.0, x, x + 0.0, x, x + 0.0);

julia> @inferred unzip(Generator(stable, 1:4))

So, in some cases, inference can figure out what's going on just fine.

ghost commented 5 years ago

But in general, this is not type-stable, I guess, especially with arrays rather than generators?

bramtayl commented 5 years ago

Yes. Which is ok, that's par for the course in Julia.

ghost commented 5 years ago

Thanks for the explanations! Gives me a lot to think about.

Last question: Do you know why this behavior is not implemented for collect? Is it just that nobody got around to it yet, or is there some other hidden cost to consider?

bramtayl commented 5 years ago

I'm not sure I'm qualified to answer that question, but I think #25553 is the place to start looking

bramtayl commented 5 years ago

The answer is here https://github.com/JuliaLang/julia/pull/26501#issuecomment-373957902 I think

ghost commented 5 years ago

Hmm... correct me if I'm wrong, but #26501 (comment) and #26501 (comment) seem to suggest that diverging tuple fields widening to Any rather than Union is by design (because compile times). So the tradeoff here is partly in how much we want to tax the compiler.

@StefanKarpinski, could you chime in with an opinion?

o314 commented 5 years ago

Notice that there is some mismatch between what we expect from iterator protocol v1

[Figure: annex - unzip as iterators (dot graph)]

And how unzip works internally

[Figure: annex - unzip as parallelized-iterator (dot graph)]

Without seekable io / indexable iterator and/or buffering, that could never fit.

bramtayl commented 4 years ago

I took a stab at simplifying the code in my gist. I made a little headway but not much. If it's any comfort, many of the functions (map_unrolled, partial_map, zip_missing) would fit nicely into an unexported mini-module for unrolling tuple operations. I've been kicking around a plan for that for a long time.

oxinabox commented 4 years ago

It would be really nice to get this solved for 1.6.

JeffreySarnoff commented 4 years ago

@StefanKarpinski I remember that you had a full grasp of this -- is anything of deep import holding it back?

StefanKarpinski commented 4 years ago

Only performance, IIRC.

ghost commented 4 years ago

@JeffreySarnoff, I may be misremembering some things, but I think there were some unresolved questions aside from performance:

  1. What to do if tuples have different lengths?
    • Fill in missings?
    • Throw error?
  2. What to do if the same field has different types in different tuples?
    • Error?
    • Promote to Any?
    • Promote to smallest possible Union?
    • Promote to parent type?
    • Widen numeric types?
  3. Lazy or strict semantics?
    • In particular, do we want unzip of zip to be the identity? (Rather than throwing an error if the input arrays are of different size, as zip currently does).

Maybe some other things too, but it's been a while since I read through the entire thread.

Note that these issues only really arise for EltypeUnknown iterators. For homogeneous tuple arrays there is the Destruct package, which implements a simple and efficient unzip operation.
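
The options under question 2 can be made concrete for a field that is an `Int` in one tuple and a `Float64` in another. This is only an illustration of what each policy would give as a column element type; the variable names are hypothetical and no particular policy is implied to be the right one.

```julia
# Sketch only: candidate column element types for a field seen as Int and Float64.
A, B = Int, Float64

as_any     = Any                    # promote to Any
as_union   = Union{A, B}            # smallest possible Union
as_parent  = Base.typejoin(A, B)    # closest common supertype
as_widened = promote_type(A, B)     # numeric widening

# The same two values stored under each policy.
column_any     = as_any[1, 2.5]
column_union   = as_union[1, 2.5]
column_parent  = as_parent[1, 2.5]
column_widened = as_widened[1, 2.5]  # the Int is converted on insertion
```

Only the numeric-widening policy changes the stored values; the other three keep each element as it was and differ only in how much type information the column retains.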

bramtayl commented 4 years ago

#33324 seems pretty good to me still. If we want to simplify the code, I think the first step would be to solve #31909.

bramtayl commented 4 years ago

I wrote up https://github.com/bramtayl/Unzip.jl and it should be registered 3 days from now.

JeffreySarnoff commented 4 years ago

afaic a clear candidate after looking at dump(zip(avec, bvec))

bramtayl commented 4 years ago

Feel free to open up a PR over there, but I'm not so sure about that method. The behavior in my version of unzip is to collect new vectors, not pass along old ones. z.is is also pretty easy to type if you already have a zip.

ghost commented 4 years ago

iknow uknow this is furiously obvious

* `unzip` that which is `zip`ped
  unzip(x::Base.Iterators.Zip) = z.is

How is this obvious? It's one possible way to define unzip of zip, and has the (IMO major) disadvantage that unzip is not guaranteed to return arrays of equal length.

JeffreySarnoff commented 4 years ago

(it works better with z.is replaced by x.is) It will return the arrays that were given to zip. How is that a major disadvantage for unzip? I conceived that its essential purpose is to recover what zip got.

Certainly, the unzip of a collected zip operates on equilength firsts and .. and lasts, irrespective of what zip got. And that is appropriate, as one can only process information that exists when the processing occurs. The most performant approaches to unzip(x::Vector{Tuple{_}}) cannot be expected to perform as quickly as x.is does.

To forgo that very high throughput in some computation just because it yields unequal-length vectors iff unequal-length vectors were given to zip seems unnecessary. Where that is undesirable, one may apply e.g. rtrim_longer_vecs(unzip(zipped)) and obtain equi-sized vectors.
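
The trade-off discussed here can be seen in a small sketch. `zip_parents` is a hypothetical name, and `is` is an internal, undocumented field of `Base.Iterators.Zip`, so relying on it is an assumption that may not hold across Julia versions.

```julia
# Sketch only: return the iterators a Zip was built from, without copying.
zip_parents(z::Base.Iterators.Zip) = z.is

a = [1, 2, 3]
b = ["x", "y"]          # deliberately shorter than a
z = zip(a, b)

sources = zip_parents(z)
same_object    = sources[1] === a     # the original vector, not a copy
source_lengths = length.(sources)     # lengths of what zip was given
zipped_length  = length(collect(z))   # zip stops at the shortest input
```

The field access is essentially free, but the result reflects what `zip` received rather than what iterating the zip would produce, so the returned collections can differ in length.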

JeffreySarnoff commented 4 years ago

@zzyxyzz I meant "obvious" in the sense that this is what dump(zip(a,b)) shows. I had not known about the inner structure of our zips until today .. so, I revised that phrase: "afaic a clear candidate after looking at dump(zip(avec, bvec))".

mschauer commented 4 years ago

return arrays of equal length

good point

bramtayl commented 4 years ago

Maybe parents would be a good name for what function (x::Base.Iterators.Zip) = x.is end does

ghost commented 4 years ago

@JeffreySarnoff, I'm not saying that making unzip(zip(...)) a no-op is necessarily a bad idea. It's just that unzip has a surprising number of loose ends and subtle trade-offs for something that is so simple in principle. This is marked as "good first issue", but it really isn't. Over the years several people have taken a stab at it (@bramtayl has come furthest), but it's no accident that there is still no generally accepted solution.

ghost commented 4 years ago

Maybe this issue should be renamed, following Julia tradition, "Taking unzip seriously". :stuck_out_tongue:

bramtayl commented 4 years ago

Registered!

JeffreySarnoff commented 4 years ago

For unzip(x::Base.Iterators.Zip) where the lengths of x.is are equal, would you rather return x.is or copies of the constituents?

ghost commented 4 years ago

@JeffreySarnoff

For unzip(x::Base.Iterators.Zip) where the lengths of x.is are equal, would you rather return x.is or copies of the constituents?

My personal take on this:

In any decent implementation, unzip should call its source iterator exactly once for each input tuple. This forces unzip to construct the output arrays in lock-step; it cannot finish the first array before starting on the second. Thus, unzip cannot act as a lazy iterator in any reasonable sense. It is fundamentally a collect-like operation. So the safe, consistent thing to do would be to emulate the behavior of collect and always return freshly allocated arrays.
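
The lock-step construction described above can be sketched as follows. `unzip_lockstep` is a hypothetical name; the sketch assumes a non-empty iterator of tuples and takes the strict route of throwing when a later tuple has a different length or a field that does not fit its column.

```julia
# Sketch only: single-pass unzip that grows all columns in lock-step.
function unzip_lockstep(itr)
    next = iterate(itr)
    next === nothing && throw(ArgumentError("cannot unzip an empty iterator"))
    first_row, state = next
    first_row isa Tuple || throw(ArgumentError("expected an iterator of tuples"))
    # One freshly allocated vector per field, seeded with the first tuple.
    columns = map(x -> [x], first_row)
    next = iterate(itr, state)
    while next !== nothing
        row, state = next
        length(row) == length(columns) ||
            throw(DimensionMismatch("tuple length changed during iteration"))
        # Every column receives one element per source item.
        foreach(push!, columns, row)
        next = iterate(itr, state)
    end
    return columns
end

# Count how often the source is asked for an item.
calls = Ref(0)
gen = ((calls[] += 1; (i, i^2)) for i in 1:4)
result = unzip_lockstep(gen)
```

Each source item is requested once and distributed to all columns before the next one is fetched, which is why no column can be finished ahead of the others.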

But I do admit that the temptation to special-case unzip(::Base.Iterators.Zip) is strong. And it wouldn't be the first time that safe, predictable behavior is sacrificed for performance in Julia. :smile:

o314 commented 2 years ago

No post in 2021 ? So let's give it a try now! HNY 2022 Unzip !

Multidispatch-friendly zen of unzip in julia

@ https://gist.github.com/o314/214e26c6fb70512b56597d633dd87e6f
see https://github.com/JuliaLang/julia/issues/13942

OK, unzip(a) = zip(a...) fails in Julia. But the Zen of Python is great (ok, that's somewhat a lie: it's also great for Julia!). So let's try to bring correct, simple things first, and (maybe) complicated ones later, or not at all.

FWIW, here is my poor man's unzip, good enough to be usable in prod:

using Test
import Base.Iterators as _I

unzip(s...) = unzip(collect(s))
unzip(vs::Vector{<:Vector}) =
    let M=length(vs), N=mapfoldl(length, min, vs); # todo remove me when SVector is in Base
        ([vs[i][j] for i in 1:M] for j in 1:N)
    end
unzip(a::Vector{<:Pair}) = [k for (k,_) in a], [v for (_,v) in a]

TEST

import Base.Iterators as _I
using Test
# zipdata(M,N) = let v=collect(1:M), vt=ntuple(N) do _; copy(v) end; vt end
data(M,N) = ntuple(M) do i; fill(i,N) end
data(N) = let ks=_I.take(_I.cycle('a':'z'), N), vs=(1:N...,); (k=>v for (k,v) in zip(ks,vs)) end

VALIDITY TEST

# unzip of vector
@test data(5,3) == ([1,1,1],[2,2,2],[3,3,3],[4,4,4],[5,5,5])
@test unzip(data(5,3)...) |> collect == ([1,2,3,4,5],[1,2,3,4,5],[1,2,3,4,5]) |> collect

# unzip of pair vector
@test data(5) |> collect == ('a'=>1, 'b'=>2, 'c'=>3, 'd'=>4, 'e'=>5) |> collect
@test unzip(data(5) |> collect) |> collect == (['a','b','c','d','e'], [1,2,3,4,5]) |> collect

SPEED TEST

# unzip of vector
julia> @time unzip(data(1000,3)...);
  0.029086 seconds (42.07 k allocations: 2.766 MiB, 99.28% compilation time)

julia> @time unzip(data(1_000_000,3)...);
  1.507531 seconds (4.04 M allocations: 223.797 MiB, 18.21% gc time, 72.32% compilation time)

julia> @time unzip(data(1_000_000,3)...);
  0.294386 seconds (1.00 M allocations: 152.588 MiB)

julia> @time unzip(data(1000,50)...);
  0.000727 seconds (1.01 k allocations: 531.922 KiB)

julia> @time unzip(data(1_000_000,50)...);
  1.082615 seconds (1.00 M allocations: 518.799 MiB, 48.09% gc time)

julia> @time unzip(data(1_000_000,50)...);
  0.527460 seconds (1.00 M allocations: 518.799 MiB)

# unzip of pair vector
julia> @time unzip(data(1000));
  2.728774 seconds (166.12 k allocations: 10.524 MiB, 99.98% compilation time)

julia> @time unzip(data(1000));
  0.000334 seconds (2.00 k allocations: 116.375 KiB)

julia> @time unzip(data(1_000_000) |> collect);              # BUG wo collect
  0.634841 seconds (3.50 M allocations: 178.888 MiB, 18.39% gc time, 57.41% compilation time)

LilithHafner commented 2 years ago

Hmm... @o314, that's not working for me:

julia> unzip(zip(1:10, 1:10))
Base.Generator{UnitRange{Int64}, var"#7#9"{Vector{Vector{Base.Iterators.Zip{Tuple{UnitRange{Int64}, UnitRange{Int64}}}}}, Int64}}(var"#7#9"{Vector{Vector{Base.Iterators.Zip{Tuple{UnitRange{Int64}, UnitRange{Int64}}}}}, Int64}(Vector{Base.Iterators.Zip{Tuple{UnitRange{Int64}, UnitRange{Int64}}}}[[zip(1:10, 1:10)]], 1), 1:1)

julia> for i in unzip(zip(1:10, 1:10))
           println(i)
       end
Base.Iterators.Zip{Tuple{UnitRange{Int64}, UnitRange{Int64}}}[zip(1:10, 1:10)]

julia> x = collect(zip(1:5, 2:2:10))
5-element Vector{Tuple{Int64, Int64}}:
 (1, 2)
 (2, 4)
 (3, 6)
 (4, 8)
 (5, 10)

julia> collect(unzip(x))
5-element Vector{Vector{Tuple{Int64, Int64}}}:
 [(1, 2)]
 [(2, 4)]
 [(3, 6)]
 [(4, 8)]
 [(5, 10)]
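
The output above appears to follow from dispatch: a vector of tuples matches neither the `Vector{<:Vector}` nor the `Vector{<:Pair}` method, so it falls through to the varargs method and gets wrapped in a one-element vector. A minimal sketch of a dedicated method for that case is shown below. `unzip_tuples` is a hypothetical name, and the sketch assumes a non-empty vector whose tuples all have the same length.

```julia
# Sketch only: build one column per tuple field with a comprehension.
unzip_tuples(v::AbstractVector{<:Tuple}) =
    ntuple(j -> [t[j] for t in v], length(first(v)))

x = collect(zip(1:5, 2:2:10))
cols = unzip_tuples(x)
```

This iterates the input once per column rather than once overall, so it suits vectors but not one-shot iterators.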

complyue commented 2 years ago

As of 1.7.2, the zip(...) trick does not even scale to 1k!

$ julia --version
julia version 1.7.2
$ time julia -e 'zip(collect([3,2,5] for _ in 1:10)...)|>collect' 
julia -e 'zip(collect([3,2,5] for _ in 1:10)...)|>collect'  0.38s user 0.09s system 98% cpu 0.474 total
$ time julia -e 'zip(collect([3,2,5] for _ in 1:1000)...)|>collect'
ERROR: StackOverflowError:
Stacktrace:
 [1] _zip_iterate_interleave(xs1::NTuple{980, Tuple{Int64, Int64}}, xs2::Tuple{}, ds::NTuple{980, Missing})
   @ Base.Iterators ./iterators.jl:368
 [2] _zip_iterate_interleave (repeats 20 times)
   @ ./iterators.jl:369 [inlined]
 [3] _zip_iterate_all(is::NTuple{1000, Vector{Int64}}, ss::NTuple{1000, Tuple{}})
   @ Base.Iterators ./iterators.jl:354
 [4] iterate
   @ ./iterators.jl:340 [inlined]
 [5] copyto!(dest::Vector{NTuple{1000, Int64}}, src::Base.Iterators.Zip{NTuple{1000, Vector{Int64}}})
   @ Base ./abstractarray.jl:890
 [6] _collect(cont::UnitRange{Int64}, itr::Base.Iterators.Zip{NTuple{1000, Vector{Int64}}}, #unused#::Base.HasEltype, isz::Base.HasShape{1})
   @ Base ./array.jl:655
 [7] collect(itr::Base.Iterators.Zip{NTuple{1000, Vector{Int64}}})
   @ Base ./array.jl:649
 [8] |>(x::Base.Iterators.Zip{NTuple{1000, Vector{Int64}}}, f::typeof(collect))
   @ Base ./operators.jl:966
julia -e 'zip(collect([3,2,5] for _ in 1:1000)...)|>collect'  130.60s user 3.81s system 99% cpu 2:14.47 total

Hell I miss an unzip!
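
One way to sidestep the splat is to index into the inner vectors directly, so that no tuple with a thousand elements is ever built. This is a minimal sketch under the assumption that the input is a vector of 1-based vectors; `transpose_vectors` is a hypothetical name, and like `zip` it truncates to the shortest inner vector.

```julia
# Sketch only: transpose a vector of vectors without splatting into zip.
function transpose_vectors(vs::Vector{<:Vector})
    isempty(vs) && return Vector{eltype(eltype(vs))}[]
    n = minimum(length, vs)   # stop at the shortest inner vector
    return [[v[j] for v in vs] for j in 1:n]
end

rows = [[3, 2, 5] for _ in 1:1000]
cols = transpose_vectors(rows)
```

The number of inner vectors only affects the length of each output column, not the size of any type the compiler has to reason about.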

ParadaCarleton commented 1 year ago

Someday unzip will work. Eventually :sweat_smile:

adienes commented 7 months ago

note that since https://github.com/JuliaLang/julia/pull/50435 was merged, there is precedent to throw on follow-up operations when zipped iterators have unequal lengths. so many of the previous concerns upthread about the desire for, and controversy over, defining unzip(z::Zip) = z.is are addressed & no longer super relevant

w.r.t. returning the original iterators or copies, I definitely prefer not to copy. all unzip needs to promise is a tuple of iterators, and if the user wants a copy where the method would otherwise not make one, she can always write collect.(unzip(...))
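
The opt-in copy can be sketched with broadcasting over the returned tuple. `unzip_nocopy` is a hypothetical stand-in for the proposed method, and it relies on the internal `is` field of `Base.Iterators.Zip`, which is an assumption rather than a documented API.

```julia
# Sketch only: a non-copying unzip for Zip, plus an explicit copy on demand.
unzip_nocopy(z::Base.Iterators.Zip) = z.is

a = [1, 2, 3]
b = ['x', 'y', 'z']

shared = unzip_nocopy(zip(a, b))   # the original collections
copies = collect.(shared)          # broadcasting collect over the tuple copies each one

push!(copies[1], 4)                # mutating a copy leaves the original untouched
```

Broadcasting over a tuple yields a tuple, so the copied result has the same shape as the non-copying one.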