Open ZacCranko opened 8 years ago
Sorry, I didn't mean for that to sound short. I really appreciate all the work you're doing, and happy to collaborate on any solutions.
That's impressive. Can you explain, in general terms, how you do it? In my understanding that would require backtracking and converting some of the columns when you encounter a tuple with an unexpected length or type. And the result type could not be determined until the operation is complete, no?
Yes and yes. `push_widen` and `setindex_widen_up_to` already do this in `collect` for Base (and I outsource to them). Whether or not the result type can be determined depends on how smart inference is being that day, but the short answer is that because Julia is a dynamic language, it doesn't matter anyway.
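To make the widening idea concrete, here is a minimal sketch of a single-pass unzip that replaces a column with a wider one when an element does not fit. The names `push_widening` and `unzip_widening` are hypothetical and this is not the code from Base or from any package; it uses the exported `typejoin` rather than Base's internal helpers, and assumes every tuple has the same length.

```julia
# Hypothetical helper (not Base.push_widen): append `x` to `col`,
# allocating a wider vector when `x` does not fit the current eltype.
function push_widening(col::Vector{T}, x) where {T}
    if x isa T
        push!(col, x)
        return col
    end
    wider = Vector{typejoin(T, typeof(x))}()
    append!(wider, col)   # the "backtracking" step: copy what was collected so far
    push!(wider, x)
    return wider
end

# Hypothetical single-pass unzip; assumes all tuples have equal length.
function unzip_widening(itr)
    nxt = iterate(itr)
    nxt === nothing && return ()
    row, state = nxt
    cols = map(x -> [x], row)            # one vector per tuple field
    nxt = iterate(itr, state)
    while nxt !== nothing
        row, state = nxt
        cols = map(push_widening, cols, row)   # a column may change type here
        nxt = iterate(itr, state)
    end
    return cols
end

cols = unzip_widening([(1, "a"), (2.5, "b")])
```

The type of `cols` can change between loop iterations, which is why the result type is in general only known once the pass is complete.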
But I do have a doctest for inference:
julia> stable(x) = (x, x + 0.0, x, x + 0.0, x, x + 0.0);
julia> @inferred unzip(Generator(stable, 1:4))
So, in some cases, inference can figure out what's going on just fine.
But in general, this is not type-stable, I guess, especially with arrays rather than generators?
Yes. Which is ok, that's par for the course in Julia.
Thanks for the explanations! Gives me a lot to think about.
Last question: Do you know why this behavior is not implemented for `collect`? Is it just that nobody got around to it yet, or is there some other hidden cost to consider?
I'm not sure I'm qualified to answer that question, but I think #25553 is the place to start looking
The answer is here https://github.com/JuliaLang/julia/pull/26501#issuecomment-373957902 I think
Hmm... correct me if I'm wrong, but #26501 (comment) and #26501 (comment) seem to suggest that diverging tuple fields widening to `Any` rather than `Union` is by design (because of compile times). So the tradeoff here is partly in how much we want to tax the compiler.
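As a small illustration of the two widening strategies being compared, the sketch below contrasts the exported `typejoin` with an explicit `Union`. Whether `collect` takes exactly this path internally is what the linked comments discuss; the snippet only shows what the two candidate result types look like.

```julia
a = Tuple{Int, Int}
b = Tuple{Int, String}

joined  = typejoin(a, b)   # field-wise join: the diverging field becomes Any
unioned = Union{a, b}      # keeps both concrete tuple types side by side

scalar_join = typejoin(Int, String)
```

The joined type is a single, less precise tuple type; the union is more precise but grows with every new combination of field types, which is where the compile-time concern comes from.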
@StefanKarpinski, could you chime in with an opinion?
Notice that there is some mismatch between what we expect from iterator protocol v1
And how unzip works internally
Without seekable io / indexable iterator and/or buffering, that could never fit.
I took a stab at simplifying the code in my gist. I made a little headway but not much. If it's any comfort, many of the functions (`map_unrolled`, `partial_map`, `zip_missing`) would fit nicely into an unexported mini-module for unrolling tuple operations. I've been kicking around a plan for that for a long time.
It would be really nice to get this solved for 1.6.
@StefanKarpinski I remember that you had a full grasp of this -- is anything of deep import holding it back?
Only performance, IIRC.
@JeffreySarnoff, I may be misremembering some things, but I think there were some unresolved questions aside from performance:
* `missing`s?
* `Any`? `Union`?
* Should `unzip` of `zip` be the identity? (Rather than throwing an error if the input arrays are of different size, as `zip` currently does).

Maybe some other things too, but it's been a while since I read through the entire thread.
Note that these issues only really arise in `TypeUnknown` iterators. For homogeneous tuple arrays there is the Destruct package, which implements a simple and efficient `unzip` operation.
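For the homogeneous case, a minimal sketch might look like the following. The name `unzip_homogeneous` is hypothetical and this is not the Destruct or Unzip.jl implementation; it assumes the vector's element type is a concrete tuple type, so the column types can be read off up front.

```julia
# Hypothetical sketch; assumes `T` is a concrete tuple type such as Tuple{Int, Char}.
function unzip_homogeneous(v::AbstractVector{T}) where {T<:Tuple}
    return ntuple(fieldcount(T)) do i
        S = fieldtype(T, i)          # element type of the i-th column
        S[t[i] for t in v]           # typed comprehension, so empty input still gets the right eltype
    end
end

cols = unzip_homogeneous([(1, 'a'), (2, 'b'), (3, 'c')])
```

Because the input is an array, traversing it once per column is harmless here; that would not hold for a one-shot iterator.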
I wrote up https://github.com/bramtayl/Unzip.jl and it should be registered 3 days from now.
afaic a clear candidate after looking at `dump(zip(avec, bvec))`
* `unzip` that which is `zip`ped
unzip(x::Base.Iterators.Zip) = x.is
Feel free to open up a PR over there, but I'm not so sure about that method. The behavior in my version of `unzip` is to collect new vectors, not pass along old ones. `z.is` is also pretty easy to type if you already have a zip.
iknow uknow this is furiously obvious
* `unzip` that which is `zip`ped
unzip(x::Base.Iterators.Zip) = z.is
How is this obvious? It's one possible way to define `unzip` of `zip`, and has the (IMO major) disadvantage that `unzip` is not guaranteed to return arrays of equal length.
(it works better with `z.is` replaced by `x.is`)
It will return the arrays that were given to `zip`.
How is that a major disadvantage for `unzip`?
I conceived that its essential purpose is to recover what `zip` got.
Certainly, the unzip of a `collect`ed zip operates on equilength firsts and .. and lasts, irrespective of what `zip` got. And that is appropriate, as one can only process information that exists when the processing occurs. The most performant approaches to `unzip(x::Vector{Tuple{_}})` cannot be expected to perform as quickly as does doing `x.is`.
To forgo that very high throughput advancing some computation because it provides unequal length vectors iff unequal length vectors were given to `zip` seems unnecessary. Where that is undesirable, one may apply e.g. `rtrim_longer_vecs(unzip(zipped))` and obtain equi-sized vectors.
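A rough sketch of that proposal, under stated assumptions: `unzip_noop` and `trim_to_shortest` are hypothetical names (the latter standing in for the `rtrim_longer_vecs` mentioned above), and `is` is an internal field of `Base.Iterators.Zip` rather than public API.

```julia
# `is` is an internal field of Base.Iterators.Zip, not public API.
unzip_noop(z::Base.Iterators.Zip) = z.is

# Hypothetical stand-in for `rtrim_longer_vecs`: cut every column to the shortest length.
function trim_to_shortest(cols::Tuple)
    n = minimum(length, cols)
    return map(c -> c[1:n], cols)
end

z = zip(1:3, 10:10:50)
raw = unzip_noop(z)              # the original inputs, possibly of unequal length
trimmed = trim_to_shortest(raw)  # equal-length views of those inputs
```

The no-op version does no per-element work at all; the price is that `raw` reflects what `zip` was given rather than what iterating the zip would produce.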
@zzyxyzz I meant "obvious" in the sense that this is what `dump(zip(a,b))` shows. I had not known about the inner structure of our zips until today .. so, I revised that phrase: "afaic a clear candidate after looking at `dump(zip(avec, bvec))`".
> return arrays of equal length
good point
Maybe `parents` would be a good name for what `function (x::Base.Iterators.Zip) = x.is end` does
@JeffreySarnoff, I'm not saying that making `unzip(zip(...))` a no-op is necessarily a bad idea. It's just that `unzip` has a surprising number of loose ends and subtle trade-offs for something that is so simple in principle. This is marked as "good first issue", but it really isn't. Over the years several people have taken a stab at it (@bramtayl has come furthest), but it's no accident that there is still no generally accepted solution.
Maybe this issue should be renamed, following Julia tradition, "Taking unzip seriously". :stuck_out_tongue:
Registered!
For `unzip(x::Base.Iterators.Zip)` where the lengths of `x.is` are equal, would you rather return `x.is` or copies of the constituents?
@JeffreySarnoff
> For `unzip(x::Base.Iterators.Zip)` where the lengths of `x.is` are equal, would you rather return `x.is` or copies of the constituents?
My personal take on this:
In any decent implementation, `unzip` should call its source iterator exactly once for each input tuple. This forces `unzip` to construct the output arrays in lock-step; it cannot finish the first array before starting on the second. Thus, `unzip` cannot act as a lazy iterator in any reasonable sense. It is fundamentally a `collect`-like operation. So the safe, consistent thing to do would be to emulate the behavior of `collect` and always return freshly allocated arrays.
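The lock-step argument can be sketched as follows. `unzip_lockstep` is a hypothetical name, and the sketch assumes the iterator reports a concrete tuple `eltype`, so no widening is needed. A one-shot `Iterators.Stateful` source is used to show that a single pass suffices.

```julia
# Hypothetical sketch; assumes `eltype(itr)` is a concrete tuple type.
function unzip_lockstep(itr)
    T = eltype(itr)
    # Freshly allocated, correctly typed output columns.
    cols = ntuple(i -> Vector{fieldtype(T, i)}(), fieldcount(T))
    for row in itr                    # the source is advanced exactly once per tuple
        foreach(push!, cols, row)     # every column grows by one element per step
    end
    return cols
end

# A stateful source can only be consumed once, so a per-column traversal would not work.
src = Iterators.Stateful(zip(1:3, 'a':'c'))
cols = unzip_lockstep(src)
```

Building one column at a time with separate comprehensions would exhaust `src` on the first column and leave the second empty, which is the reason for growing all columns together.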
But I do admit that the temptation to special-case `unzip(::Base.Iterators.Zip)` is strong. And it wouldn't be the first time that safe, predictable behavior is sacrificed for performance in Julia. :smile:
No post in 2021 ? So let's give it a try now! HNY 2022 Unzip !
@ https://gist.github.com/o314/214e26c6fb70512b56597d633dd87e6f
see https://github.com/JuliaLang/julia/issues/13942
OK `unzip(a) = zip(a...)` fails in Julia
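For reference, a small sketch of what the splat trick does on a tiny input. `unzip_splat` is a hypothetical name. Each row becomes a separate argument to `zip`, so the number of arguments, and with it the type of the resulting `Zip`, grows with the number of rows, which is presumably why it does not scale.

```julia
unzip_splat(a) = zip(a...)      # one argument to `zip` per row

rows = [(1, 'a'), (2, 'b'), (3, 'c')]
cols = collect(unzip_splat(rows))   # a vector of tuples, not a tuple of vectors
```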
But Zen of python is great - ok, it's somewhat a lie, it's also great for julia !
So let's try to bring correct simple things first, and (maybe) complicated ones later or away.
FWIW, here is my good-enough, poor man's `unzip` (usable in prod, isn't it?):
using Test
import Base.Iterators as _I
unzip(s...) = unzip(collect(s))
unzip(vs::Vector{<:Vector}) =
let M=length(vs), N=mapfoldl(length, min, vs); # todo remove me when SVector is in Base
([vs[i][j] for i in 1:M] for j in 1:N)
end
unzip(a::Vector{<:Pair}) = [k for (k,_) in a], [v for (_,v) in a]
import Base.Iterators as _I
using Test
# zipdata(M,N) = let v=collect(1:M), vt=ntuple(N) do _; copy(v) end; vt end
data(M,N) = ntuple(M) do i; fill(i,N) end
data(N) = let ks=_I.take(_I.cycle('a':'z'), N), vs=(1:N...,); (k=>v for (k,v) in zip(ks,vs)) end
# unzip of vector
@test data(5,3) == ([1,1,1],[2,2,2],[3,3,3],[4,4,4],[5,5,5])
@test unzip(data(5,3)...) |> collect == ([1,2,3,4,5],[1,2,3,4,5],[1,2,3,4,5]) |> collect
# unzip of pair vector
@test data(5) |> collect == ('a'=>1, 'b'=>2, 'c'=>3, 'd'=>4, 'e'=>5) |> collect
@test unzip(data(5) |> collect) |> collect == (['a','b','c','d','e'], [1,2,3,4,5]) |> collect
# unzip of vector
julia> @time unzip(data(1000,3)...);
0.029086 seconds (42.07 k allocations: 2.766 MiB, 99.28% compilation time)
julia> @time unzip(data(1_000_000,3)...);
1.507531 seconds (4.04 M allocations: 223.797 MiB, 18.21% gc time, 72.32% compilation time)
julia> @time unzip(data(1_000_000,3)...);
0.294386 seconds (1.00 M allocations: 152.588 MiB)
julia> @time unzip(data(1000,50)...);
0.000727 seconds (1.01 k allocations: 531.922 KiB)
julia> @time unzip(data(1_000_000,50)...);
1.082615 seconds (1.00 M allocations: 518.799 MiB, 48.09% gc time)
julia> @time unzip(data(1_000_000,50)...);
0.527460 seconds (1.00 M allocations: 518.799 MiB)
# unzip of pair vector
julia> @time unzip(data(1000));
2.728774 seconds (166.12 k allocations: 10.524 MiB, 99.98% compilation time)
julia> @time unzip(data(1000));
0.000334 seconds (2.00 k allocations: 116.375 KiB)
julia> @time unzip(data(1_000_000) |> collect); # BUG wo collect
0.634841 seconds (3.50 M allocations: 178.888 MiB, 18.39% gc time, 57.41% compilation time)
Hmm... @o314, that's not working for me:
julia> unzip(zip(1:10, 1:10))
Base.Generator{UnitRange{Int64}, var"#7#9"{Vector{Vector{Base.Iterators.Zip{Tuple{UnitRange{Int64}, UnitRange{Int64}}}}}, Int64}}(var"#7#9"{Vector{Vector{Base.Iterators.Zip{Tuple{UnitRange{Int64}, UnitRange{Int64}}}}}, Int64}(Vector{Base.Iterators.Zip{Tuple{UnitRange{Int64}, UnitRange{Int64}}}}[[zip(1:10, 1:10)]], 1), 1:1)
julia> for i in unzip(zip(1:10, 1:10))
println(i)
end
Base.Iterators.Zip{Tuple{UnitRange{Int64}, UnitRange{Int64}}}[zip(1:10, 1:10)]
julia> x = collect(zip(1:5, 2:2:10))
5-element Vector{Tuple{Int64, Int64}}:
(1, 2)
(2, 4)
(3, 6)
(4, 8)
(5, 10)
julia> collect(unzip(x))
5-element Vector{Vector{Tuple{Int64, Int64}}}:
[(1, 2)]
[(2, 4)]
[(3, 6)]
[(4, 8)]
[(5, 10)]
Per 1.7.2, the `zip(...)` trick would not even scale to 1k!
$ julia --version
julia version 1.7.2
$ time julia -e 'zip(collect([3,2,5] for _ in 1:10)...)|>collect'
julia -e 'zip(collect([3,2,5] for _ in 1:10)...)|>collect' 0.38s user 0.09s system 98% cpu 0.474 total
$ time julia -e 'zip(collect([3,2,5] for _ in 1:1000)...)|>collect'
ERROR: StackOverflowError:
Stacktrace:
[1] _zip_iterate_interleave(xs1::NTuple{980, Tuple{Int64, Int64}}, xs2::Tuple{}, ds::NTuple{980, Missing})
@ Base.Iterators ./iterators.jl:368
[2] _zip_iterate_interleave (repeats 20 times)
@ ./iterators.jl:369 [inlined]
[3] _zip_iterate_all(is::NTuple{1000, Vector{Int64}}, ss::NTuple{1000, Tuple{}})
@ Base.Iterators ./iterators.jl:354
[4] iterate
@ ./iterators.jl:340 [inlined]
[5] copyto!(dest::Vector{NTuple{1000, Int64}}, src::Base.Iterators.Zip{NTuple{1000, Vector{Int64}}})
@ Base ./abstractarray.jl:890
[6] _collect(cont::UnitRange{Int64}, itr::Base.Iterators.Zip{NTuple{1000, Vector{Int64}}}, #unused#::Base.HasEltype, isz::Base.HasShape{1})
@ Base ./array.jl:655
[7] collect(itr::Base.Iterators.Zip{NTuple{1000, Vector{Int64}}})
@ Base ./array.jl:649
[8] |>(x::Base.Iterators.Zip{NTuple{1000, Vector{Int64}}}, f::typeof(collect))
@ Base ./operators.jl:966
julia -e 'zip(collect([3,2,5] for _ in 1:1000)...)|>collect' 130.60s user 3.81s system 99% cpu 2:14.47 total
Hell I miss an `unzip`!
Someday `unzip` will work. Eventually :sweat_smile:
note that since https://github.com/JuliaLang/julia/pull/50435 was merged, there is precedent to throw on follow-up operations when zipped iterators have unequal lengths. so many of the previous concerns upthread about the desire to & controversy of defining `unzip(z::Zip) = z.is` are addressed & no longer super relevant
w.r.t. returning the original iterators or copies, I definitely prefer not to copy. all `unzip` needs to promise is a tuple of iterators, and if the user wants to copy when the method would otherwise not, she can always write `collect.(unzip(...))`
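A brief sketch of that division of labour, assuming a hypothetical `unzip_lazy` that just hands back the internal `is` field of the `Zip` (not public API): the method itself shares the inputs, and the caller opts into copies by broadcasting `collect`.

```julia
# Hypothetical no-copy method; `is` is an internal field of Base.Iterators.Zip.
unzip_lazy(z::Base.Iterators.Zip) = z.is

a = [1, 2, 3]
b = [4.0, 5.0, 6.0]

shared = unzip_lazy(zip(a, b))   # the very same arrays that were zipped
copies = collect.(shared)        # broadcasting over the tuple yields fresh vectors
```

Mutating `shared[1]` would also mutate `a`, whereas `copies[1]` is independent of it.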
Hi there,
apologies if this has already been addressed somewhere, but is there a reason that there is no `unzip()` function in Base?
Ideally this would be a function that would take a `Vector{Tuple{ ... }}` and return a `Tuple{Vector, ..., Vector}` for output. E.g.
A naive implementation might be something like