Open ZacCranko opened 8 years ago
We definitely need an unzip
function in Base.
Or even simpler
unzip(a) = zip(a...)
Trying the zip(...) thing on any non-toy problem shows that it's not so good pretty fast, unfortunately.
I just ran into this issue when I absentmindedly used the Python-esque zip(arr...)
where length(arr)=6000
. It was not pretty. A dedicated unzip
would be nice. Maybe even throw a warning that you shouldn't splat anything nontrivial.
Maybe even throw a warning that you shouldn't splat anything nontrivial.
Like almost all other performance tips in julia, doing so is perfectly fine and sometimes wanted for various reasons. (OK, splatting 6000 elements is a bit crazy, but splatting unknown size can be useful....). The only thing that matters is to not put it in performance critical path.
It looks like this is not covered in the performance tip?
Like almost all other performance tips in julia, doing so is perfectly fine and sometimes wanted for various reasons. (OK, splatting 6000 elements is a bit crazy, but splatting unknown size can be useful....). The only thing that matters is to not put it in performance critical path.
Yeah, a warning is a bad idea.
It looks like this is not covered in the performance tip?
I think this should also be marked under differences from other languages. People coming from Python are probably used to the zip(*arr)
syntax and zip(arr...)
might be a tempting target.
You can dispatch only on arrays of tuples:
function unzip{T<:Tuple}(A::Array{T})
res = map(x -> x[], T.parameters)
res_len = length(res)
for t in A
for i in 1:res_len
push!(res[i], t[i])
end
end
res
end
Bump, we still need and unzip function.
BUMP if anyone is bored...
function unzip(zipped::Base.Iterators.Zip2{Array{T1,1}, Array{T2,1}}) where {T1,T2}
n = length(zipped)
vec1 = Vector{T1}(n)
vec2 = Vector{T2}(n)
i = 1
for pair in enumerate(zipped)
vec1[i], vec2[i] = pair[2]
i = i+1
end
return vec1, vec2
end
maybe <= 1.5x is
function unzip(zipped::Base.Iterators.Zip2{Array{T1,1}, Array{T2,1}}) where {T1,T2}
return map(first,zipped), map(last,zipped)
end
unzip(zipped::Base.Iterators.Zip2) = (zipped.a, zipped.b)
But all those differ in how they treat length(a) != length(b)
. I think I'd favor the behavior of the first.
This trivial case apart, the desired semantics are not absolutely clear to me: Given an iterator itr
, should unzip(itr)
return an iterator unzipped
, that generates iterators unzipped[1]
to unzipped[N]
, where N
is the length of the iterators generated by itr
? What if the first 10000 values generated by itr
are Tuple
s of length N==3
, but then comes a Tuple
of length 2? Throw when iterating at least one unzipped[n]
up to that point? Or eagerly collect itr
the moment unzip
is called?
Think about
unzip(f(n) for n in Base.Iterators.countfrom(0))
for different f
s to get an idea of my open questions.
There seems to be multiple use cases for an unzip, with different requirements.
A common case is involves Arrays of Tuples, in which the size is known and array shape must be preserved. Some suggestions here do not obey this constraint.
I have come up with a solution for this specific case using Base.Cartesian
. It's up to twice as fast as the alternative commonly suggested alternative for NTuples, and more than 10x for heterogeneous tuples.
See: https://github.com/spalato/Destruct.jl
Feel free to use and steal.
Can we get this for generalized iterators and not just arrays? Basically just a collect_many modeled after collect?
I could give it a stab? The trickiest bit (I think) will be getting an equivalent of default_eltype for zero length eltype unknown iterators
Go for it, @bramtayl!
Tada, this seems to work. Great excuse not to finish my grading.
struct Unzip{Iterator}
iterator::Iterator
end
iterate(iterator::Unzip) = iterate(iterator.iterator)
iterate(iterator::Unzip, state) = iterate(iterator.iterator, state)
IteratorEltype(iterator::Unzip) = IteratorEltype(iterator.iterator)
eltype(iterator::Unzip) = eltype(iterator.iterator)
IteratorSize(iterator::Unzip) = IteratorSize(iterator.iterator)
axes(iterator::Unzip) = axes(iterator.iterator)
size(iterator::Unzip) = size(iterator.iterator)
length(iterator::Unzip) = length(iterator.iterator)
struct ModelArray{ElementType, NumberOfDimensions, Model, Rest <: Tuple} <:
AbstractArray{ElementType, NumberOfDimensions}
model::Model
rest::Rest
end
model_array(model, rest...) =
ModelArray{
Tuple{eltype(model), eltype.(rest)...},
ndims(model),
typeof(model),
typeof(rest)
}(model, rest)
axes(array::ModelArray) = axes(array.model)
size(array::ModelArray) = size(array.model)
IndexStyle(array::ModelArray) = IndexLinear()
@propagate_inbounds getindex(array::ModelArray, index::Int) = (
getindex(array.model, index),
getindex.(array.rest, index)...
)
@propagate_inbounds setindex!(array::ModelArray, value::Tuple, index::Int) = (
setindex!(array.model, value[1], index),
setindex!.(array.rest, tail(value), index)...
)
push!(array::ModelArray, value::Tuple) = (
push!(array.model, value[1]),
push!.(array.rest, tail(value))
)
# TODO: use fieldtypes instead of value_field_types in a future with more constant propagation
value_field_types(a_type) =
ntuple(index -> Val(fieldtype(a_type, index)), fieldcount(a_type))
function similar(
array::Unzip,
::Type{ElementType},
dims::Dims
) where {ElementType, NumberOfDimensions}
inner_array(::Val{InnerElementType}) where InnerElementType =
Array{InnerElementType}(undef, dims)
model_array(inner_array.(value_field_types(ElementType))...)
end
# Unzip needs to replicate AbstractArray similar machinery
similar(a::Unzip, ::Type{T}) where {T} = similar(a, T, to_shape(axes(a)))
similar(a::Unzip{T}, dims::Tuple) where {T} = similar(a, T, to_shape(dims))
similar(a::Unzip{T}, dims::DimOrInd...) where {T} = similar(a, T, to_shape(dims))
similar(a::Unzip, ::Type{T}, dims::DimOrInd...) where {T} = similar(a, T, to_shape(dims))
similar(a::Unzip, ::Type{T}, dims::Tuple{Union{Integer, OneTo}, Vararg{Union{Integer, OneTo}}}) where {T} = similar(a, T, to_shape(dims))
similar(array::ModelArray, ::Type{ElementType}, dims::Dims) where ElementType =
similar(Unzip(nothing), ElementType, dims)
_similar_for(c::Unzip{Nothing}, T, itr, ::SizeUnknown) =
similar(c, T, 0)
_similar_for(c::Unzip{Nothing}, T, itr, ::HasLength) =
similar(c, T, Int(length(itr)::Integer))
_similar_for(c::Unzip{Nothing}, T, itr, ::HasShape) =
similar(c, T, axes(itr))
collect(iterator::Unzip) = _collect(
Unzip(nothing),
iterator,
IteratorEltype(iterator),
IteratorSize(iterator)
)
# Examples
Generator(x -> (x, x + 1.0), [1]) |> Unzip |> collect
Generator(x -> (x, x + 1.0), [1, missing]) |> Unzip |> collect
zip([1], [1.0]) |> Unzip |> collect
[(1, 1.0)] |> Unzip |> collect
Filter(x -> x[1] == x[2], [(1, 1.0)]) |> Unzip |> collect
``
The above strategy turns out to be extremely sub-optimal for EltypeUnknown iterators, but it still works. I'm not sure I have the programming wherewithall to design a better system.
I'm thinking it would probably just need specialized ::ModelArray methods for grow_to! and collect_to! to be efficient, I'll see what I can cook up
Ok now that #30076 has gone through here are the last two methods to make this efficient (I think):
import Base: setindex_widen_up_to
maybe_setindex_widen_up_to(dest::AbstractArray{T}, el, i) where T =
if isa(el, T)
@inbounds dest[i] = el
dest
else
setindex_widen_up_to(dest, el, i)
end
setindex_widen_up_to(dest::ModelArray, el, i) =
model_array(maybe_setindex_widen_up_to.((dest.model, dest.rest...), el, i)...)
Should this go into base? Live as a package? If so, any comments before I cook up a PR?
I'm leaving the code here for now https://github.com/bramtayl/Unzip.jl
Is there a reason that it does not build?
It should only work on master
Can this become a PR, and then get merged? Some docs would also need to be updated
I'm not sure I have the programming chops to make a PR. Or rather, I have some questions: what file should the code go into? What modifications do I need to make to take into account bootstrap order? Do I need boundschecking in the ModelArray constructor to make sure all the inputs are the same size? How can I adjust the code to take into account offset axes? What would be a good way to effectively benchmark this? How can I tell if the widening optimizations are working as intended?
Also, I've run into a bit of a problem. Not sure what do in the case of EltypeUnknown, SizeUnknown iterators where @default_eltype
yields Union{}
. We need to know at least the length of the tuple to start unzipping... I suppose, for an inference free implementation, we'd have to require the tuple length explicitly... ugh...
I think it is okay to make some assumption on the iterator which is to be unzipped (e.g. not returning tuples of different length) and it would okay to give an error message in the case this contract is not fulfilled.
Well, well. take that in python, write this in julia:
s1,s2 = (1,2,3),('a','b','c')
s = tuple(zip(s1,s2)...)
@test tuple(zip(s...)...) == (s1,s2)
Test Passed
so
unzip(z) = tuple(zip(z...)...)
@test (unzip ∘ zip)(s1,s2) == (s1,s2)
Test Passed
One more impl, quickly usable. Let's call it the haiku version of julia unzip
PS: edited v1 w array removed, was not type stable. v2 w tuple seems ok
https://github.com/JuliaLang/julia/issues/13942#issuecomment-155668226 and the comment below.
We have an implementation; all that's required is for someone to make a PR to add it to Base. The issue is already tagged as "good first issue" and "help wanted" so hopefully someone feels like trying it.
@KristofferC
Sorry, code was so short it pass under my radar
using Test
unzip(z) = tuple(zip(z...)...)
s,t = [1:1_000_000...],[1:1_000_000...];
z = zip(s,t) |> collect;
@test_throws StackOverflowError unzip(z)
s,t = [1:100...],[1:100...];
z = zip(s,t) |> collect;
@time unzip(z);
# 4.105195 seconds (26.41 M allocations: 1.021 GiB, 17.28% gc time)
it's an useless toy
Yes, that's been noted:
Trying the
zip(...)
thing on any non-toy problem shows that it's not so good pretty fast, unfortunately.
I was referring to @bramtayl's Unzip package, which has a real implementation... but also appears to have been deleted from GitHub. So I guess we do not actually have an implementation. @bramtayl's is there any chance you can resurrect your Unzip code? Was there some reason you deleted it?
Of course, that's was clearly understandable. I was just answering to Kristoffer about the splat option. Didn't figure numbers were so bad.
Thanks for pointing out the correct track. Still a bit hard for me at this point to contribute to Base. Good luck to other people...
No worries. I didn't bother measuring the splatting option because I knew it was going to be terrible 😬
OK. That is surprisingly pretty good however in python
def unzip(z): return zip(*z)
def bench_zip_unzip():
z = [*zip(range(1_000_000), range(1_000_000))]
u = [*unzip(z)]
import timeit
timeit.timeit(bench_zip_unzip, number=1)
# 0.4816088059997128 second
one other thing that would be useful would be to allow it to take a function first argument, so that you could do e.g.
s,c = unzip(sincos, x)
for a vector (or array) x
.
Copying some thoughts from a thread on Slack, the utility of unzip
is that it works like collect
does on a single iterable, materializing it into a concrete vector but producing as many vectors as there are elements in each tuple that the iterable yields. The iterable must yield the same sized tuple each time; it need not, however, yield the same element types each time; unzip
should handle this the same way that collect or comprehensions do (they're slightly different, so I guess that's a design choice), for each vector that it outputs. So the result of unzip(itr)
should be the same as
ntuple(length(first(itr))) do i
[t[i] for t in itr]
end
except that it only consumes itr
a single time and it should error if any of the tuples has a different length than the first one. That will be a bit tricky to write 🤔. Mocked up example:
julia> itr = [(1,2,3),(3,1.5,'x')]
2-element Array{Tuple{Int64,Real,Any},1}:
(1, 2, 3)
(3, 1.5, 'x')
julia> unzip(itr)
([1, 3], Real[2, 1.5], Any[3, 'x'])
See #30987
Thanks, @bramtayl!
It will be good if we can unzip Array{Pair}
, Base.Iterators.Pairs
too. eg.
unzip([k=>v for (k,v) in zip('a':'c',1:3)]) == unzip(zip('a':'c',1:3))
bump
https://gist.github.com/bramtayl/2d6d323582f7c3158f9a1b5299997d68
Not skilled enough to make a PR, but I think this should be a pretty robust implementation
Would this be considered an acceptable solution?
unzip(z::Iterators.Zip) = z.is
unzip(itr) = unzip(collect(itr))
unzip(::AbstractArray) = error("Can only unzip array of tuples.")
@generated function unzip(a::AbstractArray{T}) where {T<:Tuple}
types = T.types
types[end] <: Vararg && error("Tuples in the array should have the same length.")
isempty(types) && return :(())
n = length(types)
quote
@nexprs $n i -> out_i = similar(a,$types[i])
@inbounds for j in eachindex(a)
@nexprs $n i -> out_i[j] = a[j][i]
end
return @ntuple $n out
end
end
90% of this implementation is taken from the package Destruct.jl by @spalato, who commented earlier in this thread.
This implementation is maximally efficient for dense arrays of tuples (which was the motivating example for this issue). It is slightly less efficient for arrays with non-standard indexing, due to reliance on eachindex
.
For the Zip
iterator, it just returns the underlying iterators without truncating them to a common size. I.e. a,b,c = unzip(zip(a,b,c))
is a no-op.
For other iterators and generators, it first collects the output into an array and unzips that. This is not ideal, since it requires twice as much memory, but I simply don't have enough experience with iterator internals to implement it better. (This is also why I haven't tried to ressurect @bramtayl's implementation: I simply couldn't wrap my head around it yet :smile:).
AFAICT, the above code fulfils all the functional requirements set out by @StefanKarpinski above. If this implementation is deemed acceptable, despite using generated functions and sacrificing efficiency in some cases, I can add tests and documentation and open a PR.
The code in the gist was written relatively recently, so I think it should be copy-pastable? It sounds like its a lot more fully featured than what your proposing (works with arbitrary iterators, returning a variable number of items). Just need someone to figure out the PR details (something I never seem to be able to do successfully)
I'm also happy to explain whatever is helpful. Lispy-tuple-recursion and iterative widening are both very useful to know how to do regardless.
It sounds like its a lot more fully featured than what your proposing (works with arbitrary iterators, returning a variable number of items).
Iterators are supported, but with some overhead. I think the main difference between the two variants is that you produce arrays with missing
values if the tuples have different length. Stefan's comment implies that an error should be thrown in this case, but I can see how using missing
values could be sometimes more useful. If this behavior is preferred, I think it can be implemented in the generating-function-based approach as well.
I'm also happy to explain whatever is helpful. Lispy-tuple-recursion and iterative widening are both very useful to know how to do regardless.
Thanks, I might have to take you up on that offer yet :smile:. But I really don't feel qualified to finish what you started there. I just spent several hours trying to understand your implementation and still only have a general idea of how it works. And that's without even touching on the missing
and widening stuff.
I also have to ask, why is this style preferred in Julia (in array library code, at least)? As far as I can tell, the end result is that a specialized method is produced for every combination of tuple and container type that is invoked in user code. Which is more or less the same as what a generated function would do. On the other hand, a generated function does exactly what it advertises, while the other solution goes back and forth through twenty layers of abstraction (this is not a metaphor; I think the call chain really is this deep, even though it all gets inlined in the end). I'm sorry if this sounds argumentative, and I'm also starting to think this may be the wrong place to ask such questions, but I'm genuinely curious why layered abstractions relying on vararg splatting on every other step are preferred to gen-funcs in general.
Two things:
1) Collecting into a array first is unworkable when some of the columns are small unions, because the eltype will be Any 2) Because Julia is a dynamic language, there is no return type. So you can't look at an iterator and guess what types each of the columns will end up being, or for that matter, how many columns there are going to be in the first place.
- Collecting into a array first is unworkable when some of the columns are small unions, because the eltype will be Any
You're right about that, of course. Though I think you are holding yourself to a higher standard than the standard library (no pun intended). Maybe it would be preferable to use your proposed changes to improve collect
, so that it doesn't widen types more than necessary? Then less skilled people, such as myself, could get away with falling back on it without too high a cost :smile:.
Also, if the "iterator" passed to unzip
is an actual array, there is the same problem: [(1,"a"), (2,3)]
infers as Array{Tuple{Int64,Any},1}
. Does your implementation tighten the output type to Tuple{Array{Int64,1}, Array{Union{Int64,String},1}}
in this case as well?
Yes
Hi there,
apologies if this has already been addressed somewhere, but is there a reason that there is no
unzip()
function in Base?Ideally this would be a function that would take a
Vector{Tuple{ ... }}
and return aTuple{Vector, ..., Vector}
for output. E.g.A naive implementation might be something like