JuliaCollections / IterTools.jl

Common functional iterator patterns
Other
154 stars 29 forks source link

Iterator slicing/indexing #26

Open tkf opened 6 years ago

tkf commented 6 years ago

Is there a way to slice/index arbitrary iterator in Julia? Can you add a function ("slice") that can do that? It's like Python's itertools.islice and it does

silce(finite_iterator, indices) == collect(finite_iterator)[indices]

but it also works with infinite iterator and indices, provided that indices are strictly increasing (otherwise you need to cache the whole history). I actually coded up a quick implementation:

struct Slice{I, S}
    iter::I
    slice::S
end
const slice = Slice

mutable struct SliceState
    index::Int
    iter_state
    slice_state
    iter_value
end

Base.start(it::Slice) =
    SliceState(0, start(it.iter), start(it.slice), nothing)

function Base.done(it::Slice, state)
    done(it.slice, state.slice_state) && return true

    local x
    i = state.index
    n, ss = next(it.slice, state.slice_state)
    state.slice_state = ss
    si = state.iter_state
    @assert i < n
    while i < n
        if done(it.iter, si)
            return true
        end
        x, si = next(it.iter, si)
        i += 1
    end
    state.iter_state = si
    state.iter_value = x
    state.index = i

    return false
end

Base.next(it::Slice, state) = (state.iter_value, state)
Base.iteratorsize(::Type{<: Slice}) = Base.SizeUnknown()

using Base.Test
# @show collect(slice(1:100, 1:2:5))
@test collect(slice(1:100, 1:2:5)) == collect(1:2:5)
@test collect(slice(1:5, 1:2:10)) == collect(1:2:5)
@test collect(slice(1:10, [1, 3, 7, 11])) == [1, 3, 7]

Side notes: Besides obvious lack of documentation and good error message (rather than @assert), the most smelly part of the code is the mutable SliceState and the calls to the next of the underlying iterators. But it seems that this is the usual way to solve this kind of problem for Julia <= 0.6. I guess it's going to be easier with the new iteration protocol for Julia 1.0.

Can IterTools.jl and/or Base already do it? Am I missing something? If not, do you think it is a good API to add in IterTools.jl? I can make a PR with a bit more refined code if you like.

iamed2 commented 6 years ago

Great idea and thanks for the jumpstart! I'd love a PR.

I think it might be good to enforce the restriction that indices be strictly increasing. Perhaps just error if !issorted(slice)? Python gets around this by doing the equivalent of only accepting StepRange{<:Integer, <:Integer}.

done should not call next and advance the iterator. As a general rule, done should be cheap and simple. It's okay for next to return anything if done has returned true though, if that makes it easier.

Technically iteratorsize is IsInfinite if both iter and slice are IsInfinite, and SizeUnknown otherwise.

Don't forget eltype :)

slice has a particular meaning in Python which doesn't exist in Julia. I think we should choose another name for this. Brainstorming:

Another thought; this, nth, and takenth are fairly similar. I wonder if we can combine them.

tkf commented 6 years ago

Thanks for the feedback!

done should not call next and advance the iterator.

I don't think it's possible to do that here. That was actually my first try, but it didn't work:

struct Slice{I, S}
    iter::I
    slice::S
end
const slice = Slice

function Base.start(it::Slice)
    return (0, 0, start(it.iter), start(it.slice))
end

Base.done(it::Slice, ::Void) = true
Base.done(it::Slice, state) = done(it.slice, state[4]) ||
                              done(it.iter, state[3])
function Base.next(it::Slice, state)
    local x
    n, k_prev, si, ss = state
    k, ss = next(it.slice, ss)
    if k <= k_prev
        if k_prev < 1
            error("Slicing index starts with a number smaller than 1.")
        end
        error("Slicing index is not strictly increasing ($k_prev -> $k).")
    end
    # @assert n < k
    while n < k
        if done(it.iter, si)
            # return (????, (n, k, si, ss))
            return nothing  # cannot return any useful result here
        end
        x, si = next(it.iter, si)
        n += 1
    end
    return (x, (n, k, si, ss))
end

Base.iteratorsize(::Type{<: Slice}) = Base.SizeUnknown()

using Base.Test
@show collect(slice(1:100, 1:2:5))
@show collect(1:2:5)
@test collect(slice(1:100, 1:2:5)) == collect(1:2:5)
@test collect(slice(1:5, 1:2:10)) == collect(1:2:5)
# This doesn't work:
@test collect(slice(1:10, [1, 3, 7, 11])) == [1, 3, 7, 11]

The problem is that Slice iterator can't know when to stop unless advancing underlying Slice.iter. See if done(it.iter, si) block inside the while loop.

As a general rule, done should be cheap and simple.

I agree that a cheap done is the best choice if possible (which is the most of the cases), but it is not possible in some cases, including this one. It seems that this is the common approach for Julia < 1.0:

There's a fundamental awkwardness to using the start/next/done protocol to iterate things like streams or tasks that can't know if they're done or not until they've tried to get the next item. Currently, we work around this mismatch by advancing the iterator in done instead of in next.

--- StefanKarpinski in https://github.com/JuliaLang/julia/issues/8149

Hopefully, we can get rid of this sub-optimal solution in Julia 1.0: https://github.com/JuliaLang/julia/issues/18823

But until Julia 1.0, this is all we can do, I think. Is it OK to advance iterators in done?

Technically iteratorsize is IsInfinite if both iter and slice are IsInfinite, and SizeUnknown otherwise.

Ah, good catch. Thank. I'll fix that when I'm sending a RP.


More brainstorming for the name:

I like viewing since we already have Base.view for arrays. But it's not exactly similar to Base.view since obviously you can't do setindex! to iterator. It's more like a read-only view (and so I added iread/reading). But still viewing looks nice..

From your suggestions, I think pick is good. index is also nice when you use it in the macro form: @index iter[5:3:end].

Another thought; this, nth, and takenth are fairly similar. I wonder if we can combine them.

Yeah I agree. I think we can dispatch on the type of the second argument (indices or an index) but I think we can only combine two, like this:

Combining "slice" and nth:

f1(iter, indices)               # == "slice"(iter, indices)
f1(iter, index::Integer)        # == nth(iter, index)

Combining "slice" and takenth:

f2(iter, indices)               # == "slice"(iter, indices)
f2(iter, stride::Integer)       # == takenth(iter, stride)

I think choosing one of the two cases narrows possible set of names. For example, the name viewing and the interface f1 "slice"-&-nth do not work well. But all of your names seem to work for both interfaces.

If what you mean by "combine them" is just reusing the implementation, we can do it after a more efficient specific implementation for indices::StepRange is done. But I'd like to do a general implementation for a generic iterator first.

tkf commented 6 years ago

Come to think of it, I just realize I don't understand why takenth works without the next-in-done hack. Reading the code, I noticed I was making a stronger assumption for how IterTools should work. That is to say, I thought takenth tries to not advance the underlying iterator as much as possible. But it was not the case. That is to say, the code

for (i, x) in enumerate(iter)
    if i % 2 == 1
        f(iter, x)
    end
end

and

for x in takenth(iter, 2)
    f(iter, x)
end

are not equivalent if iter is stateful. For example, the code below does not work:

using Base.Test
using IterTools

buf = IOBuffer(join(1:10, "\n"))
for (i, line) in enumerate(takenth(eachline(buf), 2))
    @show i, line
    if i > 2
        break
    end
end

@test readline(buf) == "7"

It is common for stateful algorithm to provide an iterator interface (e.g., DifferentialEquations.jl) and it would be nice if IterTools.jl can interact well with such algorithms. For this, IterTools has to have a strong guarantee that it advances underlying iterator only at the very last moment. It means to advance iterators in done for Julia 0.6. But it will be easier to do in 1.0.

Do you think this stronger guarantee on how IterTools advances underlying iterators makes sense? Do you think the next-in-done hack is OK for Julia 0.6?

tkf commented 6 years ago

So I implemented iteratorsize/iteratoreltype/eltype and also length for the case indices are from OrdinalRange: https://gist.github.com/tkf/16b06fd6a4943c8d524295152c231b13

I can send a PR with this code (with some addition of docstring) if next-in-done hack is OK and the name is decided.

iamed2 commented 6 years ago

Can you come up with an example you're interested in where your implementation is faster than this one:

function myviewing(itr, inds)
    imap(last, Iterators.filter((x)->first(x) in inds, enumerate(itr)))
end

It's faster in most cases. For very large inds we could reduce search time by checking issorted and using searchsorted, or allocating a Set.

tkf commented 6 years ago

My implementation is doing mutation on fields of Any type. It's certainly not aimed for performance, at least at the moment. It's more about implementing it correctly. But I think faster implementation is possible in Julia 1.0.

The filter-based implementation doesn't work when inds is non-terminating. I don't know when my implementation becomes faster compared to it. It is always possible to "dispatch" to the filter-based implementation when inds is finite (and sufficiently small?).

tkf commented 6 years ago

Also, myviewing does not satisfy the "stronger guarantee" I was emphasizing:

using Base.Test
using IterTools

function myviewing(itr, inds)
    imap(last, Iterators.filter((x)->first(x) in inds, enumerate(itr)))
end

@testset "myviewing" begin
    buf = IOBuffer(join(1:10, "\n"))
    inds = [1, 3, 5]
    for (i, line) in zip(inds, myviewing(eachline(buf), inds))
        @test string(i) == line
    end
    @test_broken readline(buf) == "6"
end

But this is how Iterators.filter behaves in 0.6. I wonder if Julia devs want to fix the semantics of it.