JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

Interest for an `Iterators.nth(x, n)` API? #54454

Open ghyatzo opened 1 month ago

ghyatzo commented 1 month ago

Hello,

After searching far and wide in issues, PRs, and on Discourse, I could not find any discussion about adding an Iterators.nth(x, n) API for ease of use and simplicity. This is the only other reference to this possibility I could find.

I have played a little bit with it in the past during various projects and ended up with a slight evolution over the basic version mentioned by @stevengj in the linked post, which I am carrying around when needed:

_inbounds_nth(itr, n) = getindex(iterate(Base.Iterators.drop(itr, n-1)), 1)
_safe_nth(itr, n) = begin
    y = iterate(Base.Iterators.drop(itr, n-1))
    isnothing(y) ? nothing : getindex(y, 1)
end
nth(itr, n; skip_checks=false) = skip_checks ? _inbounds_nth(itr, n) : _safe_nth(itr, n)

simple_nth(itr, n) = first(Iterators.drop(itr, n-1))

This offers the option of skipping the bounds check, at the cost of a crash when n is out of range (as opposed to just returning nothing):

julia> itr = collect(1:10000)
julia> _safe_nth(itr, 10001)

julia> _inbounds_nth(itr, 10001)
ERROR: MethodError: no method matching getindex(::Nothing, ::Int64)
Stacktrace:
 [1] _inbounds_nth(itr::Vector{Int64}, n::Int64)
   @ Main .\REPL[198]:1
 [2] top-level scope
   @ REPL[205]:1

Skipping the check offers a decent performance benefit, although we can't escape the O(n) complexity without extra assumptions (not that I know of, at least):

julia> @btime _inbounds_nth(itr, 9999) setup=(itr=collect(1:10000))
  151.222 ns (0 allocations: 0 bytes)
9999

julia> @btime _safe_nth(itr, 9999) setup=(itr=collect(1:10000))
  4.400 μs (0 allocations: 0 bytes)
9999

julia> @btime simple_nth(itr, 9999) setup=(itr=collect(1:10000))
  4.414 μs (0 allocations: 0 bytes)
9999
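One example of an "extra assumption" that would escape the O(n) cost is random access. The following is a minimal sketch, not a proposed implementation: `nth_sketch` is a hypothetical name, and the `AbstractVector` method is an assumed fast path that indexes directly instead of iterating.

```julia
# Generic fallback: walks the iterator, so it costs O(n).
function nth_sketch(itr, n::Integer)
    y = iterate(Iterators.drop(itr, n - 1))
    return y === nothing ? nothing : y[1]
end

# Assumed fast path: vectors support indexing, so the nth element is a single lookup.
function nth_sketch(v::AbstractVector, n::Integer)
    i = firstindex(v) + n - 1
    return checkbounds(Bool, v, i) ? v[i] : nothing
end

a = nth_sketch(collect(1:10_000), 9_999)          # 9999, via indexing
b = nth_sketch(Iterators.filter(isodd, 1:20), 3)  # 5, via iteration
c = nth_sketch(1:10, 11)                          # nothing, out of range
```

Dispatch picks the indexed method for arrays and ranges, while lazy iterators such as `Iterators.filter` still go through the generic O(n) path.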

(btw simple_nth also errors out when called out of bounds).
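To make the out-of-bounds behaviour of the three variants concrete, here is a small sketch. The three definitions are copied from above so the snippet runs on its own; `outcome` is a hypothetical helper that reports the exception type instead of crashing.

```julia
# Definitions copied from the post above.
_inbounds_nth(itr, n) = getindex(iterate(Iterators.drop(itr, n - 1)), 1)
_safe_nth(itr, n) = begin
    y = iterate(Iterators.drop(itr, n - 1))
    isnothing(y) ? nothing : getindex(y, 1)
end
simple_nth(itr, n) = first(Iterators.drop(itr, n - 1))

# Hypothetical helper: return the result, or the type of the exception thrown.
outcome(f, itr, n) = try
    f(itr, n)
catch err
    typeof(err)
end

itr = collect(1:5)
outcome(_safe_nth, itr, 6)      # nothing
outcome(_inbounds_nth, itr, 6)  # MethodError
outcome(simple_nth, itr, 6)     # ArgumentError
outcome(simple_nth, itr, 5)     # 5
```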

Instead of opening a PR straight away, I wanted to check whether there is any desire for this kind of small QOL addition and, more importantly, to check a couple of doubts with more knowledgeable people:

Tortar commented 1 month ago

I just note that

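_safe_nth(itr, n) = begin
    y = iterate(Base.Iterators.drop(itr, n-1))
    # note: ifelse evaluates all of its arguments, so getindex(y, 1) still runs (and throws) when y === nothing
    ifelse(isnothing(y), nothing, getindex(y, 1))
end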

is as fast as your _inbounds_nth.

julia> @btime _safe_nth(itr, 9999) setup=(itr=collect(1:10000))
  161.977 ns (0 allocations: 0 bytes)
9999

Actually, I'm a bit confused by the fact that normal branching has such a high cost.

ghyatzo commented 1 month ago

That is great, I didn't know about ifelse! The performance disparity might be due to the fact that ifelse is a normal function call, so it evaluates all of its arguments beforehand, which might help with eliminating the branching altogether?

At this point there isn't really a reason to have a "safe" and an "unsafe" version. Might as well always check for nothing and have the best of both worlds.
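One thing worth checking before dropping the safe/unsafe split: because ifelse is an ordinary function, getindex(y, 1) is evaluated even when y is nothing, so the ifelse version appears to throw out of bounds just like _inbounds_nth, which may also explain the matching timings. A minimal sketch, using the hypothetical names `nth_ifelse` and `nth_branch`:

```julia
# ifelse is a regular function: all three arguments are evaluated before the call.
nth_ifelse(itr, n) = begin
    y = iterate(Iterators.drop(itr, n - 1))
    ifelse(isnothing(y), nothing, getindex(y, 1))
end

# The ternary operator only evaluates the branch that is selected.
nth_branch(itr, n) = begin
    y = iterate(Iterators.drop(itr, n - 1))
    y === nothing ? nothing : y[1]
end

v = collect(1:10)
in_range = nth_ifelse(v, 3)       # 3, both versions agree in bounds
safe_result = nth_branch(v, 11)   # nothing

# Out of bounds, the ifelse version calls getindex(nothing, 1) and throws.
threw = try
    nth_ifelse(v, 11)
    false
catch err
    err isa MethodError
end
```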

Tortar commented 1 month ago

Actually, I think the performance gain is just some kind of edge-case optimization. Consider this with your original version:

julia> itr = Iterators.filter(x -> x != 10, 1:10000);

julia> @btime _inbounds_nth($itr, 9999);
  7.086 μs (0 allocations: 0 bytes)

julia> @btime _safe_nth($itr, 9999);
  7.083 μs (0 allocations: 0 bytes)

In any case, I think that returning only the element, and not a new iterator starting from there, is not ideal, because usually one wants to continue iterating afterwards. So I would consider something like:

julia> nth(itr, n) = Iterators.peel(Iterators.drop(itr, n-1))

julia> @btime nth($itr, 9999);
  7.086 μs (0 allocations: 0 bytes)

But at the same time it is just a one-liner, so I'm not sure it is worth it.
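As a rough illustration of what the peel-based version hands back, the sketch below uses the hypothetical name `nth_with_rest` and a smaller filtered range than the benchmark. `Iterators.peel` returns the element together with an iterator over the remainder, so iteration can resume from there.

```julia
# Hypothetical name; same one-liner as above.
nth_with_rest(itr, n) = Iterators.peel(Iterators.drop(itr, n - 1))

itr = Iterators.filter(x -> x != 10, 1:20)   # yields 1..9, then 11..20

x, rest = nth_with_rest(itr, 10)             # x is the 10th element: 11
next_two = collect(Iterators.take(rest, 2))  # iteration continues with [12, 13]
```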

ghyatzo commented 1 month ago

I actually think that a function such as nth(itr, n) is more of an endpoint in the lifetime of an iterator: when you call nth, you get the end result and not the continuation of the iterator. It also matches the intuitive action of "get me the nth element", without forcing the user to deal with the rest of the iterator or its state at every call site, which follows the principle of least surprise.

For many intents and purposes, I see nth(itr, n) as a generalisation of the first(itr) function in Base:

nth(itr, n) = begin
    y = iterate(Base.Iterators.drop(itr, n-1))
    ifelse(isnothing(y), nothing, getindex(y, 1))
end

function first(itr)
    x = iterate(itr)
    x === nothing && throw(ArgumentError("collection must be non-empty"))
    x[1]
end

# it could become just this
# (not backward compatible and slower, I know, it's just to showcase)
first(itr) = nth(itr, 1)
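If nth were meant to generalise first more literally, it could also throw on exhaustion instead of returning nothing. This is only a sketch of that alternative: `nth_or_throw` and its error message are hypothetical, not anything in Base.

```julia
# Hypothetical variant that mirrors the contract of first(itr): throw when there is no nth element.
function nth_or_throw(itr, n::Integer)
    y = iterate(Iterators.drop(itr, n - 1))
    y === nothing && throw(ArgumentError("collection must have at least $n elements"))
    return y[1]
end

odds = Iterators.filter(isodd, 1:10)   # yields 1, 3, 5, 7, 9
same_as_first = nth_or_throw(odds, 1) == first(odds)
third = nth_or_throw(odds, 3)          # 5

# Asking for a 6th element throws, just as first does on an empty iterator.
err_type = try
    nth_or_throw(odds, 6)
catch err
    typeof(err)
end
```

Whether to throw or to return nothing is a design choice: throwing keeps nth(itr, 1) interchangeable with first(itr), while returning nothing avoids a try/catch at call sites that expect short iterators.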

In my opinion, the number of lines of code shouldn't matter when talking about APIs. If it's just a one-liner, all the better, but that shouldn't be a justification for leaving something out. For reference, this is the implementation of first(itr, n) and last(itr, n) in Base:

first(itr, n::Integer) = collect(Iterators.take(itr, n))
last(itr, n::Integer) = reverse!(collect(Iterators.take(Iterators.reverse(itr), n)))