JuliaArrays / ShiftedArrays.jl

Lazy shifted arrays for data analysis in Julia
Other
50 stars 10 forks source link

Changed the behaviour of boundary checking to be compatible with other AbstractArray types. #44

Closed RainerHeintzmann closed 3 years ago

RainerHeintzmann commented 3 years ago

In ShiftedArray and CircShiftedArray the boundaries are now checked according to the new array size. This allows them to be used for PaddedArrays and more. However, it is a breaking change. It may be possible to think about two variants, one with the old way of boundary checking and one with the standard way.

piever commented 3 years ago

Thank you, I've added a few comment. I think it'd be relevant to do some benchmarking to see if with the new behavior one can reduce the overhead for CircShiftedArrays. I think it's worth to compare

v, w = rand(1000), rand(1000)
w .= v .- ShiftedArrays.circshift(v, 1)

with

for i in eachindex(w)
    w[i] = i == 1 ? v[1] - v[end] : v[i] - v[i-1]
end

to see what is the overhead due to mod. If it is too much, it may be worth to think if it can be made more efficient by a simple if condition (and doing the mod computation upon construction).

roflmaostc commented 3 years ago

Hey, I checked timings:

julia> using BenchmarkTools

julia> f(w, v) = begin
       for i in eachindex(w)
                  w[i] = i == 1 ? v[1] - v[end] : v[i] - v[i-1]
              end
           end
f (generic function with 1 method)

julia> g(w, v) = w .= v .- ShiftedArrays.circshift(v, 1);

julia> v, w = rand(1000), rand(1000);

julia> @btime g($w, $v); # is based on the modulo idea 
  1.425 μs (0 allocations: 0 bytes)

julia> @btime f($w, $v);
  652.564 ns (0 allocations: 0 bytes)

I then switched the implementation to a non-mod one where shifts is processed at the beginning:

julia> @btime g($w, $v);
  457.609 ns (0 allocations: 0 bytes)

Performance is therefore ~3x better. Unfortunately shifts is not equal to the input anymore. Is that a problem? Semantically it is still the same.

piever commented 3 years ago

I then switched the implementation to a non-mod one where shifts is processed at the beginning. Performance is therefore ~3x better.

That's great! As a micro-optimization (we should really optimize getindex as much as possible, as it is basically the only performance critical code in the package), it'd be good to also try a branch-free version. For example, once we know that 0 <= shifts < size(parent), I would recommend benchmarking something like

function newindex(old_idx::Int, sz::Int, shift::Int)
    diff = sz - shift # I imagine the compiler will figure out that this is always the same computation throughout the for loop
    return ifelse(old_idx > diff, old_idx - diff, old_idx + shift)
end

or some related trick multiplying by a boolean, eg bringwithin(i, sz) = i - (i > sz) * sz, instead of the ternary operator ?. The reason is that, even though it looks like the ternary operator may do less operations, it has to "branch", which would end up preventing some optimizations on the for loop, so it would probably generate slower code in the end (this blog post explains it much more clearly.)

Unfortunately shiftsis not equal to the input anymore. Is that a problem? Semantically it is still the same.

That's why I'd like to have that change here: since we are breaking one thing, we should break both at the same time and bump the version number. I think it's perfectly fine as a behavior, esp. if we document that shifts is the original shift modulo the size of the array.

roflmaostc commented 3 years ago

I'm not sure if newindex works for OffsetAxis? I believe you need to know start and end position of your axis.

Nevertheless, I tried bringwithin using ifelse but it didn't provide a great speedup. The reason is probably, that the branch prediction fails only once in the index position where the switch between left and rights occurs. For all other indices the branch is always the same and branch prediction should get that correct. Does that make sense?

I compared the current implementation also with OffsetArrays and performance is roughly 15% worse which makes sense because we need that additional check. So I'd say it performs well.

julia> vs = ShiftedArrays.circshift(v, 1);

julia> ws = ShiftedArrays.circshift(w, 1);

julia> test_f(w, v) = begin
                    for i in eachindex(w)
                               w[i] = v[i]
                           end
                        end
test_f (generic function with 1 method)

julia> @btime test_f($ws, $vs)
  867.741 ns (0 allocations: 0 bytes)

julia> wo = OffsetArray(w, 2:1001);

julia> vo = OffsetArray(v, 2:1001);

julia> @btime test_f($wo, $vo)
  758.871 ns (0 allocations: 0 bytes)
piever commented 3 years ago

Thanks for the changes! I've added a few comments. Please try to respect the original formatting in terms of whitespace inside tuples and newlines (single newline within test blocks or method definitions).

roflmaostc commented 3 years ago

Another thing, we could incorporate: Wouldn't it be nice being able to remove the circshifted wrapper if we shift backwards the same shift? Something like this:

if iszero.(shifts)
     return parent(csa)
end

So the constructor should dispatch on the CircShiftedArray and add the shifts instead of wrapping them. And if 0, return the plain array.

Ideally the following would be basically an identity operation returning the original randn array instead of the concatenated shifts:

julia> using ShiftedArrays

julia> ShiftedArrays.circshift(ShiftedArrays.circshift(randn((2,2)), (1,1)), (1,1))
2×2 CircShiftedArray{Float64, 2, CircShiftedArray{Float64, 2, Matrix{Float64}}}:
  0.14058   -1.14846
 -0.992392   1.37158

In fftshift and ifftshift it would be quite nice, so that simply the circshifted wrapper is removed because ifft only works with a ShiftedArray if you do an allocate upfront calling it. So removing the wrapper once the shift is 0, prevents that.

piever commented 3 years ago

Looks good to me, I'll just wait a little bit to merge in case somebody has objections.

So the constructor should dispatch on the CircShiftedArray and add the shifts instead of wrapping them. And if 0, return the plain array.

I feel a bit uncomfortable changing the behavior of the constructor, but the various syntactic sugar functions (ie, lag, lead, circshift) could definitely be a bit smarter and just add the shifts if the underlying vector is a (Circ)ShiftedArray. Unwrapping if shift is zero is probably not ideal, because it would not be type stable.

roflmaostc commented 3 years ago

I agree on the type issue, could be problematic.

We are discussing for FourierTools.jl whether we introduce something like a FFTShiftand a IFFTShift type that could be removed, added, etc. We probably start implementing it there, and if it works fine, we could also add that to ShiftedArrays (if wanted, it also makes sense to have that only in FourierTools.jl)

roflmaostc commented 3 years ago

Ups, force-pushed the wrong repo, but restored it from a local version :smile:

roflmaostc commented 3 years ago

However, what I was trying to do: In the case of concatenation of ShiftedArrays (like fftshift and ifftshiftdo in combination) we could instead of generate a nested CircShiftedArray simply update the shift, right? The change would be minor, we just need a new constructor dispatching on CircShiftedArray. That would prevent the type instability because the CircShiftedArray would be still here, but just with zero shift (which can be easily removed in our FourierTools.jl then.

Since this would be breaking as well, what do you think of incorporating?

struct CircShiftedArray{T, N, S<:AbstractArray} <: AbstractArray{T, N}
    parent::S
    # shift stores the circshift but is modified with mod (see below) to be more efficient
    # in the getindex method
    # negative shifts (left shift) are converted to positive (right shifts) with a larger
    # shift number
    shifts::NTuple{N, Int}
    function CircShiftedArray(p::AbstractArray{T, N}, n = Tuple(0 for i in 1:N)) where {T, N}
        @assert all(step(x) == 1 for x in axes(p))
        n = map(mod, _padded_tuple(p, n), size(p))
        new{T, N, typeof(p)}(p, n)
    end 

    function CircShiftedArray(csa::CircShiftedArray, n = Tuple(0 for i in 1:N))
        CircShiftedArray(parent(csa), n .+ csa.shifts)
    end 
end

Results in:

julia> x = [1,2,3,4,5]
5-element Vector{Int64}:
 1
 2
 3
 4
 5

julia> xs = ShiftedArrays.fftshift(x)
5-element CircShiftedVector{Int64, Vector{Int64}}:
 4
 5
 1
 2
 3

julia> xss = ShiftedArrays.ifftshift(xs)
5-element CircShiftedVector{Int64, Vector{Int64}}:
 1
 2
 3
 4
 5

julia> xss.shifts
(0,)
piever commented 3 years ago

LGTM, if you are happy with it I can go ahead and merge it. We should probably discuss the nested lag, lead and circshift in a separate PR / issue. (About that, I would definitely prefer to keep the CircShiftedArray constructor as is, but we could consider special-casing circshifted(c::CircshiftedArray, shift), lag(s::ShiftedArray, shift), and lead(s::ShiftedArray, shift) .)

roflmaostc commented 3 years ago

I believe we can merge, yeah. This special-casing sounds pretty nice imo.