JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

Add a keyword argument to `diff` which preserves length #42509

Open pdeffebach opened 3 years ago

pdeffebach commented 3 years ago

Currently Base.diff(x) produces a new vector with length length(x) - 1.

This is often annoying when working with tabular data, since you cannot do

df.x_diff = diff(df.x)

@nalimilan recommended a keyword argument to Base.diff that prepends a default value so that the length (and, for matrices and other arrays, the shape) is preserved.

Given the discussion in the issue below, I propose keyword arguments fillfirst and filllast, which give the value used to fill the first or last position, respectively, along the differenced dimension.

julia> begin 
       function newdiff(a::AbstractArray{T,N}; dims::Integer=1, 
                        fillfirst=nothing, 
                        filllast=nothing) where {T,N}
           Base.require_one_based_indexing(a)
           1 <= dims <= N || throw(ArgumentError("dimension $dims out of range (1:$N)"))

           r = axes(a)
           r0 = ntuple(i -> i == dims ? UnitRange(1, last(r[i]) - 1) : UnitRange(r[i]), N)
           r1 = ntuple(i -> i == dims ? UnitRange(2, last(r[i])) : UnitRange(r[i]), N)
           if fillfirst !== nothing  
               out = similar(a, Union{eltype(a), typeof(fillfirst)})
               out .= fillfirst
               out[r1...] .= view(a, r1...) .- view(a, r0...)
               return out
           elseif filllast !== nothing  
               out = similar(a, Union{eltype(a), typeof(filllast)})
               out .= filllast
               out[r0...] .= view(a, r1...) .- view(a, r0...)
               return out
           else
               # No fill value requested: keep the current behavior (one element shorter along dims)
               return view(a, r1...) .- view(a, r0...)
           end
       end
       end
newdiff (generic function with 1 method)

julia> x = collect(1:5); # separate method for ranges

julia> newdiff(x)
4-element Vector{Int64}:
 1
 1
 1
 1

julia> newdiff(x; fillfirst=0)
5-element Vector{Int64}:
 0
 1
 1
 1
 1

julia> newdiff(x; filllast=0)
5-element Vector{Int64}:
 1
 1
 1
 1
 0

nalimilan commented 3 years ago

I agree this is really needed. Another possibility for the API would be to have an argument called e.g. default or fill which you would set to missing.

timholy commented 3 years ago

The sad part about this is it's not obvious whether missing should go at the beginning or the end, and without reading the docs either guess would be reasonable. When you reduce by 1 there is no uncertainty about alignment, except for the fact that in a way we'd like to define the axes of the resulting array as 1.5:1:n-0.5.

pdeffebach commented 3 years ago

That's a reasonable point about ambiguity. But I'm not sure I've seen a context other than d_n = x_{n} - x_{n-1}, where we return missing if we don't know what x_{n-1} is, meaning d_1 is missing.

timholy commented 3 years ago

Well, given that arrays in Julia start by default with 1, I'd say the obvious formula is d[n] = x[n+1] - x[n]. That's consistent with the fact that d is shorter by 1 than x, and the fact that the first diff value you can compute is first(d) = x[begin+1] - x[begin]; there isn't an earlier one you can compute. It's thinking of the array as a one-dimensional iterable collection of values and not a function in 1d---the index is almost meaningless. But if you collect from the iterable, you get d[1] = x[2] - x[1]; voila, there is nothing inevitable about the fact that the missing goes in the first slot.

Thinking of it as a function in 1d is why I suggested that the natural axes for d are on the half-integers, i.e., d[n+0.5] = x[n+1] - x[n]. But we don't support non-integer indexing so we can't really do this.
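To make the half-integer picture concrete, here is a small sketch that pairs each difference with the midpoint of the two indices it spans. The variable names are only for illustration; nothing here is an existing or proposed API.

```julia
x = [1, 4, 9, 16, 25]
d = diff(x)                      # d[n] = x[n+1] - x[n]

# Midpoint "positions" between consecutive indices of x: 1.5, 2.5, ...
midpoints = (firstindex(x) + 0.5):(lastindex(x) - 0.5)

# Each difference sits between two samples, not on either of them.
for (p, v) in zip(midpoints, d)
    println("d at position ", p, " = ", v)
end
```

Padding at the front or at the back amounts to rounding these positions up or down, which is why neither choice is forced.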

KristofferC commented 3 years ago

I don't think we should introduce Missings into arrays like this. Missings have quite unintuitive behavior for people that are not frequent users of the data science stack.

"Another possibility for the API would be to have an argument called e.g. default or fill which you would set to missing."

Something like this would be better imho.

nalimilan commented 3 years ago

We could have fillfirst and filllast arguments (or just first and last) that one would set to missing or NaN or anything depending on the use case. Passing both would be an error.
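A minimal sketch of that argument handling, assuming nothing is used as the "not supplied" sentinel (so nothing itself could not be a fill value). The helper name fill_position is hypothetical and only shows the validation step.

```julia
# Hypothetical helper, not part of Base: decide where the fill value goes.
# `nothing` means "not supplied" in this sketch.
function fill_position(; fillfirst=nothing, filllast=nothing)
    if fillfirst !== nothing && filllast !== nothing
        throw(ArgumentError("pass at most one of `fillfirst` and `filllast`"))
    end
    fillfirst !== nothing && return (:first, fillfirst)
    filllast !== nothing && return (:last, filllast)
    return (:none, nothing)
end

fill_position(fillfirst=missing)   # fill goes in the first slot
fill_position(filllast=NaN)        # fill goes in the last slot
fill_position()                    # no fill: result is one element shorter
```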

petvana commented 3 years ago

What about NumPy syntax (prepend, append)?

 numpy.diff(a, n=1, axis=-1, prepend=<no value>, append=<no value>)

nickrobinson251 commented 3 years ago

What would be the benefit of something like diff(x; prepend=missing) over [missing; diff(x)] / vcat(missing, diff(x))? That it would maintain the container type?

nalimilan commented 3 years ago

That, and it would avoid making two allocations.
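As a rough illustration for the vector case, the sketch below compares the two-step vcat version with a version that allocates the output once via similar, which also lets the input's array type pick the container. The helper name diff_prepend is made up for this example.

```julia
x = [1, 4, 9, 16]

# Two-step version: diff allocates one array, vcat allocates a second one.
d_vcat = vcat(missing, diff(x))

# Single-allocation sketch for vectors (hypothetical helper name).
function diff_prepend(x::AbstractVector, fill)
    Base.require_one_based_indexing(x)
    out = similar(x, Union{eltype(x), typeof(fill)})
    isempty(out) && return out
    out[1] = fill
    for i in 2:length(x)
        out[i] = x[i] - x[i-1]
    end
    return out
end

d_single = diff_prepend(x, missing)
isequal(d_vcat, d_single)   # both give missing, 3, 5, 7
```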

piever commented 3 years ago

I also think this would be useful as a so-called "window function" that preserves input length. The interface I had in mind was along the lines of

diff(v, n; default)

where n denotes the shift to perform before subtracting, and default denotes the value to use when going outside the range of one of the two arrays. In practice, for a positive n, diff(v, n; default=missing) would have n missings at the beginning, and diff(v, -n; default=missing) would have n missings at the end.
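A minimal sketch of that interface for vectors. The name lagdiff is made up, and this version simply writes default into every slot whose shifted index falls out of range; that is one possible reading of the proposal.

```julia
# Hypothetical sketch of `diff(v, n; default)`; `lagdiff` is not an existing function.
function lagdiff(v::AbstractVector, n::Integer; default)
    Base.require_one_based_indexing(v)
    out = similar(v, Union{eltype(v), typeof(default)})
    for i in eachindex(v)
        j = i - n                      # index of the shifted element
        out[i] = checkbounds(Bool, v, j) ? v[i] - v[j] : default
    end
    return out
end

v = [1, 4, 9, 16, 25]
lagdiff(v, 2; default=missing)    # missing, missing, 8, 12, 16
lagdiff(v, -2; default=missing)   # -8, -12, -16, missing, missing
```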

There is a friction point in that it is not super clear to me whether the default should be used before subtracting (e.g., subtract from v a shifted version of v padded on one side with default) or whether we should compute diff and pad the result with default. The former is easy to implement lazily with ShiftedArrays, see https://github.com/JuliaArrays/ShiftedArrays.jl/pull/51#issuecomment-934871076, but not the latter. The two things are equivalent for default=missing, but different in general.
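The difference between the two readings can be seen with a shift of one and default=0. The two function names below are made up for the comparison, and the eager vcat stands in for what a lazy shifted array would do.

```julia
v = [1, 4, 9, 16]

# (a) Pad the shifted input with `default`, then subtract.
pad_then_subtract(v, default) = v .- vcat(default, v[1:end-1])

# (b) Compute the differences, then pad the result with `default`.
subtract_then_pad(v, default) = vcat(default, diff(v))

pad_then_subtract(v, 0)        # [1, 3, 5, 7]: first entry is v[1] - 0
subtract_then_pad(v, 0)        # [0, 3, 5, 7]: first entry is the default itself

# With `missing` the two agree, because v[1] - missing is missing.
isequal(pad_then_subtract(v, missing), subtract_then_pad(v, missing))
```

NumPy's prepend and append are documented as being applied to the input before differencing, which corresponds to reading (a).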

pdeffebach commented 3 years ago

Just to emphasize, regarding missings and in response to @KristofferC: we don't need to have missings here. Ideally an API would allow any value the user wishes for the unknown differences.

The main goal of this issue (as indicated by the title) is an operation which preserves shape, not necessarily introducing missings.

I think @petvana's idea about prepend and append is very good and would get rid of the ambiguity mentioned by Tim.

pdeffebach commented 3 years ago

All, I have updated the initial post in this issue with a proposal of fillfirst and filllast which makes no assumption about missings as a default value.

StefanKarpinski commented 3 years ago

What about this instead: introduce a wrap::Bool=false keyword which changes diff to produce a vector of the same size where the extra element (at the end) is the difference between the last and first element. Then if someone wants to replace that element with missing, they can just do an assignment afterwards. That doesn't cover the "prepend" case, but that seems less likely to be what someone wants as it changes the index that every difference ends up at, whereas putting a[end]-a[1] at the end leaves all the other differences where they would otherwise be.

Which does bring me to this option: d = diff(v); push!(d, missing). Same effect, probably doesn't actually do any additional allocation.
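A rough sketch of the wrap idea for vectors. The name wrapdiff is made up, and the trailing element is written literally as a[end] - a[1]; a strictly circular convention would use a[1] - a[end] instead, so treat that choice as an assumption.

```julia
# Hypothetical sketch of the `wrap` keyword for vectors; `wrapdiff` is a made-up name.
function wrapdiff(a::AbstractVector; wrap::Bool=false)
    d = diff(a)
    (wrap && !isempty(a)) || return d
    # Extra trailing element, taken literally as a[end] - a[1].
    return vcat(d, a[end] - a[1])
end

a = [1, 4, 9, 16]
wrapdiff(a)               # [3, 5, 7]
wrapdiff(a; wrap=true)    # [3, 5, 7, 15], same length as a
```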

nalimilan commented 3 years ago

The original motivation for this issue was to prepend missing. That's actually quite useful in modelling to create a variable giving the increase compared with the previous value: appending missing would be problematic for a causal interpretation as a future increase would be assigned to the current observation.

FWIW prepending missing is what R does with just diff(v) so there's clearly a use case for it (R doesn't even support appending missing). The NumPy method also accepts prepend and append arguments. Wrapping (a.k.a. circular shift) is yet another possibility, but I'm not sure it's the most common -- and anyway we can support all three behaviors.

Also note that d = diff(v); push!(d, missing) doesn't work as the eltype of d doesn't support missing in general.
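A short illustration of the element type problem, with one mutating and one non-mutating workaround. Both workarounds allocate a new array.

```julia
v = [1, 4, 9]
d = diff(v)                       # Vector{Int64}

# push!(d, missing) throws, because `missing` cannot be converted to Int64.
failed = try
    push!(copy(d), missing)
    false
catch
    true
end

# Widening the element type first makes room for `missing`.
d_wide = convert(Vector{Union{Missing, eltype(d)}}, d)
push!(d_wide, missing)            # 3, 5, missing

# Non-mutating alternative: build a new array instead of growing one.
d_alt = vcat(d, missing)          # 3, 5, missing
```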

JeffBezanson commented 3 years ago

We should also avoid push! to better prepare for the immutable future...