pdeffebach opened 3 years ago
I agree this is really needed. Another possibility for the API would be to have an argument called e.g. `default` or `fill` which you would set to `missing`.
The sad part about this is it's not obvious whether `missing` should go at the beginning or the end, and without reading the docs either guess would be reasonable. When you reduce by 1 there is no uncertainty about alignment, except for the fact that in a way we'd like to define the axes of the resulting array as `1.5:1:n-0.5`.
That's a reasonable point about ambiguity. But I'm not sure I've seen a context other than `d_n = x_n - x_{n-1}`, where we return `missing` if we don't know what `x_{n-1}` is, meaning `d_1` is `missing`.
Well, given that arrays in Julia start by default with 1, I'd say the obvious formula is `d[n] = x[n+1] - x[n]`. That's consistent with the fact that `d` is shorter by 1 than `x`, and the fact that the first diff value you can compute is `first(d) = x[begin+1] - x[begin]`; there isn't an earlier one you can compute. It's thinking of the array as a one-dimensional iterable collection of values and not a function in 1d; the index is almost meaningless. But if you `collect` from the iterable, you get `d[1] = x[2] - x[1]`; voilà, there is nothing inevitable about the fact that the `missing` goes in the first slot.
Thinking of it as a function in 1d is why I suggested that the natural axes for `d` are on the half-integers, i.e., `d[n+0.5] = x[n+1] - x[n]`. But we don't support non-integer indexing, so we can't really do this.
I don't think we should introduce Missings into arrays like this. Missings have quite unintuitive behavior for people who are not frequent users of the data science stack.
> Another possibility for the API would be to have an argument called e.g. `default` or `fill` which you would set to `missing`.

Something like this would be better imho.
We could have `fillfirst` and `filllast` arguments (or just `first` and `last`) that one would set to `missing` or `NaN` or anything depending on the use case. Passing both would be an error.
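A minimal sketch of what such keywords could behave like, assuming the `fillfirst`/`filllast` names floated in this thread (this is not an existing API, and the standalone name `paddiff` is made up here to avoid clashing with `Base.diff`):

```julia
# Hypothetical sketch: diff with fillfirst/filllast keywords.
# Passing both is an error; with either one the result has the input's length.
function paddiff(x::AbstractVector; fillfirst=nothing, filllast=nothing)
    fillfirst !== nothing && filllast !== nothing &&
        throw(ArgumentError("pass only one of fillfirst or filllast"))
    d = diff(x)
    fillfirst !== nothing && return vcat(fillfirst, d)  # pad at the front
    filllast !== nothing && return vcat(d, filllast)    # pad at the back
    return d                                            # plain diff
end

paddiff([1, 4, 9, 16]; fillfirst=missing)  # [missing, 3, 5, 7]
paddiff([1, 4, 9, 16]; filllast=0)         # [3, 5, 7, 0]
```

Using `nothing` as the "not passed" sentinel means you could not pad with `nothing` itself, which is fine for a sketch but would need a different sentinel in a real implementation.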
What about NumPy syntax (`prepend`, `append`)? `numpy.diff(a, n=1, axis=-1, prepend=<no value>, append=<no value>)`
What would be the benefit of something like `diff(x; prepend=missing)` over `[missing; diff(x)]` / `vcat(missing, diff(x))`? That it would maintain the container type?
That, and it would avoid making two allocations.
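For reference, this is what the current workaround looks like, with its two allocations (one for `diff`, one for the concatenation) and the widened element type:

```julia
x = [1, 4, 9, 16]
d = [missing; diff(x)]   # allocates twice: once for diff(x), once for the vcat
eltype(d)                # Union{Missing, Int64} (on a 64-bit machine)
```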
I also think this would be useful as a so-called "window function" that preserves input length. The interface I had in mind was along the lines of `diff(v, n; default)`, where `n` denotes the shift to perform before subtracting, and `default` denotes the value to use when going outside the range of one of the two arrays. In practice, for a positive `n`, `diff(v, n; default=missing)` would have `n` missings at the beginning, and `diff(v, -n; default=missing)` would have `n` missings at the end.
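A sketch of that window-function interface (the name `windiff` is made up here to avoid clashing with `Base.diff`; the `n`/`default` semantics are the ones described above):

```julia
# Hypothetical window-function diff: same length as the input; entries whose
# n-shifted partner falls outside the array get `default`.
function windiff(v::AbstractVector, n::Integer; default=missing)
    [checkbounds(Bool, v, i - n) ? v[i] - v[i-n] : default for i in eachindex(v)]
end

windiff([1, 4, 9, 16], 1)               # [missing, 3, 5, 7]
windiff([1, 4, 9, 16], -1; default=0)   # [-3, -5, -7, 0]
```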
There is a friction point in that it is not super clear to me whether the `default` should be used before subtracting (e.g., subtract from `v` a shifted version of `v` padded on one side with `default`) or to compute `diff` and pad the result with `default`. The former is easy to implement lazily with ShiftedArrays, see https://github.com/JuliaArrays/ShiftedArrays.jl/pull/51#issuecomment-934871076, but not the latter. The two things are equivalent for `default=missing`, but different in general.
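The difference between the two interpretations is easy to see with a non-`missing` default such as `0` (both forms are hand-rolled here; neither is an existing API):

```julia
x = [5, 7, 10]

# (a) pad a shifted copy with the default, then subtract:
x .- [0; x[1:end-1]]   # [5, 2, 3] -- first entry is x[1] - 0

# (b) compute diff, then pad the result with the default:
[0; diff(x)]           # [0, 2, 3] -- first entry is the default itself

# With default=missing both give [missing, 2, 3], since missing minus
# anything is missing; with any other default the first entries differ.
```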
Just to emphasize, about `missing`s, in response to @KristofferC: we don't need to have `missing`s here. Ideally an API would allow for any value the user wishes for the unknown differences. The main goal of this issue (as indicated by the title) is an operation which preserves shape, not necessarily introducing `missing`s.
I think @petvana's idea about `prepend` and `append` is very good and would get rid of the ambiguity mentioned by Tim.
All, I have updated the initial post in this issue with a proposal of `fillfirst` and `filllast` which makes no assumption about `missing`s as a default value.
What about this instead: introduce a `wrap::Bool=false` keyword which changes `diff` to produce a vector of the same size where the extra element (at the end) is the difference between the last and first element. Then if someone wants to replace that element with `missing`, they can just do an assignment afterwards. That doesn't cover the "prepend" case, but that seems less likely to be what someone wants, as it changes the index that every difference ends up at, whereas putting `a[end]-a[1]` at the end leaves all the other differences where they would otherwise be.
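A sketch of the `wrap` idea (the standalone name `wrapdiff` is made up here; the actual proposal is a keyword on `diff` itself):

```julia
# Same-length diff whose extra trailing element is, as proposed above,
# the difference between the last and first element.
wrapdiff(x::AbstractVector) = [diff(x); x[end] - x[begin]]

wrapdiff([1, 4, 9, 16])   # [3, 5, 7, 15]
```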
Which does bring me to this option: `d = diff(v); push!(d, missing)`. Same effect, probably doesn't actually do any additional allocation.
The original motivation for this issue was to prepend `missing`. That's actually quite useful in modelling to create a variable giving the increase compared with the previous value: appending `missing` would be problematic for a causal interpretation, as a future increase would be assigned to the current observation.
FWIW prepending `missing` is what R does with just `diff(v)`, so there's clearly a use case for it (R doesn't even support appending `missing`). The NumPy method also accepts `prepend` and `append` arguments. Wrapping (a.k.a. circular shift) is yet another possibility, but I'm not sure it's the most common; anyway, we can support all three behaviors.
Also note that `d = diff(v); push!(d, missing)` doesn't work, as the eltype of `d` doesn't support `missing` in general.
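Concretely, the eltype problem with the `push!` approach:

```julia
x = [1, 4, 9, 16]
d = diff(x)              # Vector{Int}; it cannot store missing
# push!(d, missing)      # errors: missing cannot be converted to Int
d2 = vcat(d, missing)    # allocates a new vector with a widened eltype
eltype(d2)               # Union{Missing, Int64} (on a 64-bit machine)
```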
We should also avoid `push!` to better prepare for the immutable future...
Currently `Base.diff(x)` produces a new vector with length `length(x) - 1`. This is often annoying when working with tabular data, since you cannot do

@nalimilan recommended a keyword argument to `Base.diff` which allows for prepending a default value such that length, and shape more generally in the case of matrices and other arrays, is preserved. Given the discussion in the issue below, I propose keyword arguments `fillfirst` and `filllast`, which indicate the value prepended or appended to the array.