Open — bkamins opened this issue 2 years ago
What would the sentinel value do, simply provide the value with which to pad? Is this different from the idea put forth on Discourse, or is it a reminder (which would be fine)? If it is different, please recap what the other functionality is -- how it ought to behave? Thanks.
This is a new idea. The point is that e.g. `rollmean` shortens a vector, while `runmean` keeps the vector length but shortens the window on initial values. Often one needs `rollmean` (i.e. no window shortening) but also needs to keep the original length of the vector (e.g. if you store it in a data frame that has a fixed number of rows). In such cases you need to add some sentinel (usually `missing`) to make the vector returned by `rollmean` have the original length.
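The length difference is easy to see with a minimal Base-only sketch (these helpers are illustrative stand-ins, not the package implementations):

```julia
using Statistics  # mean

# Full-window roll: one result per complete window, so the output is shorter.
roll_sketch(f, v, w) = [f(view(v, i:i+w-1)) for i in 1:length(v)-w+1]

# Tapered run: the window shrinks at the start, so the output keeps the length.
run_sketch(f, v, w) = [f(view(v, max(1, i-w+1):i)) for i in eachindex(v)]

roll_sketch(mean, 1:5, 3)  # 3 elements: [2.0, 3.0, 4.0]
run_sketch(mean, 1:5, 3)   # 5 elements: [1.0, 1.5, 2.0, 3.0, 4.0]
```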
I can see how `missing` may want to occupy the low indices (the past) or the high indices (current data, perhaps unresolvable with fewer observations than the window length). Does this filler ever have reason to be within the data sequence rather than at the start or the end -- interspersed with the observations? If so, to what end?
This is how pandas does it in https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html (I did not request this full functionality, but something simpler; if you want to investigate it, this is, I think, a standard reference). In short:
- pandas pads with `np.nan` (I proposed to be more flexible and allow any filler);
- the amount of padding is governed by the `min_periods` parameter (I proposed not to add it; `rollmean` assumes that `min_periods` is just the window size), but we could add it if you wanted;
- the placement is governed by the `center` argument (we could have three values: fill at beginning, fill at end, fill evenly on both sides; I do not see a reason to have a filler in the middle).

Introducing `missing` suggests that each application of a rolled function must first check `ismissing` or dispatch differently on encountering `x::Missing`. I could keep a displacement count instead; that would have no slow-down. If the displacement were signed to indicate from which place to offset, the early indices or the recent indices, then perhaps your request would be met without disrupting processing. For situations where there may be `missing` entries within the body of the data -- no events to observe on a weekend, but all dates are tracked -- a different approach is required. Maybe a choice of LOCF or an average of the most recent 2/3/? observations? Any thoughts?
I prefer keeping advances to that which can be stated simply, and having specific requests gel together. Nonetheless, resituating the underlying computation will be required, so it is best to have a clear view of the essentials.
Is the discourse item replaced by this refinement, or is there a distinction I may have forgotten?
So my original proposal was an essential. Currently the user needs to do the padding manually:
```julia
julia> [fill(missing, 2); rollmean(1:5, 3)] # fill at front
5-element Vector{Union{Missing, Float64}}:
 missing
 missing
 2.0
 3.0
 4.0
```
or
```julia
julia> [missing; rollmean(1:5, 3); missing] # centered
5-element Vector{Union{Missing, Float64}}:
 missing
 2.0
 3.0
 4.0
 missing
```
This is quite inconvenient, especially for new users of Julia.
What I would think makes sense:
This is not a restatement of https://discourse.julialang.org/t/rollingfunctions-with-a-variable-window-width/87601. In that thread the OP wanted to be able to specify a different window width for each observation, so that was a different requirement (this issue is a request for padding, not for variable width).
I don't know how to read "specify a different window for each observation." Windows span multiple observations by definition. Does this mean at each observation use however many earlier observations to determine the current function value and that however many may be unique to each observation [index]? If so, "no". Your request above is much more widely useful.
Is there a reason to actually pad rather than know where to begin/end? The data sequence belongs to the client. I would be more comfortable providing a very easy, perhaps nearly invisible way to copy with padding inserted rather than grab the information and alter its e.g. type. A keyword or two would allow this. Another function that actually modifies the source data sequence could be exported for advanced users.
```julia
roll(fn, window, dataseq; pad = :first, skip = 20)
roll(fn, window, dataseq; padding = (from = :first, count = 20))
```
then currying or structifying (fn, window) is possible
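For instance, currying the `(fn, window)` pair could look like this (the `roll` below is a hypothetical stand-in with full-window semantics, not the exported API):

```julia
# Hypothetical stand-in for the exported function, full-window semantics.
roll(fn, window, dataseq) =
    [fn(view(dataseq, i:i+window-1)) for i in 1:length(dataseq)-window+1]

# Curried form: fix (fn, window) once, reuse across data sequences.
roll(fn, window) = dataseq -> roll(fn, window, dataseq)

sum3 = roll(sum, 3)
sum3(1:5)  # [6, 9, 12]
```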
Does this mean at each observation use however many earlier observations to determine the current function value and that however many may be unique to each observation [index]?
This is what OP on Discourse wanted.
Is there a reason to actually pad rather than know where to begin/end?
This is also OK, but would require an additional step later. The point is often users want to do something like:
```julia
julia> df = DataFrame(a=1:5)
5×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     3
   4 │     4
   5 │     5

julia> df.m = rollmean(df.a, 2)
ERROR: ArgumentError: New columns must have the same length as old columns
```
and now, as you see, they get an error.
They could do:
```julia
julia> df.m = runmean(df.a, 2)
5-element Vector{Float64}:
 1.0
 1.5
 2.5
 3.5
 4.5

julia> df
5×2 DataFrame
 Row │ a      m
     │ Int64  Float64
─────┼────────────────
   1 │     1      1.0
   2 │     2      1.5
   3 │     3      2.5
   4 │     4      3.5
   5 │     5      4.5
```
but often they want the first observations, which do not have a full window size, to be some sentinel (like `missing`) rather than the function applied with a shorter window.
ahh. ic
@bkamins I'm developing a fork of RollingFunctions that has a padded rolling function `mappadroll`: https://github.com/kevindirect/MapSlide.jl
It's a WIP, but it does work.
Thank you for this work.
@JeffreySarnoff - what is the status of your efforts towards 1.0 release of RollingFunctions.jl?
reasonably good -- external constraints put the prerelease at Saturday Jan 21 and the release (or release candidate) at Sunday Feb 5th.
@JeffreySarnoff - I saw a mention in this issue, but now I do not see it, so I am not sure what it was exactly. However, let me re-state the original request: when using rolling functions, one should be able to get a result of the same length as the input vector, padded with some sentinel (`missing` is a reasonable default).
Got that..
What is bugging me is that I cannot process your views without including all of DataFrames. Would you consider a future release that separates out a least-necessary package `DataFramesCore.jl` that would do just enough so I could use it and define `expansive_view = Union{AbstractArray, AbstractDataFrame}`?
What is bugging me is that I cannot process your views without including all of data frames.
Can you please explain the problem? I would assume that RollingFunctions.jl does not need to be aware of DataFrames.jl at all. It takes a vector as an input and should produce vector in output.
That is true. People have requested that it work with DataFrame column views, and, just for general symmetry, that we have two well-developed sorts of view with no common superview type nor a good way to fake it with a Union is less than ideal. I will forget about it though.
Indeed you can get a view as an input to your rolling function. But it should not matter, I thought. I assumed that I would have a signature that restricts the input type to `AbstractVector` without caring if it is a view or not.
Yes, I use `AbstractVector{T}` and also `AbstractMatrix{T}` (for running some function over multiple columns in one go).
And why is this not enough?
these are vector examples (fyi)
it is -- really it all goes well. I have been musing about some shallow type hierarchies for Julia v2.
Thank you, https://github.com/JeffreySarnoff/RollingFunctions.jl/blob/v1/docs/src/intro/padding.md seems exactly what is needed (in other frameworks there is also an option to pad in the middle so that half of padding is in front and half of padding is at end).
Conceptually this means that, as we move the window, the question is to which observation we attach the result.
Padding to the center can be added later if it is a problem (but it would be good to keep in mind that people sometimes want to say that Tuesday is a mean of Monday, Tuesday and Wednesday).
Regarding the types challenge - I discussed the issue with Jeff Bezanson yesterday - and it is a tough thing.
with padding in the middle and an odd number of paddings, which end gets more or is that another settable choice?
I had considered letting people specify n1 pads at the start and n2 pads at the end -- that seemed an opportunity for people to get it wrong.
which end gets more or is that another settable choice?
😄. We seem to think alike. I was copying the functionality that people asked for, but when I did this I started thinking about exactly the same problem. Maybe a more general solution would be to add a numeric `padding_offset` (tentative name), where when it is set to `0` (the default and minimum allowed value) we get padding first, and when set to `window_size - 1` (the maximum allowed value) we get padding last. (Now I see it is similar to what you said with `n1` and `n2`, but I think it is clearer to use `window_size` and `offset`.)
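A sketch of how such a `padding_offset` could behave (hypothetical kwarg names; `padded_roll` is illustrative, not part of the package):

```julia
using Statistics  # mean

# padding_offset = 0 puts all padding first (the default);
# padding_offset = window - 1 puts all padding last; values between split it.
function padded_roll(fn, window, data; pad = missing, padding_offset = 0)
    core = [fn(view(data, i:i+window-1)) for i in 1:length(data)-window+1]
    back = padding_offset          # pads placed at the end
    front = (window - 1) - back    # remainder goes at the front
    vcat(fill(pad, front), core, fill(pad, back))
end

padded_roll(mean, 3, 1:5)                      # [missing, missing, 2.0, 3.0, 4.0]
padded_roll(mean, 3, 1:5; padding_offset = 1)  # [missing, 2.0, 3.0, 4.0, missing]
```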
At the next-to-last minute, I decided to keep compatibility with current code. Realigning to that -- still expect to unveil on Sunday Feb 5th.
Here be something (unregistered while I heal).
```julia
using Pkg
Pkg.add(url="https://github.com/JeffreySarnoff/RollingFunctions.jl", rev="main")
```
The entire edifice has been reworked.
You can pad at the start or at the end with almost any value.
You can copy padding values from a vector, in order.
This helps with connecting one roll to another.
This keeps the result clean when a function is not well defined until it has more than 1, 2, .. values.
The current docs reflect the tests and other working details in need of tests.
IncrementalAccumulators exists to be incorporated, and will replace the older implementations (`rollmean`, ...).
this would add substantive performance

I have not been able to figure out how to get `@turbo` to work with this. Experiments show `@turbo` would improve the performance of many internal functions by 2x-3x. In the example below, `data` is a view, and using `@views` on the `result` line adds some throughput to the more direct internal functions.
```julia
lo = 1
hi = window_span
@turbo for idx in 1:n
    result[idx] = window_fn(data[lo:hi])
    lo += 1
    hi += 1
end
```
fails with

```
ERROR: MethodError: no method matching
_gesp(::LayoutPointers.StridedPointer{Float64, 1, …},
::Static.StaticInt{1}, ::UnitRange{Int64}, ::Static.StaticInt{1})
```
Thank you for working on this. I have checked the code and have the following comments:

You use `Nothing` as the sentinel. It would be better to define an internal

```julia
struct NoPadding end
```

structure and set `NoPadding()` as the default (and not export `NoPadding`). The reason is that `Nothing` is a valid value in Base Julia; in the case where someone wanted to pad with it, it would not work as expected (I agree it is a corner case, but I usually prefer a clean design):
```julia
julia> rolling(sum, [1, 2, 3], 2, padding=Missing)
3-element Vector{Union{Int64, DataType}}:
 Missing
 3
 5

julia> rolling(sum, [1, 2, 3], 2, padding=Nothing) # unexpected
2-element Vector{Int64}:
 3
 5
```
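The sentinel-type pattern suggested above can be sketched as follows (`rolling_sketch` is illustrative; only the `NoPadding` idea comes from the discussion):

```julia
# An internal singleton marks "no padding requested", leaving `nothing`
# (a legal Julia value) free to be used as an actual pad value.
struct NoPadding end

function rolling_sketch(fn, data, window; padding = NoPadding())
    core = [fn(view(data, i:i+window-1)) for i in 1:length(data)-window+1]
    padding isa NoPadding && return core   # default: shortened result
    vcat(fill(padding, window - 1), core)  # otherwise pad at the front
end

rolling_sketch(sum, [1, 2, 3], 2)                     # [3, 5]
rolling_sketch(sum, [1, 2, 3], 2; padding = nothing)  # [nothing, 3, 5]
```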
`rolling` errors unexpectedly. Here is an example:
```julia
julia> rollmax([1, 2, 3], 2)
2-element Vector{Int64}:
 2
 3

julia> rolling(max, [1, 2, 3], 2)
ERROR: BoundsError: attempt to access 0-element Vector{Any} at index [1]
```
I would have expected these two operations to be equivalent (but I might be wrong).
Padding is supported only for `rolling`. Example:
```julia
julia> rolling(mean, [1, 2, 3], 2, padding=missing)
3-element Vector{Union{Missing, Float64}}:
 missing
 1.5
 2.5

julia> rollmean([1, 2, 3], 2, padding=missing)
ERROR: MethodError: no method matching rollmean(::Vector{Int64}, ::Int64; padding::Missing)
```
I would expect `padding` and `padlast` to work in both cases.
Docstrings are missing. Example:

```
help?> rolling
search: rolling RollingFunctions rollmin rollmedian rollvariation rollkurtosis rollmad_normalized

  No documentation found.

  RollingFunctions.rolling is a Function.
```
I think it would be very useful to add them.
If you would like my help with any of these points please comment and I will gladly do it.
supplemental information attached (to be updated on occasion)
When I researched rolling functions over windowed data initially (2017), there were packages and products that provided padding with a given value. I saw nothing about tapering to provide values for incomplete window spans. To the best of my knowledge (without claiming anything more), RollingFunctions.jl introduced efficient, well-structured, and functionally broad tapering to many. Now this capability appears in many packages and in multiple languages.
The implementation followed a few stylized approaches, allowing reasonably concise and compact code to cover many sorts of windowed summarization.
throughout

- features
  - positional args, compatible with v0.7
  - `exportedfunction(summarizer, datasequence, windowspan)`

Many clients will use some preferred set of windowspans with some selected set of summarization functions, each specialized to the nature of the datasequence. This suggests that the datasequence appear after the windowspan and the summarizer args. These two orderings meet that constraint. Which is preferable in use with respect to currying and maybe some use of closures?

- `exportedfunction(summarizer, windowspan, datasequence)`
- `exportedfunction(windowspan, summarizer, datasequence)`

current best practice

- SIMD almost everywhere
- multithreading wherever helpful

current approach

- enhancements
  - pad at the start
    - with a specified value
    - with an AbstractVector
    - with an NTuple?
  - pad at the end
    - with a specified value
    - with an AbstractVector
    - with an NTuple?
- the unit is _tally of successive indices_

```
n = length(data[column])
s = window_span
u = window_prespan (= s - 1)
p = count of indices to pad (0, or initial indices in 1..u)
t = count of indices to taper (0, or the u - p indices adjacent to the initial p)
u == p + t ∧ p, t ∈ [0..u]
```
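A tiny worked instance of the tally bookkeeping above (variable names follow the sketch; the concrete numbers are illustrative):

```julia
n = 10     # length(data[column]): ten observations
s = 4      # window_span
u = s - 1  # window_prespan: indices before the first full window
p = 1      # choose to pad the first index
t = u - p  # taper the remaining two pre-span indices
@assert u == p + t && 0 <= p <= u && 0 <= t <= u
```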
re: padding is supported only for `rolling`

This was intentional (not the lack of support) -- the prior versions of `roll<op>` (v0.7) are to be improved using implementation logic from IncrementalAccumulators. These should perform much better.
They all are of a consistent, flexible design and share a nice api (thank you).
The new, more capable api is call-signature compatible with release v0.7.x -- that is, call signatures used with the high-level exports of v0.7.x (rolling and running) yield very nearly the same results when used with the new api as it is specified at present.
We could support greater generality with fewer lines of code by changing from distinguishing functions of arity 1, 2, 3, and 4 by dispatch over 1, 2, 3, or 4 vectors to a Tuple form, capturing the number of args as N, the number of vectors:

```julia
op(windowfunc::Function, windowspan::Integer,
   data::Tuple{Vararg{<:AbstractVector,N}}) where {N}
```
We would code unary window functions explicitly: (a) providing well-grounded reference implementations, (b) providing evaluative optimizations and applicative speed-ups.

```julia
op(windowfunc::Function, windowspan::Integer, data::AbstractVector{T}) where {T}
```
And we could do the same with window functions of 2 args, as e.g. `cov` and `cor` are widely used, so some code optimization makes sense. While we may choose to code 2-arg functions explicitly, that decision is not complexity driven. Using a generalized, shared-signature implementation for functions of 2+ args may make better sense. Either way, we process the generalized input by looping with reductive computation, in concert with gobbling observations to supply the args near-simultaneously.
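As a rough illustration of the generalized Tuple signature (this `roll_tuple` is a hypothetical sketch, not the package API), one method covers any arity by taking the same window from each vector at once:

```julia
using Statistics  # cor

# One method for any arity: splat a per-vector window view into the function.
function roll_tuple(fn, window, data::Tuple{Vararg{AbstractVector}})
    n = minimum(length, data)  # roll only over the shared index range
    [fn((view(v, i:i+window-1) for v in data)...) for i in 1:n-window+1]
end

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
roll_tuple(cor, 3, (x, y))  # two windows, each perfectly correlated (≈ 1.0)
```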
Gaining performance (2x-3x in throughput generally, 5x-10x in recurring client situations) is available, though sometimes not straightforward and sometimes requiring additional dev work with third-party packages. Work applying e.g. LoopVectorization, SIMD, and the like to our task is necessary.
@bkamins The previous note explains my remaining high-level API decision for v1.
(+) I prefer to support rolling/running multi-arity functions that way.
(-) Clients now rolling `cov` or `cor` would need to wrap data as 2-tuples.

- (old) `cor(window_span, datavec1, datavec2)`
- (new) `cor(window_span, (datavec1, datavec2))`
What do you see?
I am not fully clear what the benefit of `cor(window_span, (datavec1, datavec2))` over `cor(window_span, datavec1, datavec2)` is.
Two considerations:
- with the `Tuple` notation it seems inconsistent whether you should pass a single vector unwrapped or wrapped in a tuple like `(vector,)`;
- for example, looking at the signature of `rolling`, I imagine the new signature could be defined as:

```julia
rolling(window_fn::Function, data::AbstractVector...;
        window_span::Int,                                # kwarg, required
        weights::Union{AbstractVector, Nothing}=nothing, # kwarg, optional
        padding=NoPadding(),  # NoPadding() is an unexported sentinel instance
        padlast::Bool=false)
```
also with this approach the pre-1.0 release signatures could be kept as deprecated
Tuple wrapping the two data vectors provided for two-arg window_fns is less compelling than is tuple wrapping for 3..12 arg window_fns. Clients working with timestamped and time-aggregated (the reading at 5min intervals or the trimmed mean taken through 5min intervals) financial market data often obtain many field values for every stamped or marked time.
These 6 data items provide market information about the current trading day as it concludes: PriorClose, Open, High, Low, Close, Volume. Estimators, Indicators, and Accuracy/Loss Measures are developed through their Arithmetic and Algebraic interconnections. The more compelling use provides clients with a semantically consistent, syntactically uniform, generalized, and robust mechanism for working with the whole of it through any parts partitioning deemed worthwhile.
Tuple wrapping the two data vectors provided for two-arg window_fns is less compelling than is tuple wrapping for 3..12 arg window_fns.
Can you please explain on an example what you mean exactly?
If we discuss convenience then I believe that not requiring to wrap values in a tuple is more convenient. The reason is that if you have a tuple you can easily splat it. However, if you do not have a tuple and it is required then you would need to create it.
Why IMO `window_span` is better as an early positional arg than as a kwarg: `window_fn` and `window_span` in

```julia
rolling(window_fn::Function, window_span::Int, args...; kwargs...)
```

are a natural subspecification pairing (`roll` or `run` some fn over some span for various data); it is convenient to simplify currying over the two when reusing an approach that has been developed for those specs (e.g. a 200-day exponential moving average).

There may be more than one data vector, and there may be one weighting for each data vector [multiple weight vecs is a feature that seems it would simplify some windowed summarizations]. Only one sort can be a Vararg. The data vecs are likely to change more often than the weight vecs. However, weight vecs are not always used, while data vecs are. Having the weight vec[s] available as a keyword arg simplifies dispatch somewhat -- so perhaps

```julia
rolling(window_fn::Function, window_span::Int, data::AbstractVector...;
        padding=NoPadding(),  # NoPadding() is an unexported sentinel instance
        weights::Union{AbstractVector, NTuple{N,AbstractVector}, Nothing}=nothing)
```
OK - I see your point. Let me ask on Slack on #data what users would think. Thank you!
Ah - and also we discussed that the `window_span` could be a vector (a different window span for different observations).
Just to bring over what I said on Slack:
I voted for the positional arg originally, as I thought footguns of the type `optimize(f, lower, upper, initial)`, where one can easily confuse the last three positional args, don't apply given that `window_span` is an integer. I didn't realize that both `window_span` and `data` can be vectors now, which somewhat increases the likelihood of footguns.

I then wondered whether there are any constraints on the arguments which could be used to error early and prevent users from silently getting wrong results; e.g. if we require `length(data) > length(window_span)`, that could be used to prevent users getting it wrong.
My assumption is that the vast majority of use cases is still the simple `rolling(f, 3, data)`, so it would be nice to keep the more succinct positional-arg version.
Would the integration of the equivalent of R `roll`'s `min_obs` be desirable? I for example have use cases where I want to roll over a window of given width, but still prefer to have non-missing values returned if less data than the width is present:
```r
> roll_sum(1:3, width = 2)
[1] NA 3 5
> roll_sum(1:3, width = 2, min_obs = 1)
[1] 1 3 5
> roll_sum(1:3, width = 4, min_obs = 2)
[1] NA 3 6
```
Note that in the last case, although it may seem wrong to specify a width longer than the actual data, such a situation may legitimately happen when such a rolling function is applied to a grouped DataFrame whose group lengths aren't guaranteed to be >= `window_span`.
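For reference, the `min_obs` semantics can be sketched in Julia terms like this (hypothetical kwarg; not the package API):

```julia
# Windows with fewer than `min_obs` observations yield `missing`;
# otherwise the (possibly shortened) window is summarized anyway.
function run_min_obs(fn, data, width; min_obs = width)
    map(eachindex(data)) do i
        lo = max(firstindex(data), i - width + 1)
        i - lo + 1 >= min_obs ? fn(view(data, lo:i)) : missing
    end
end

run_min_obs(sum, 1:3, 2)               # [missing, 3, 5]
run_min_obs(sum, 1:3, 2; min_obs = 1)  # [1, 3, 5]
run_min_obs(sum, 1:3, 4; min_obs = 2)  # [missing, 3, 6]
```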
To roll a function F over a data vector of length N using a window_span of W where W > N is the same as applying F to the (entire) data vector. That is reasonable iff N is large enough that F(data) is numeric (e.g. N > k for some k that depends on the nature of F -- which of F([]), F([x]), F([x1,x2]) is legit). The responsibility would be on the caller.
Note that there are analytic issues with doing this; consider situations where the result is compared or combined with other results that have been applied to data of length N >= W. This "feature" should not be used in such situations. I think it may be overkill to provide a keyword arg to permit this -- so it would be all on the user. For less experienced users, that is problematic. What do you think?
from @nilshg: Maybe my point wasn't clear - `length(window_span) < length(data)` to me means the vector of window spans is shorter than the data, i.e. it wouldn't make sense to supply 10 window lengths if I only have 5 observations.
Functions will ignore any extra spans (spans in a vector of window spans beyond where `cumsum(spans)` exceeds `length(data)`). If the last useful span extends beyond `length(data)`, either the prior span could be extended, or that last span could be shrunk, or that last span could be trimmed. Preferences?
I'm not strongly opinionated on this; I just thought that `min_obs` could be a nice-to-have flexibility. Since it's meant to be an optional kwarg that defaults to `width`, I wouldn't expect this to invite so much foot-shooting, as the user needs to be intentional in using this option.
There was no post giving an example for wanting to process distinct window_spans while traversing a vector (e.g. the first 12 obs, then the next 18 obs ..). My guess is that the request was to process a sequence of window_spans, one-at-a-time, without explicitly coding the loop.
If it is not clear how `window_spans` as a vector should be processed now, I guess we can skip it, but I would propose to have a design that ensures it can be added (in general: whatever is available in R or pandas will sooner or later be asked for, from my DataFrames.jl experience 😄).

One way to allow support for that: there would be a kwarg `window_spans` or some such that would override the positional arg if it were set to a vector or tuple. Otherwise, it becomes difficult to tell the data vectors from the span vector (without always making the span a kwarg -- which we discussed above).
Example: it would be nice to add a kwarg to `roll*` to allow such padding, e.g. by writing `rollmean(1:5, 3; pad=missing)`.