JeffreySarnoff / RollingFunctions.jl

Roll a window over data; apply a function over the window.
MIT License
114 stars 6 forks source link

Add support for padding with sentinel value for roll* #24

Open bkamins opened 1 year ago

bkamins commented 1 year ago

Example:

julia> rollmean(1:5, 3)
3-element Vector{Float64}:
 2.0
 3.0
 4.0

julia> runmean(1:5, 3)
5-element Vector{Float64}:
 1.0
 1.5
 2.0
 3.0
 4.0

julia> [fill(missing, 2); rollmean(1:5, 3)]
5-element Vector{Union{Missing, Float64}}:
  missing
  missing
 2.0
 3.0
 4.0

It would be nice to add a kwarg to roll* to allow such padding, e.g. by writing rollmean(1:5, 3; pad=missing).

JeffreySarnoff commented 1 year ago

What would the sentinel value do, simply provide the value with which to pad? Is this different from the idea put forth on Discourse, or is it a reminder (which would be fine)? If it is different, please recap what the other functionality is -- how it ought behave, Thanks.

bkamins commented 1 year ago

This is a new idea. The point is that e.g. rollmean shortens a vector. While runmean keeps the vector length, but shortens the window on initial values. Often one needs rollmean (i.e. no window shortening), but also to keep the original length of the vector (e.g. if you store it in a data frame that has a fixed number of rows). In such cases you need to add some sentinel (usually missing to make the vector returned by rollmean have the original length)

JeffreySarnoff commented 1 year ago

I can see how missing may want to occupy the low indices (in the past) or may want to occupy the high indices (current data, perhaps unresolvable with fewer observations than the window length), does this filler ever have reason to be within the data sequence rather than at the start or at the end? Does this filler ever have reason to be interspersed with the observations -- if so, to what end?

bkamins commented 1 year ago

This is how pandas does it in https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html: image (I did not request for this full functionality, but something simpler, but if you want to investigate it then this is I think a standard reference)

In short:

JeffreySarnoff commented 1 year ago

Introducing missing suggests that each application of a rolled function must first check ismissing or dispatch differently on encountering x::Missing. I could keep a displacement count instead, that would have no slow-down. If the displacement were signed to indicate from which place to offset, the early indices or the recent indices, then perhaps your request would be met without disrupting processing. For situations where there may be missing entries within the body of the data -- no events to observe on a weekend, but all dates are tracked -- a different approach is required. Maybe a choice of LOCF or avg of most recent 2/3/? observations? Any thoughts.

JeffreySarnoff commented 1 year ago

I prefer keeping advances to that which can be stated simply, and having specific requests gel together. Nonetheless, there will be resituating the underlying computational-ness required. So, best to have a clear view of the essentials.

Is the discourse item replaced by this refinement, or is there a distinction I may have forgotten?

bkamins commented 1 year ago

So my original proposal was an essential. Currently user needs to do padding manually:

julia> [fill(missing, 2); rollmean(1:5, 3)] # fill at front
5-element Vector{Union{Missing, Float64}}:
  missing
  missing
 2.0
 3.0
 4.0

or

julia> [missing; rollmean(1:5, 3); missing] # centered
5-element Vector{Union{Missing, Float64}}:
  missing
  missing
 2.0
 3.0
 4.0

This is quite inconvenient, especially for new users of Julia.

What I would think makes sense:


This is not a restatement of https://discourse.julialang.org/t/rollingfunctions-with-a-variable-window-width/87601. In that thread OP wanted to be able to specify different window width for each observation. So this was a different requirement (this issue is a requirement for padding not for variable width).

JeffreySarnoff commented 1 year ago

I don't know how to read "specify a different window for each observation." Windows span multiple observations by definition. Does this mean at each observation use however many earlier observations to determine the current function value and that however many may be unique to each observation [index]? If so, "no". Your request above is much more widely useful.

JeffreySarnoff commented 1 year ago

Is there a reason to actually pad rather than know where to begin/end? The data sequence belongs to the client. I would be more comfortable providing a very easy, perhaps nearly invisible way to copy with padding inserted rather than grab the information and alter its e.g. type. A keyword or two would allow this. Another function that actually modifies the source data sequence could be exported for advanced users.

roll(fn, window, dataseq; pad = :first, skip = 20) roll(fn, window, dataseq; padding = (from = :first, count = 20))

then currying or structifying (fn, window) is possible

bkamins commented 1 year ago

Does this mean at each observation use however many earlier observations to determine the current function value and that however many may be unique to each observation [index]?

This is what OP on Discourse wanted.

bkamins commented 1 year ago

Is there a reason to actually pad rather than know where to begin/end?

This is also OK, but would require an additional step later. The point is often users want to do something like:

julia> df = DataFrame(a=1:5)
5×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     3
   4 │     4
   5 │     5

julia> df.m = rollmean(df.a, 2)
ERROR: ArgumentError: New columns must have the same length as old columns

and now, as you see, they get an error.

They could do:

julia> df.m = runmean(df.a, 2)
5-element Vector{Float64}:
 1.0
 1.5
 2.5
 3.5
 4.5

julia> df
5×2 DataFrame
 Row │ a      m
     │ Int64  Float64
─────┼────────────────
   1 │     1      1.0
   2 │     2      1.5
   3 │     3      2.5
   4 │     4      3.5
   5 │     5      4.5

but often they want the first observation that does not have a full window size to be some sentinel (like missing) rather than applied function just with a shorter window.

JeffreySarnoff commented 1 year ago

ahh. ic

kpa28-git commented 1 year ago

@bkamins I'm developing a fork of RollingFunctions that has padded rolling function mappadroll: https://github.com/kevindirect/MapSlide.jl It's a WIP, but it does work.

bkamins commented 1 year ago

Thank you for this work.

@JeffreySarnoff - what is the status of your efforts towards 1.0 release of RollingFunctions.jl?

JeffreySarnoff commented 1 year ago

reasonably good -- external constraints put the prerelease at Saturday Jan 21 and the release (or release candidate) at Sunday Feb 5th.

bkamins commented 1 year ago

@JeffreySarnoff - I saw a mention in this issue, but now I do not see it, so I am not sure what it was exactly. However, let me re-state the original request. When using rolling functions to get the result of the same length as input vector padded with some sentinel (missing is a reasonable default).

JeffreySarnoff commented 1 year ago

Got that.. What is bugging me is that I cannot process your views without including all of data frames. Would you consider a future release that makes separate a least necessary package DataFramesCore.jl that would do just enough so I could use it and define expansive_view = Union{AbstractArray, AbstractDataFrame}?

On Wed, Jan 18, 2023, 6:31 AM Bogumił Kamiński @.***> wrote:

@JeffreySarnoff https://github.com/JeffreySarnoff - I saw a mention in this issue, but now I do not see it, so I am not sure what it was exactly. However, let me re-state the original request. When using rolling functions to get the result of the same length as input vector padded with some sentinel (missing is a reasonable default).

— Reply to this email directly, view it on GitHub https://github.com/JeffreySarnoff/RollingFunctions.jl/issues/24#issuecomment-1386906314, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAM2VRXS7B4ODKSAQGI3HBLWS7H7TANCNFSM6AAAAAAQ5IWG7M . You are receiving this because you were mentioned.Message ID: @.***>

bkamins commented 1 year ago

What is bugging me is that I cannot process your views without including all of data frames.

Can you please explain the problem? I would assume that RollingFunctions.jl does not need to be aware of DataFrames.jl at all. It takes a vector as an input and should produce vector in output.

JeffreySarnoff commented 1 year ago

That is true. People have requested that it work with DataFrame column views, and just for general symmetry -- that we have two well developed sorts of view and there is no common superview type nor a good way to fake it with a Union is less than ideal. I will forget about it though.

bkamins commented 1 year ago

Indeed you can get a view as an input to your rolling function. But it should not matter I thought. I assumed that I would have a signature that restricts input type to AbstractVector without caring if it is a view or not.

JeffreySarnoff commented 1 year ago

Yes, I use AbstractVector{T} and also AbstractMatrix{T} (for running some function over multiple columns at one go).

bkamins commented 1 year ago

And why this is not enough?

JeffreySarnoff commented 1 year ago

these are vector examples (fyi)

JeffreySarnoff commented 1 year ago

it is -- really it all goes well. I have been musing about some shallow type hierarchies for Julia v2.

bkamins commented 1 year ago

Thank you, https://github.com/JeffreySarnoff/RollingFunctions.jl/blob/v1/docs/src/intro/padding.md seems exactly what is needed (in other frameworks there is also an option to pad in the middle so that half of padding is in front and half of padding is at end).

Conceptually this means that if we move the window is the observation to which we attach the result in:

Padding to the center can be added later if it is a problem (but it would be good to keep in mind that people sometimes want to say that Tuesday is a mean of Monday, Tuesday and Wednesday).

Regarding the types challenge - I discussed the issue with Jeff Bezanson yesterday - and it is a tough thing.

JeffreySarnoff commented 1 year ago

with padding in the middle and an odd number of paddings, which end gets more or is that another settable choice?

JeffreySarnoff commented 1 year ago

I had considered letting people specify n1 pads at the start and n2 pads at the end -- that seemed an opportunity for people to get it wrong.

bkamins commented 1 year ago

which end gets more or is that another settable choice?

😄. We seem to think alike. I was copying the functionality that people asked for, but when I did this I started thinking about exactly the same problem. Maybe a more general solution would be to add a numeric padding_offset (tentative name). Where when it is set to 0 (the default and minimum allowed size) we get padding first, and if you set it to window_size-1 (the maximum allowed size) you get padding last.

(now I see it is similar to what you said with n1 and n2, but I think it is clearer to use window_size and offset).

JeffreySarnoff commented 1 year ago

At the next-to-last minute, I decided to keep compatibility with current code. Realigning to that -- still expect to unveil on Sunday Feb 5th.

JeffreySarnoff commented 1 year ago

Here be something (unregistered while I heal).

using Pkg
Pkg.add(url="https://github.com/JeffreySarnoff/RollingFunctions.jl",rev="main")

The entire edifice has been reworked.
You can pad at the start or at the end with almost any value. You can copy padding values from a vector, in order. This helps with connecting one roll to another. This keeps the result clean when a function is not well defined until it has more than 1, 2, .. values.

The current docs reflect the tests and other working details in need of tests.

IncrementalAccumulators exists to be incorporated, and will replace the older implementations (rollmean, ...).



this would add substantive performance

I have not been able to figure out how to get @turbo to work with this. Experiments show @turbo would improve the performance of many internal functions by 2x-3x. In the example below, data is a view, and using @views on the result line adds some throughput to the more direct internal functions.

lo = 1
hi = window_span
@turbo for idx in 1:n
  result[idx] = window_fn(data[lo:hi])
  lo += 1
  hi += 1
end

fails with

ERROR: MethodError: no method matching 
    _gesp(::LayoutPointers.StridedPointer{Float64, 1, …}, 
              ::Static.StaticInt{1}, ::UnitRange{Int64}, ::Static.StaticInt{1})
bkamins commented 1 year ago

Thank you for working on this. I have checked the code and have the following comments:

Sentinel value in padding

You use Nothing as sentinel. It would be better to define an internal:

struct NoPadding end

structure and set NoPadding() as a default (and do not export NoPadding).

The reason is that Nothing is a valid value in Base Julia. In a case when someone wanted to pad with it it will not work as expected (I agree it is a corner case, but I usually prefer a clean design):

julia> rolling(sum, [1, 2, 3], 2, padding=Missing)
3-element Vector{Union{Int64, DataType}}:
  Missing
 3
 5

julia> rolling(sum, [1, 2, 3], 2, padding=Nothing) # unexpected
2-element Vector{Int64}:
 3
 5

In some cases rolling errors unexpectedly

Here is an example:

julia> rollmax([1, 2, 3], 2)
2-element Vector{Int64}:
 2
 3

julia> rolling(max, [1, 2, 3], 2)
ERROR: BoundsError: attempt to access 0-element Vector{Any} at index [1]

I would have expected these two operations to be equivalent (but I might be wrong).

padding is supported only for rolling

Example:

julia> rolling(mean, [1, 2, 3], 2, padding=missing)
3-element Vector{Union{Missing, Float64}}:
  missing
 1.5
 2.5

julia> rollmean([1, 2, 3], 2, padding=missing)
ERROR: MethodError: no method matching rollmean(::Vector{Int64}, ::Int64; padding::Missing)

I would expect padding and padlast work in both cases

In general functions lack docstrings

Example:

help?> rolling
search: rolling RollingFunctions rollmin rollmedian rollvariation rollkurtosis rollmad_normalized

  No documentation found.

  RollingFunctions.rolling is a Function.

I think it would be very useful to add them.


If you would like my help with any of these points please comment and I will gladly do it.

JeffreySarnoff commented 1 year ago

supplemental information attached (to be updated on occasion)

RollingFunctions.jl v1

notes for JeffreySarnoff, bkamins 2023-02-15

Looking Back

tapering

When I researched rolling functions over windowed data initially (2017), there were packages and products that provided padding with a given value. I saw nothing about tapering to provide values for incomplete window spans. To the best of my knowledge (without claiming anything more) RunningFunctions.jl introduced efficient, well-structured, and functionally broad tapering to many. Now, this capability appears in many packages and in multiple languages.

genericity

The implementation followed a few stylized approaches, allowing reasonably concise and compact code to cover many sorts of windowed summarization.

omissions

most often noted

others of import


Looking Forward

design goals

api consistancy

argument order

exportedfunction(summarizer, datasequence, windowspan)

Many clients will use some preferred set of windowspans with some selected set of summarization functions, each specialized to the nature of the datasequence. This suggests that the datasequence appear after the windowspan and the summarizer args. These two orderings meet that constraint. Which is preferable in use with respect to currying and maybe some use of closures?

exportedfunction(summarizer, windowspan, datasequence)
exportedfunction(windowspan, summarizer, datasequence)

performance

Technical Notes

taper

pad

pad then taper

consistant relationships
- the unit is _tally of successive indices_

n = length(data[column])

s = window_span
u = window_prespan == s - 1

p = count of indices to pad (0 or initial indices in 1..u)
t = count of indices to taper (0 or u-p indices adjacent to the initial p)

u == p + t   ∧   p, t ∈ [0..u]
JeffreySarnoff commented 1 year ago

re: padding is supported only for rolling This was intentional (not the lack of support) -- the prior versions of roll<op> (v0.7) are to be improved usiing implementation logic from IncrementalAccumulators. These should perform much better.

They all are of a consistent, flexible design and share a nice api (thank you).

JeffreySarnoff commented 1 year ago

The new, more capable api is call signature compatible with release v0.7.x -- that is, call signatures used with high level exports of v0.7,x (rolling and running) yield very nearly the same results when used with the new api as it is specified at present.

We could support greater generality with fewer lines of code by changing from distinguishing functions of arity1, 2, 3, and 4 by dispatch over 1, 2, 3, or 4 vectors to a Tuple form and capturing the number of args (as given by N, the number of vectors):

op( windowfunc:::Function, windowspan:::Integer, 
      data::Tuple{Vararg{<:AbstractVector,N}})  where {N}

We would code unary window functions explicitly: (a) providing well-grounded reference implementations code (b) providing evaluative optimizations and applicative speed-ups.

op( windowfunc:::Function, windowspan:::Integer, data::AbstractVector[T})  where {T}

And we could do the same with window functions of 2 args, as e.g. cov, cor are widely used so some code optimization makes sense. While we may choose code 2 arg functions explicitly, that decision is not complexity driven. Using a generalized , shared signature implementation for 2+ arg functions may make better sense. Either way, we process the generalized input using looping with reductive computation in concert with gobbling observations to supply the args simultaneous-ish-ly.

Gaining performance (2x-3x in throughput generally, 5x-10x and in reappearing client situations) is available (sometimes not straightforward, sometimes requires additional dev work with third party packages). Work applying e.g. LoopVectorization, SIMD, ..? ?.. to our task is necessary.

JeffreySarnoff commented 1 year ago

@bkamins The previous note explains my remaining high-level API decision for v1.

(+) I prefer to support rolling/running a multi-arity functions that way. (-) Clients now rolling cov or cor would need to wrap data as 2-tuples.

 - (old) `cor(window_span, datavec1, datavec2)
 - (new) `cor(window_span, (datavec1, datavec2))

What do see?

bkamins commented 1 year ago

I am not fully clear what the benefit of cor(window_span, (datavec1, datavec2)) over cor(window_span, datavec1, datavec2) is.

Two considerations:

  1. With Tuple notation it seems not consistent if you should pass a single vector unwrapped or wrapped in a tuple like this (vector,).
  2. Also you do not mention how you would want to pass weight vectors, which I think also should be thought of.
  3. Finally, I think we should try to ensure that we can deprecate old syntax (but keep supporting it with depwarn + add a new syntax).

For example, looking at the signature of rolling I imagine that the way you could define new signature could be:

rolling(window_fn<:Function, data::AbstractVector...;
        window_span::Int, # kwarg, required
        weights::Union{AbstractVector, Nothing}=nothing, # kwarg, optional
        padding=NoPadding(): # NoPadding unexported sentinel type instance,
        padlast::Bool=false)

also with this approach the pre-1.0 release signatures could be kept as deprecated

JeffreySarnoff commented 1 year ago

Tuple wrapping the two data vectors provided for two-arg window_fns is less compelling than is tuple wrapping for 3..12 arg window_fns. Clients working with timestamped and time-aggregated (the reading at 5min intervals or the trimmed mean taken through 5min intervals) financial market data often obtain many field values for every stamped or marked time.

These 6 data items provide market information about the current trading day as it concludes: PriorClose, Open, High, Low, Close, Volume. Estimators, Indicators, and Accuracy/Loss Measures are developed through their Arithmetic and Algebraic interconnections. The more compelling use provides clients with a semantically consistent, syntactically uniform, generalized, and robust mechanism for working with the whole of it through any parts partitioning deemed worthwhile.

bkamins commented 1 year ago

Tuple wrapping the two data vectors provided for two-arg window_fns is less compelling than is tuple wrapping for 3..12 arg window_fns.

Can you please explain on an example what you mean exactly?

If we discuss convenience then I believe that not requiring to wrap values in a tuple is more convenient. The reason is that if you have a tuple you can easily splat it. However, if you do not have a tuple and it is required then you would need to create it.

JeffreySarnoff commented 1 year ago

Why imo window_span is better as an early positional arg than as a kwarg: window_fn and window_span in rolling(window_fn<:Function, window_span::Int, args...; kwargs...) are a natural subspecification pairing (roll or run some fn over some span for various data) it is convenient to simplify currying over the two when reusing an approach that has been developed for those specs (e.g. a 200-day exponential moving average).

There may be more than one data vector and there may be one weighting for each data vector [multiple weight vecs is a feature that seems would simplify some windowed summarizations]. Only one sort can be a Vararg. The data vecs are likely to change more often than the weight vecs. However weight vecs are not always used, while data vecs are. Having the weight vec[s] available as a positional arg simplifies dispatch somewhat -- so perhaps

rolling(window_fn<:Function, window_span::Int, data::AbstractVector...; padding=NoPadding(): # NoPadding unexported sentinel type instance, weights::Union{AbstractVector, NTuple{N,AbstractVector}, Nothing}=nothing, # kwarg, optional )

bkamins commented 1 year ago

OK - I see your point. Let me ask on Slack on #data what users would think. Thank you!

bkamins commented 1 year ago

Ah - and also we discussed that the window_span could be a vector (different window span for different observations).

nilshg commented 1 year ago

Just to bring over what I said on Slack:

I voted for positional arg originally, as I thought footguns of the type optimize(f, lower, upper, initial) where one easily can confuse the three last positional args don't apply given that window_span is an integer. I didn't realize that both window_span and data can be vectors now, somewhat increasing the likelihood of footguns.

I then wondered whether there are any constraints about the arguments which could be used to error early and prevent users from silently getting wrong results, e.g. if we require length(data) > length(window_span) that could be used to prevent users getting it wrong.

My assumption is that the vast majority of use cases is still the simple rolling(f, 3, data) so it would be nice to keep the more succinct positional arg version.

jeremiedb commented 1 year ago

Would the integration of the equivalent of R's roll's min_obs be desirable? I for example have use cases where I want to roll over a length of given width, but still prefer to have non missing returned if less data than the width is present:

> roll_sum(1:3, width = 2)
[1] NA  3  5
> roll_sum(1:3, width = 2, min_obs = 1)
[1] 1 3 5
> roll_sum(1:3, width = 4, min_obs = 2)
[1] NA  3  6

Note that in the last case, although it may seem wrong to specify a width longer than the actual data, such situation may legitimately happens in cases where such rolling function is passed to a grouped DataFrame whose length isn't guaranteed to be >= window_span.

JeffreySarnoff commented 1 year ago

To roll a function F over a data vector of length N using a window_span of W where W > N is the same as applying F to the (entire) data vector. That is reasonable iff N is large enough so F(data) is numeric (e.g. N > k for some k that depends on the nature of F .. which of F([]), F([x]), F([x1,x2]) is legit). The responsibility would be on the caller.


Note that there are analytic issues with doing this; consider situations where the result is compared or combined with other results that have been applied to data of length N >= W. This "feature" should not be used in such situations. I think it may be overkill to provide a keyword arg to permit this -- so it would be all on the user. For less experienced users, that is problematic. What do you think?

JeffreySarnoff commented 1 year ago

from @nilshg : Maybe my point wasn't clear - length(window_span) < length(data) to me means the vector of window spans is shorter than the data, i.e. it wouldn't make sense to supply 10 window lenghts if I only have 5 observations.

Functions will ignore any extra spans (spans in a vector of window spans beyond where cumsum(spans) exceeds length(data). If the last useful span extends beyond length(data), either the prior span could be extended or the that last span could be shrunk or that last span could be trimmed. Preferences?

jeremiedb commented 1 year ago

I'm not strongly opiniated on this, just thought that min_obs could be a nice to have flexibility. Since it's meant to be an optional kwarg that defaults to width, I wouldn't expect this to be exposed so much to foot shooting as the user needs to be intentional in its usage of this option.

JeffreySarnoff commented 1 year ago

There was no post giving an example for wanting to process distinct window_spans while traversing a vector (e.g. the first 12 obs, then the next 18 obs ..). My guess is that the request was to process a sequence of window_spans, one-at-a-time, without explicitly coding the loop.

bkamins commented 1 year ago

If it is not clear how winow_spans as a vector should be processed now I guess we can skip it, but I would propose to have a design that ensures it can be added (in general: whatever is available in R or pandas sooner or later will be asked for from my DataFrames.jl experience 😄).

JeffreySarnoff commented 1 year ago

One way to allow support for that: there would be a kwarg windows_spans or some such that would override the positional arg if it were set to a vector or tuple.

JeffreySarnoff commented 1 year ago

Otherwise, it becomes difficult to know the data vectors from the span vector (without always making the span a kwarg -- which we discussed above)