JeffreySarnoff / RollingFunctions.jl

Roll a window over data; apply a function over the window.
MIT License

version 1 #33

Open JeffreySarnoff opened 1 year ago

JeffreySarnoff commented 1 year ago

This is for discussion of issues specific to the design of version 1. @bkamins

JeffreySarnoff commented 1 year ago

@bkamins Version 1 is where the padding stuff happens. Along the way, a few questions arose.

I am adopting an "accumulator" approach for statistics that can be computed incrementally, e.g. accum = AccMax() or accum = AccMeanVar().

In the first example, with each new x the current value of the accumulator is immediately available and can be returned with little overhead. In the second example, variance is computed using fields internal to the accumulator. With each new x the current value of the accumulator is a 2-tuple (mean, variance); it is not immediately available, so returning it has some overhead.

For all accumulators, accum() does provide the current value[s] being accumulated. Is it better to return nothing for all calls like accum(x) and let the client get the current value[s] with accum() when desired, or to return accum() at the end of accum(x)?
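
To make the trade-off concrete, here is a minimal sketch of the two kinds of accumulator described above. The struct layouts, the field names, and the choice of Welford's update for the variance are my assumptions for illustration; they are not taken from the package. In this sketch accum(x) returns nothing, which is only one of the two options being asked about.

```julia
# Illustrative stand-ins only; field names and layout are assumptions.

# The current value is stored directly, so reading it costs nothing extra.
mutable struct AccMax
    value::Float64
    AccMax() = new(-Inf)
end

(acc::AccMax)() = acc.value
(acc::AccMax)(x) = (acc.value = max(acc.value, x); nothing)

# The variance is derived from internal fields each time it is requested.
mutable struct AccMeanVar
    n::Int
    mean::Float64
    m2::Float64   # running sum of squared deviations (Welford's method)
    AccMeanVar() = new(0, 0.0, 0.0)
end

# Reading the value does a little work: the sample variance is computed here.
(acc::AccMeanVar)() = (acc.mean, acc.n > 1 ? acc.m2 / (acc.n - 1) : 0.0)

function (acc::AccMeanVar)(x)
    acc.n += 1
    delta = x - acc.mean
    acc.mean += delta / acc.n
    acc.m2 += delta * (x - acc.mean)
    return nothing
end

amax = AccMax()
ameanvar = AccMeanVar()
for x in (1.0, 2.0, 3.0, 4.0)
    amax(x)
    ameanvar(x)
end
```

After the loop, amax() reads a stored field, while ameanvar() has to build the (mean, variance) tuple, which is the overhead in question.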

bkamins commented 1 year ago

What you plan for is very similar to https://github.com/joshday/OnlineStats.jl, so you might want to have a look at that implementation also.

Regarding your question - I think accum(x) should not return the statistic, as it would add overhead, as you note. Instead, I think the natural thing to do would be for accum(x) to return accum. The reason is that it makes chaining easiest. Also, you can then write accum(x)() if you want to get the value immediately. Finally, you would probably design custom printing for accum that always computes the statistic (as printing is expensive anyway). This means that when you call accum(x) in the REPL you get the value of the statistic printed anyway (but when accum(x) is not displayed, the statistic is not computed immediately).
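
A minimal sketch of that suggestion, assuming a hypothetical AccMean with made-up field names (not the package's definition): accum(x) returns the accumulator itself, accum() computes the statistic, and a custom Base.show computes it only when the object is displayed.

```julia
# Hypothetical accumulator used only to illustrate the calling convention.
mutable struct AccMean
    n::Int
    total::Float64
    AccMean() = new(0, 0.0)
end

# accum() computes and returns the current statistic on demand.
(acc::AccMean)() = acc.n == 0 ? NaN : acc.total / acc.n

# accum(x) only updates internal state and returns the accumulator itself.
function (acc::AccMean)(x)
    acc.n += 1
    acc.total += x
    return acc
end

# Custom printing computes the statistic, so the REPL still shows the value.
Base.show(io::IO, acc::AccMean) = print(io, "AccMean(", acc(), ")")

acc = AccMean()
acc(1.0)(2.0)(6.0)      # chaining works because each call returns acc
current = acc(3.0)()    # update, then read the value immediately: 3.0
```

In a loop that never displays the accumulator, no division happens until acc() is called.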

JeffreySarnoff commented 1 year ago

That's a good suggestion about accum(x)(). I am familiar with OnlineStats, and at first considered just using that pkg -- I would rather allow it than subsume it, there were some fit issues.

JeffreySarnoff commented 1 year ago

[@juliohm @bkamins] I am continuing the discussion at TableTransforms #121 here, to keep the information for RollingFunctions more contiguous.

Tables.jl, TableTransforms.jl, DataFrames.jl and .. understand e.g. xs::Vector{Float32} [where "understanding" is operational and abstractly applicative] as a realization of some AbstractColumn or RowAbstraction. That is helpful, as they come laden with capability, operational élan, reliability, and a dollop of correctness.

My perspective on rolling functions over windows into data and transformations is that none of the following should be excluded and all should be similarly constructable and useable in a shared way.

In addition, the ability to pre- or post- pad with given value or with a sequence of determinate values (tapering) must be available and essentially effortless.
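
As a rough illustration of the padding requirement, the sketch below rolls a function over a vector and then fills the missing positions either at the start or at the end, with a single value or with a supplied sequence. The names roll_plain and roll_padded and the keywords pad and at_end are hypothetical; they are not the RollingFunctions API, and the sketch assumes a supplied sequence has at least window - 1 elements.

```julia
# Hypothetical helpers for illustration; not the package's functions.

# Apply f to each full window; the result has length(xs) - window + 1 entries.
function roll_plain(f, xs::AbstractVector, window::Int)
    return [f(view(xs, i:i+window-1)) for i in 1:length(xs)-window+1]
end

# Pad the rolled result back to length(xs).
# `pad` is either one value (repeated) or a vector of values used in order.
function roll_padded(f, xs::AbstractVector, window::Int; pad=missing, at_end::Bool=false)
    rolled = roll_plain(f, xs, window)
    npad = length(xs) - length(rolled)
    pads = pad isa AbstractVector ? pad[1:npad] : fill(pad, npad)
    return at_end ? vcat(rolled, pads) : vcat(pads, rolled)
end

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
front = roll_padded(sum, xs, 3; pad=0.0)               # [0.0, 0.0, 6.0, 9.0, 12.0]
back = roll_padded(sum, xs, 3; pad=0.0, at_end=true)   # [6.0, 9.0, 12.0, 0.0, 0.0]
seq = roll_padded(sum, xs, 3; pad=[-1.0, -2.0])        # [-1.0, -2.0, 6.0, 9.0, 12.0]
```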

The first level of rollable functions are direct implementations of incremental algorithms that update the functional value (e.g. a descriptive statistic) with each next step within the windowed data. Each of these is performant. What is necessary, both for this package and for seamless use with DataFrames and TableTransforms, is to support melding two or more first level capabilities (e.g. incremental updating of the extrema, the mean, and an exponentially weighted mean) rather than simply stacking them. Wrapping them in a pipe that pumps each new observation through the shared API would work and offers the potential to use multiple threads effectively.
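
One way to picture the "pipe" idea is a wrapper that holds several accumulators and pushes each observation through all of them in a single pass. This is only a sketch under my own assumptions: Melded, meld, and the two small accumulators are hypothetical stand-ins, it is single-threaded, and it says nothing about how the package actually melds statistics.

```julia
# Hypothetical accumulators sharing one calling convention:
# acc(x) updates and returns acc, acc() returns the current value.
mutable struct AccExtrema
    lo::Float64
    hi::Float64
    AccExtrema() = new(Inf, -Inf)
end
(acc::AccExtrema)() = (acc.lo, acc.hi)
function (acc::AccExtrema)(x)
    acc.lo = min(acc.lo, x)
    acc.hi = max(acc.hi, x)
    return acc
end

mutable struct AccMean
    n::Int
    total::Float64
    AccMean() = new(0, 0.0)
end
(acc::AccMean)() = acc.total / acc.n
function (acc::AccMean)(x)
    acc.n += 1
    acc.total += x
    return acc
end

# The wrapper: one observation in, every wrapped accumulator updated.
struct Melded{T<:Tuple}
    accs::T
end
meld(accs...) = Melded(accs)

function (m::Melded)(x)
    foreach(acc -> acc(x), m.accs)
    return m
end
(m::Melded)() = map(acc -> acc(), m.accs)

melded = meld(AccExtrema(), AccMean())
for x in (2.0, -1.0, 5.0)
    melded(x)
end
result = melded()   # ((-1.0, 5.0), 2.0)
```

Because the wrapper follows the same accum(x) / accum() convention as its parts, it can be used wherever a single accumulator is expected.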

OnlineStats.jl is not restricted to incremental updating, covers most of the first level descriptive functions in a similar way, and many more. It is important to let those stats be used (made rollable). The intent of that package is to process and provide with a single look at the items within a data[stream]. Rolling over windowed data involves structural subsequences by definition. So there is interplay, and smooth interuse takes careful consideration. @joshday

JeffreySarnoff commented 1 year ago

My current approach to incremental stats is shown for rolling minimum in this gist.

JeffreySarnoff commented 1 year ago

This gist shows my current approach to incremental stats with optional stream element preprocessing, again using rolling minimum.

JeffreySarnoff commented 1 year ago

These are the "single stepped incremental statistics" available. I need to add some two+ argument incremental stats (cov, corr) [open to suggestions: add? remove?]

    AccMinimum, AccMaximum, AccExtrema,
    AccSum, AccProd,
    AccMean, AccGeoMean, AccHarmMean,
    AccMeanVar, AccMeanStd, AccStats,

    AccMinimumAbs, AccMaximumAbs, AccExtremaAbs,
    AccSumAbs, AccProdAbs,

    # exponentially weighted versions
    # these also initialize  α, the decay parameter
    #    either directly or via span, halflife, center of mass
    #    there is auto initialization logic too

    AccMinimumEW, AccMaximumEW, AccExtremaEW,
    AccSumEW, AccProdEW,
    AccMeanEW, AccGeoMeanEW, AccHarmMeanEW,
    AccMeanVarEW, AccMeanStdEW, AccStatsEW,

    AccMinimumAbsEW, AccMaximumAbsEW, AccExtremaAbsEW,
    AccSumAbsEW, AccProdAbsEW,
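
For the exponentially weighted versions, the sketch below shows one common way to derive the decay parameter α from span, halflife, or center of mass. The conversion formulas follow the convention used by pandas' ewm, which is my assumption here; the thread does not say which convention the package uses. EWMeanSketch and the alpha_from_* helpers are hypothetical names, and the update is the plain recursive form seeded with the first observation.

```julia
# Conversions to the decay parameter α (pandas-style convention, assumed).
alpha_from_span(span) = 2 / (span + 1)
alpha_from_com(com) = 1 / (1 + com)
alpha_from_halflife(halflife) = 1 - exp(-log(2) / halflife)

# Hypothetical exponentially weighted mean accumulator.
mutable struct EWMeanSketch
    alpha::Float64
    n::Int
    value::Float64
    EWMeanSketch(alpha) = new(alpha, 0, 0.0)
end

(acc::EWMeanSketch)() = acc.value

function (acc::EWMeanSketch)(x)
    acc.n += 1
    # The first observation seeds the mean; later ones are blended in with weight α.
    acc.value = acc.n == 1 ? x : acc.alpha * x + (1 - acc.alpha) * acc.value
    return acc
end

ew = EWMeanSketch(alpha_from_span(3))   # α = 0.5
ew(2.0)(4.0)(8.0)
ewvalue = ew()                          # 5.5
```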

Not shown (not yet finalized, part of the windowing facilities) is the use of other / arbitrary normalized weights.

JeffreySarnoff commented 1 year ago

This shows an accumulator with local short term memory.