jangorecki opened 6 years ago
Proposed rollmean implementation, simplified:
x = data.table(v1=1:5, v2=1:5)
k = c(2, 3)
# i - single column
# j - single window size
# m - integer referring to a single row
# w - current row's sum of the rolling window
# r - answer for each i, j
for i in x
  for j in k
    r = NA_real_
    w = 0
    for m in 1:length(i)
      w = w + i[m]
      w = w - i[m-j]
      r[m] = w / j
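The pseudocode above translates into a small base-R sketch (this is only an illustration of the running-sum idea, not the actual data.table C implementation; `rollmean_sketch` is a hypothetical name, and it guards the window boundaries that the simplified pseudocode leaves implicit):

```r
# For each column and each window size, maintain a running sum of the
# current window and divide by the window width.
rollmean_sketch = function(x, k) {
  ans = list()
  for (i in seq_along(x)) {           # i - single column
    v = x[[i]]
    for (j in k) {                    # j - single window size
      r = rep(NA_real_, length(v))    # r - answer for this column/window
      w = 0                           # w - running sum of current window
      for (m in seq_along(v)) {
        w = w + v[m]                  # add the entering observation
        if (m > j) w = w - v[m - j]   # drop the leaving observation
        if (m >= j) r[m] = w / j      # complete windows only
      }
      ans[[paste0("v", i, "_k", j)]] = r
    }
  }
  ans
}
rollmean_sketch(list(v1 = 1:5, v2 = 1:5), k = c(2, 3))
```

Each input value is added once and subtracted once, so the cost is O(length * length(k)) rather than O(length * sum(k)) as with a naive re-sum per window.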
yes, and many more rolled functions follow the same basic idea (including rolling standard deviation, any expectation-based moment, and any function like rollproduct that uses the invertible * instead of + to aggregate within the window).
I always envisioned rolling window functionality as grouping the dataset into multiple overlapping groups (windows). Then the API would look something like this:
DT[i, j,
by = roll(width=5, align="center")]
Then if j
contains, say, mean(A)
, we can internally replace it with rollmean(A)
-- exactly like we are doing with gmean()
right now. Or j
can contain an arbitrarily complicated functionality (say, run a regression for each window), in which case we'd supply .SD
data.table to it -- exactly like we do with groups right now.
This way there's no need to introduce 10+ new functions, just one. And it feels data.table-y in spirit too.
yes, agree
@st-pasha interesting idea, and it does look data.table-y in spirit, but it would impose many limitations, and isn't really appropriate for this category of functions. The rollmean-style API proposed here supports, for example:
rolling mean by group:
DT[, rollmean(V1, 3), by=V2]
multiple calls with different window sizes:
DT[, .(rollmean(V1, 3), rollmean(V2, 100))]
use outside of [.data.table, as we now allow for shift:
rollmean(rnorm(10), 3)
vectorized columns and windows:
DT[, .(rollmean(list(V1, V2), c(5, 20)), rollmean(list(V2, V3), c(10, 30)))]
mean and rollmean in the same j call:
DT[, .(rollmean(V1, 3), mean(V1)), by=V2]
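For illustration, the grouped and mixed use cases above can be sketched with the frollmean() name under which the function was eventually merged (assuming data.table >= 1.12.0):

```r
library(data.table)
DT = data.table(V1 = 1:10, V2 = rep(c("a", "b"), each = 5L))
# rolling mean within each group; result has as many rows as the input
DT[, rm3 := frollmean(V1, 3), by = V2]
# a rolling and a non-rolling aggregate in the same j call
DT[, .(rm3 = frollmean(V1, 3), m = mean(V1)), by = V2]
```

Note that mean(V1) is recycled to the length of the rolling result within each group, exactly as for any other mixed-length j expression.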
Usually when using by
we aggregate data to a smaller number of rows, while rolling functions always return a vector of the same length as the input. These types of functions in SQL have an API in the following style:
SELECT AVG(value) OVER (ROWS BETWEEN 99 PRECEDING AND CURRENT ROW)
FROM tablename;
You can still combine it with GROUP BY as follows:
SELECT AVG(value) OVER (ROWS BETWEEN 99 PRECEDING AND CURRENT ROW)
FROM tablename
GROUP BY group_columns;
So in SQL those functions stay in SELECT, which corresponds to j in DT.
In DT we could achieve the same with:
DT[, rollmean(value, 100)]
DT[, rollmean(value, 100), group_columns]
Rolling functions fit into the same category of functions as shift, which also returns the same number of rows as it receives on input.
Shift in SQL looks like:
SELECT LAG(value, 1) OVER ()
FROM tablename;
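For comparison, the data.table counterpart of that SQL already exists via shift():

```r
library(data.table)
DT = data.table(value = c(10, 20, 30))
# equivalent of SELECT LAG(value, 1) OVER () FROM tablename
DT[, lag1 := shift(value, n = 1L, type = "lag")]
DT$lag1  # NA 10 20
```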
mean and rollmean are not just different functions, they are different categories of functions. One is meant to aggregate according to a group; the other does not aggregate at all. This is easily visible in SQL, where we don't use GROUP BY for rolling-type functions but we do need GROUP BY for aggregates like mean (eventually getting the grand total when the grouping clause is not present).
I don't see strong reasoning to apply the same optimization rules as we do for mean, especially when it doesn't really fit the use case, and all that just for the sake of data.table-y spirit. The current proposal is data.table-y in spirit too; it can easily be combined with :=, same as shift. It just adds a set of new functions currently not available in data.table.
@jangorecki Thanks, these are all valid considerations. Of course different people have different experiences, and different views as to what should be considered "natural".
It is possible to perform rollmean by group: this is just 2-level grouping: DT[, mean(V1), by=.(V2, roll(3))]. However, I don't see how to support different window sizes on different columns with my syntax...
I must admit I have never seen SQL syntax for rolling joins before. It's interesting that they use a standard aggregator such as AVG yet apply the windowing specification to it. Looking at the Transact-SQL documentation, there are some interesting ideas there, for example the distinction between logical/physical row selection. They do allow different "OVER" operators on different columns; however, in all the examples they give, it is the same OVER clause repeated multiple times. This suggests that that use case is much more common, and hence using a single roll() group would result in less repetition.
Also, this SO question provides an interesting insight why the OVER syntax was introduced in SQL at all:
You can use GROUP BY SalesOrderID. The difference is, with GROUP BY you can only have the aggregated values for the columns that are not included in GROUP BY. In contrast, using windowed aggregate functions instead of GROUP BY, you can retrieve both aggregated and non-aggregated values. That is, although you are not doing that in your example query, you could retrieve both individual OrderQty values and their sums, counts, averages etc. over groups of same SalesOrderIDs.
So it appears that the syntax is designed to circumvent the limitation of standard SQL where group-by results could not be combined with unaggregated values (i.e. selecting both A
and mean(A)
in the same expression). However data.table
does not have such a limitation, so it has more freedom in its choice of syntax.
Now, if we really want to get ahead of the curve, we need to think in a broader perspective: what are the "rolling" functions, what are they used for, how can they be extended, etc. Here's my take on this, coming from a statistician's point of view:
"Rolling mean" function is used to smooth some noisy input. Say, if you have observations over time and you want to have some notion of "average quantity", which would nevertheless vary over time although very slowly. In this case "rolling mean over last 100 observations" or "rolling mean over all previous observations" can be considered. Similarly, if you observe certain quantity over a range of inputs, you may smooth it out by applying "rolling mean over ±50 observations".
All of these can be implemented as extended grouping operators, with rolling windows being just one of the elements on this list. That being said, I don't see why we can't have it both ways.
I must admit I never seen SQL syntax for rolling joins before.
I assume you mean rolling functions, issue has nothing to do with rolling joins.
They do allow different "OVER" operators on different columns, however in all examples they give, it is the same OVER clause repeated multiple times. So it suggests that this use-case is much more common, and hence using a single roll() group would result in less repetition.
It is just a matter of use case: if you are calling the same OVER() many times, you may find it more performant to use GROUP BY, build a lookup table and re-use it in other queries. Whatever examples are there, repeating OVER() is required to retain the locality of each measure provided. My use cases from Data Warehouses were not as simple as those in the Microsoft docs.
In contrast, using windowed aggregate functions instead of GROUP BY, you can retrieve both aggregated and non-aggregated values.
In data.table we do :=
and by
in one query to achieve it.
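A minimal sketch of that pattern, borrowing the SalesOrderID/OrderQty names from the quoted SO answer:

```r
library(data.table)
DT = data.table(SalesOrderID = c(1L, 1L, 2L), OrderQty = c(2L, 3L, 4L))
# := with by appends the aggregate next to the unaggregated rows,
# like SUM(OrderQty) OVER (PARTITION BY SalesOrderID) in SQL
DT[, total := sum(OrderQty), by = SalesOrderID]
DT$total  # 5 5 4
```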
So it appears that the syntax is designed to circumvent the limitation of standard SQL where group-by results could not be combined with unaggregated values (i.e. selecting both A and mean(A) in the same expression). However data.table does not have such a limitation, so it has more freedom in its choice of syntax.
It isn't so much a limitation of SQL as just the design of GROUP BY, that it will aggregate, the same way that our by aggregates. A new API was required to cover the new window functionality. Grouping for a SQL window function can be provided per function call using FUN() OVER (PARTITION BY ...), where PARTITION BY is like local grouping for a single measure. So to achieve the flexibility of SQL we would need to use j = mean(V1, roll=5) or j = over(mean(V1), roll=5), keeping that API in j. Still, this approach will not support all the use cases mentioned above.
you may smooth it out by applying "rolling mean over ±50 observations".
This is what the align argument is used for.
So, the first extension is to look at "smooth windows": imagine a mean over past observations where the further an observation in the past, the less its contribution is. Or an average of nearby observations over a Gaussian kernel.
There are many variants (a virtually unlimited number) of moving averages; the most common smoothing window function (other than rollmean/SMA) is the exponential moving average (EMA). Which should be included, and which not, is not trivial to decide; it is best to make that decision according to feature requests from users, and so far none like this has been requested.
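To make the EMA idea concrete, here is a minimal base-R sketch; ema() and its smoothing parameter alpha are hypothetical names for illustration only, not data.table API:

```r
# Exponentially weighted moving average: each new value contributes alpha,
# and the previous smoothed value contributes (1 - alpha).
ema = function(x, alpha) {
  stopifnot(alpha > 0, alpha <= 1)
  r = numeric(length(x))
  r[1] = x[1]
  for (m in seq_along(x)[-1])
    r[m] = alpha * x[m] + (1 - alpha) * r[m - 1]
  r
}
ema(c(1, 2, 3, 4), alpha = 0.5)  # 1.0 1.5 2.25 3.125
```

Unlike a fixed-width rolling mean, every past observation contributes, with geometrically decaying weight, so there is no window-size argument at all.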
All of these can be implemented as extended grouping operators, with rolling windows being just one of the elements on this list.
Surely they can, but if you look at SO, and at the issues created in our repo, you will see that the few rolling functions here are responsible for 95+% of user requests. I am happy to work on EMA and other MAs (although I am not sure data.table is the best place for those), but as a separate issue. Some users, me included, have been waiting for just a simple moving average in data.table for 4 years already.
Here's my take on this, coming from a statistician's point-of-view
My point-of-view comes from Data Warehousing (where I used window function, at least once a week) and price trend analysis (where I used tens of different moving averages).
A rollmean draft is pushed to the roll branch. I found that most other packages implementing rolling mean are not able to deal well with na.rm=FALSE and NAs present in the input. This implementation handles NA consistently with mean, which imposes some extra overhead because of ISNAN calls. We could add an API for a faster but less safe version if the user is sure there are no NAs in the input.
PR is in #2795
@mattdowle answering questions from PR
Why are we doing this inside data.table? Why are we integrating it instead of contributing to existing packages and using them from data.table?
my guess is it comes down to syntax (features only possible or convenient if built into data.table; e.g. inside [...] and optimized) and building data.table internals into the rolling functions at C level; e.g. froll* should be aware of and use data.table indices and keys. If so, more specifics on that are needed; e.g. a simple short example.
For me personally it is about speed and the lack of a chain of dependencies, nowadays not easy to achieve. Keys/indices could be useful for frollmin/frollmax, but it is unlikely that a user will create an index on a measure variable, and we haven't made this optimization for non-rolling min/max yet. I don't see much sense in GForce optimization because the allocated memory is not released after a roll* call but returned as the answer (as opposed to non-rolling mean, sum, etc.).
If there is no convincing argument for integrating, then we should contribute to the other packages instead.
I listed some above; if you are not convinced, I recommend you put a question to data.table users, ask on Twitter, etc. to check the response. This feature was requested long ago and by many users. If the response doesn't convince you, then you can close this issue.
I found sparklyr can support rolling functions very well on very large-scale datasets.
@harryprince could you shed a bit more light by providing example code of how you do it in sparklyr? According to the "Window functions" dplyr vignette:
Rolling aggregates operate in a fixed width window. You won’t find them in base R or in dplyr, but there are many implementations in other packages, such as RcppRoll.
AFAIU you use a custom Spark API via sparklyr, for which the dplyr interface is not implemented, correct?
This issue is about rolling aggregates; other "types" of window functions have been in data.table for a long time.
Providing some example so we can compare (in-memory) performance vs sparklyr/SparkR would also be helpful.
It just occurred to me that this question:
how to calculate different window sizes for different columns?
has in fact a broader scope, and does not apply to rolling functions only.
For example, it seems to be perfectly reasonable to ask how to select the average product price by date, and then by week, and then maybe by week+category -- all within the same query. If we ever to implement such functionality, the natural syntax for it could be
DT[, .( mean(price, by=date),
mean(price, by=week),
mean(price, by=c(week, category)) )]
Now, if this functionality was already implemented, then it would have been a simple leap from there to rolling means:
DT[, .( mean(price, roll=5),
mean(price, roll=20),
mean(price, roll=100) )]
Not saying that this is unequivocally better than rollmean(price, 5)
-- just throwing in some alternatives to consider...
@st-pasha
how to select the average product price by date, and then by week, and then maybe by week+category -- all within the same query.
AFAIU this is already possible using ?groupingsets
, but not hooked into [.data.table
yet.
@jangorecki , @st-pasha , and Co. -- Thanks for all your work on this! I'm curious why partial window support was removed from the scope, is there any potential for that functionality to make it back on the roadmap? Would come in handy for me sometimes, and fill in a functionality gap that to my knowledge hasn't been filled in either zoo
or RcppRoll
.
This Stack Overflow Question is a good example of a rolling application that could benefit from a partial = TRUE
argument.
@msummersgill Thanks for the feedback. In the first post I explicitly linked the commit sha where the partial window feature code can be found. The implementation that was there was later removed to reduce code complexity. It was also imposing a small performance cost even when the feature was not used. This feature can (and probably should) be implemented the other way: first compute the complete windows as-is, and then fill up the missing partial windows using an extra loop over 1:window_size. That way the overhead of the feature is only noticeable when you actually use it. Nevertheless, we do provide that functionality via the adaptive argument, where partial is just a special case of adaptive rolling mean. There is an example of how to achieve partial using adaptive in the ?froll manual. Pasting it here:
library(data.table)
d = as.data.table(list(1:6/2, 3:8/4))
# adaptive window widths 1, 2, 3, 3, 3, ... emulate partial=TRUE for n=3
an = function(n, len) c(seq.int(n), rep(n, len-n))
n = an(3, nrow(d))
frollmean(d, n, adaptive=TRUE)
Of course it will not be as efficient as a non-adaptive rolling function using an extra loop to fill up just the partial windows. AFAIK zoo has a partial feature.
Do you guys have any plans to add rolling regression functions to data.table?
@waynelapierre if there is demand for that, then yes. You have my +1
thanks, this is great. Just one question though: I only see simple rolling aggregates, like a rolling mean or rolling median. Are you also implementing more refined rolling functions, such as rolling DT data.frames? Say, create a rolling DT using the last 10 obs and run an lm regression on it.
Thanks!
@randomgambit I would say it is out of scope, unless there is high demand for it. It wouldn't be very difficult to make it faster than base R/zoo just by handling the nested loop in C. But we should try to implement it using an "online" algorithm, to avoid the nested loop. That is more tricky, and since we could eventually do it for any statistic, we have to cut off the set of supported statistics at some point.
@jangorecki interesting thanks. That means I will keep using tsibble
to embed... DATA.TABLES
in a tibble
! mind blown :D
Tried to use frollmean to calculate a nonparametric "logistic curve", which shows P[y | x] for binary y, using nearest neighbors of x. I was surprised that y stored as logical was not cast automatically to integer:
DT = data.table(x = rnorm(1000), y = runif(1000) > .5)
DT[order(x), .(x, p_y = frollmean(y, 50L))]
Error in froll(fun = "mean", x = x, n = n, fill = fill, algo = algo, align = align, : x must be of type numeric
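Until logical input is supported (if ever), an explicit coercion in j works around the error; a sketch:

```r
library(data.table)
DT = data.table(x = rnorm(1000), y = runif(1000) > .5)
# frollmean() requires numeric/integer input, so coerce the logical first
DT[order(x), .(x, p_y = frollmean(as.numeric(y), 50L))]
```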
An example of how vectorized x
/n
arguments can impact performance.
https://github.com/AdrianAntico/RemixAutoML/commit/d8370712591323be01d0c66f34a70040e2867636#r34769837
Fewer loops, easier-to-read code, much faster: code using frollmean in a loop vs passing lists/vectors to frollmean, resulting in a 10x-36x speedup.
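The vectorized form computes every column/window combination in one call at C level; a sketch (frollmean accepts a list of columns for x and a vector of window sizes for n):

```r
library(data.table)
DT = data.table(V1 = rnorm(1e5), V2 = rnorm(1e5))
# one call instead of an R-level double loop over columns and windows;
# returns a list of length ncol(x) * length(n): V1/5, V1/20, V2/5, V2/20
ans = frollmean(DT[, .(V1, V2)], n = c(5, 20))
length(ans)  # 4
```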
frollapply ready: https://github.com/Rdatatable/data.table/pull/3600
### fun mean sum median
# rollfun 8.815 5.151 60.175
# zoo::rollapply 34.373 27.837 88.552
# zoo::roll[fun] 0.215 0.185 NA
# frollapply 5.404 1.419 56.475
# froll[fun] 0.003 0.002 NA
hi guys, will the user-defined FUN passed to frollapply be allowed to return an arbitrary R object or data.frame (data.table)? And could the x passed to frollapply be a data.table of character columns, not coerced to numeric, so that FUN could operate on labels and frollapply return a list? Then we could do rolling regression, or rolling tests like Benford's test, or summaries on labels.
It is always useful to provide a reproducible example. To clarify... in such a scenario you would like frollapply(dt, 3, FUN) to return a list of length nrow(dt), where each list element is the data.table returned by FUN(dt[window])?
frollapply(x=dt, n=3, fun=FUN)[[3]]
equals to FUN(dt[1:3])
frollapply(x=dt, n=3, FUN=FUN)[[4]]
equals to FUN(dt[2:4])
is that correct? @jerryfuyu0104
Currently we support multiple columns passed to the first argument, but we process them separately, looping. We would probably need some extra argument, say multi.var=FALSE, which when set to TRUE would not loop over x (as it does now: list(FUN(x[[1]]), FUN(x[[2]]))) but pass all columns at once: FUN(x).
any update for this?
I second that previous request.
Furthermore, would it be possible to support a "partial" argument to allow for partial windows?
@eliocamp could you elaborate on what a partial
window is?
@eliocamp it would be possible to support a "partial" argument. You may know this already, but support for this functionality is already there via the adaptive=TRUE argument; see the examples for details.
It would mean computing the function from the beginning through the end instead of from the half-window point. For example, for a rolling mean of width 11, the first element returned would be the mean of elements 1 through 6; the second, the mean of the 1st through 7th; and so on.
@jangorecki oh, thanks, I didn't know that! I'll check it out.
Agree, we need a partial argument, not just for convenience but also for speed: adaptive=TRUE adds an overhead.
And yes we also need rolling regression, so supplying multiple variables and rolling on them at once, not each one separately.
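To pin down the requested rolling-regression semantics, here is a naive base-R sketch (nested loop, no "online" updates; roll_lm_coef is a hypothetical name, not a data.table API — a real implementation would move this loop to C or use an online algorithm as discussed above):

```r
# Rolling slope: fit lm on each complete window of n rows and keep the
# slope coefficient; positions before the first complete window stay NA.
roll_lm_coef = function(y, x, n) {
  out = rep(NA_real_, length(y))
  for (m in n:length(y)) {
    w = (m - n + 1):m                     # current window of n rows
    out[m] = coef(lm(y[w] ~ x[w]))[[2L]]  # slope on that window
  }
  out
}
set.seed(1)
x = rnorm(100); y = 2 * x + rnorm(100)
head(roll_lm_coef(y, x, n = 30), 35)
```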
There is no update on the status of those.
I'd love to help but my C++ skills are utterly non-existent. :sweat: Do you think it might be suitable for complete newbies?
We don't code in C++ but in C. Yes, it is a good place to start. I did exactly that with frollmean.
I looked at the code and it seems daunting. But I'll update you in any case.
But now, for yet another request: frollmean(.SD) should preserve names. More generally, froll* should preserve names if the input is list-like with names.
As a frequent user of data.table, I find it extremely useful to have "time aware" features, as those currently offered in the package tsibble
. Unfortunately this package is developed around dplyr
. I wonder if a data.table implementation could be possible. The window functions proposed in this issue are a subset of those features.
@ywhcuhk Thanks for the feedback; I was actually thinking this issue was already asking for too much. Most of that is well covered by the still-lightweight package roll, which is very fast. As for the other features, I suggest creating a new issue for each feature you are interested in, so the decision whether to implement/maintain can be made for each separately. Just from looking at the readme of tsibble I don't see anything new it offers... Its title is "Tidy Temporal Data Frames" but it doesn't even seem to offer temporal joins.
Thank you @jangorecki for the response. Maybe it's a context-dependent issue. The data structure I deal with most frequently is known as "panel data", with an ID and a time. If the program is "aware" of this data feature, a lot of operations, especially time-series operations, are made very easy. For someone who knows Stata, these are the operations based on tsset and xtset, such as lead, lag, fill gap, etc. I think the "index" in data.table could be enhanced in some way to enable such operations.
Of course, these operations can be done with data.table functions like shift and by. I just thought the index in data.table has a lot of potential to be explored. I agree this should belong in a different issue, but I don't know how to move it without losing the discussion above...
@jangorecki @st-pasha
Hey guys, I'm bringing up a possible feature request. For ML and forecasting, I use the frollmean and shift functions quite a bit to generate useful features. In a scoring environment I typically only need to generate those rolling-stat features for a handful of records from the data.table. I had already created some functions for recreating rolling stats on subsets of a data.table using a bunch of lags and row-means from outside the data.table package. However, I began testing whether I could generate them faster using shift and frollmean with a subset in i. When testing it out I realized that I have to include, in i, all the rows that are needed to compute the lags and rolling means in order for the subset in i to work properly, and I'm not sure if that is the intended way to do so.
I have a few examples below where I try to create a lag column and a 2-period moving average for a single record in the data.table. In the examples, I first use the subset in i the way I would like to use it, and then show that if I include the other rows used in the lag and rolling-mean calculation, I get what I want. It would be more ideal for me if I only had to specify the rows I want the lags and rolling stats for, without having to include the other rows in i.
@st-pasha I included you in this because I know you have frollmean on the roadmap for the python version and you haven't gotten to it yet.
################################################################################
# Create fake data
################################################################################
N = 25116
data <- data.table::data.table(
DateTime = as.Date(Sys.time()),
Target = stats::filter(
rnorm(N, mean = 50, sd = 20),
filter=rep(1,10),
circular=TRUE))
data[, temp := seq(1:N)][, DateTime := DateTime - temp]
data <- data[order(DateTime)]
DateTime Target temp
1: 1952-11-20 511.1355 25116
2: 1952-11-21 497.5900 25115
3: 1952-11-22 467.2040 25114
4: 1952-11-23 446.4739 25113
5: 1952-11-24 436.8124 25112
---
25112: 2021-08-21 631.6011 5
25113: 2021-08-22 598.5684 4
25114: 2021-08-23 570.2574 3
25115: 2021-08-24 561.8330 2
25116: 2021-08-25 527.9720 1
################################################################################
# Goal: Generate a 1-period lag for a single record in a data.table (temp == 1)
################################################################################
# Shift with i
data[temp %in% c(1), newval := data.table::shift(x = .SD, n = 1, fill = NA, type = 'lag'), .SDcols = "Target"]
DateTime Target temp newval
1: 1952-11-20 511.1355 25116 NA
2: 1952-11-21 497.5900 25115 NA
3: 1952-11-22 467.2040 25114 NA
4: 1952-11-23 446.4739 25113 NA
5: 1952-11-24 436.8124 25112 NA
---
25112: 2021-08-21 631.6011 5 NA
25113: 2021-08-22 598.5684 4 NA
25114: 2021-08-23 570.2574 3 NA
25115: 2021-08-24 561.8330 2 NA
25116: 2021-08-25 527.9720 1 NA
data[temp %in% c(1,2), newval := data.table::shift(x = .SD, n = 1, fill = NA, type = 'lag'), .SDcols = "Target"]
DateTime Target temp newval
1: 1952-11-20 511.1355 25116 NA
2: 1952-11-21 497.5900 25115 NA
3: 1952-11-22 467.2040 25114 NA
4: 1952-11-23 446.4739 25113 NA
5: 1952-11-24 436.8124 25112 NA
---
25112: 2021-08-21 631.6011 5 NA
25113: 2021-08-22 598.5684 4 NA
25114: 2021-08-23 570.2574 3 NA
25115: 2021-08-24 561.8330 2 NA
25116: 2021-08-25 527.9720 1 561.833
################################################################################
# Goal: Generate a 2-period moving average for a single record in a data.table (temp == 1)
################################################################################
# Create fake data
N = 25116
data <- data.table::data.table(
DateTime = as.Date(Sys.time()),
Target = stats::filter(
rnorm(N, mean = 50, sd = 20),
filter=rep(1,10),
circular=TRUE))
data[, temp := seq(1:N)][, DateTime := DateTime - temp]
data <- data[order(DateTime)]
# frollmean with i
data[temp %in% c(1), newval := data.table::frollmean(x = .SD, n = 2), .SDcols = "Target"]
DateTime Target temp newval
1: 1952-11-20 524.4159 25116 NA
2: 1952-11-21 497.6071 25115 NA
3: 1952-11-22 527.2184 25114 NA
4: 1952-11-23 486.7455 25113 NA
5: 1952-11-24 488.6396 25112 NA
---
25112: 2021-08-21 474.2944 5 NA
25113: 2021-08-22 511.5723 4 NA
25114: 2021-08-23 535.1824 3 NA
25115: 2021-08-24 536.3908 2 NA
25116: 2021-08-25 536.3070 1 NA
data[temp %in% c(1,2), newval := data.table::frollmean(x = .SD, n = 2), .SDcols = "Target"]
DateTime Target temp newval
1: 1952-11-20 524.4159 25116 NA
2: 1952-11-21 497.6071 25115 NA
3: 1952-11-22 527.2184 25114 NA
4: 1952-11-23 486.7455 25113 NA
5: 1952-11-24 488.6396 25112 NA
---
25112: 2021-08-21 474.2944 5 NA
25113: 2021-08-22 511.5723 4 NA
25114: 2021-08-23 535.1824 3 NA
25115: 2021-08-24 536.3908 2 NA
25116: 2021-08-25 536.3070 1 536.3489
@ywhcuhk if you are still interested in the feature you were asking for, please do provide a minimal example (in a new issue in this repo). To be fair, I still don't know what feature you are precisely requesting; maybe #3241? I am tidying up this thread by moving requests into the first post and don't know how to handle yours.
rollcor, rollcov, rollrank, rollunqn and rolllm are out of scope as of the current moment. All can work using frollapply (not in the master branch, but in PRs), just not super fast. We could consider adding them to the scope in future. For the current moment, the following set: sum, mean, prod, min, max, sd, var, median feels fine and complete to me.
@jangorecki just following up here based on your comment in {roll}. I was happy to see that frollmedian and friends will be available in {data.table}! What is the status of frollmedian - do you have a rough ETA? I can see that the PR has not been worked on since January and currently fails checks.
No ETA (it requires multiple other branches to be merged first). I recommend using the rollmedian branch directly. It was made at a very stable point in master (cascading through the other rolling-related branches). I know it is being used in production.
Sounds good, I'll try that. Which rolling functions are available on that branch? Just frollmedian or also others? (I'm doing some benchmarking, so I just want to make sure I get as many of your implementations as possible) 😊
Others as well; rollmedian is the most recent of all the rolling branches, so it includes the rest as well. There is also a rewritten frollapply to apply any function, which is multi-threaded and memory optimized.
@roaldarbol if you're keen, the blocker for merging the existing PRs is lack of reviewer+author bandwidth. The relevant PRs are under label:froll, except the frollmaxN splits; ideally we'd have a chain of PRs, like for cbindlist()/mergelist(), that is easily digested by a reviewer.
To gather requirements in a single place and refresh ~4-year-old discussions, creating this issue to cover the rolling functions feature (also known as rolling aggregates, sliding window, or moving average/moving aggregates).
rolling functions features:
give.names argument, same as shift has (#5441)
by.column=FALSE (issue #4887, PR #5575)