Closed AdrianAntico closed 1 month ago
Thanks, yeah this could be added, and with the fast statistical functions it would even be pretty fast. I’ve just not seen the need yet because you can always aggregate beforehand. But you are right that it is possible to save a grouping step and some syntax by adding it in pivot. Let me do this shortly.
Also, in your benchmarks please add `sort = FALSE` to `set_collapse(...)` for better performance and comparability.
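For reference, a minimal sketch of what that call might look like (`nthreads` is shown only as an additional common setting, not something requested above):

```r
library(collapse)

# Disable sorted grouping globally: groups are kept in first-appearance
# order, which is typically faster than sorting them
set_collapse(sort = FALSE, nthreads = 4)

# Inspect the current global option
get_collapse("sort")
```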
@SebKrantz The `sort = FALSE` setting made a big difference. I now show collapse being the fastest in aggregation overall. It was the fastest for the 1M and 10M row datasets previously, but now it's the fastest on those as well as the 100M and 1B row datasets.
@SebKrantz for a dcast benchmark, I'll do the aggregation and then the pivot for comparison purposes. I'm looking to use "sum" as the aggregation function for this. If there's a faster way, let me know and I'll update it.
@AdrianAntico I have implemented this now through new `FUN` and `FUN.args` arguments, available in version 2.0.15 through r-universe (once PR #583 is merged). It supports both R-level functions and a selection of hard-coded internal versions. For your benchmarks the internal version `FUN = "sum"` is appropriate.
@SebKrantz That's great! I'll make the update and rerun.
Regarding the lag benchmark, a number of comments are in order. The first is that with panel data collapse does not provide a 'shifting' lag but a fully identified time-lag (e.g., supplying both a `g` and a `t` argument to `flag()`). It is possible (and efficient) to only supply `g`, but this requires the data to be consecutive within groups: e.g., `g = c(1,1,1,2,2,2)` is ok, but `g = c(1,2,1,1,2,2)` is not ok without a corresponding time variable. This of course generalizes to multivariate groupings.
Then, there are several ways to perform the panel-lag: using a grouped or indexed data frame created with `fgroup_by` or `findex_by` (which creates a panel-data structure where each observation is fully identified), or using the lag operator `L` or `flag` itself. Examples:
```r
wlddev |> fgroup_by(iso3c) |> fselect(PCGDP, LIFEEX) |> flag(1:5)
wlddev |> findex_by(iso3c, year) |> fselect(PCGDP, LIFEEX) |> flag(1:5) # findex_by(iso3c) also works
L(wlddev, 1:5, LIFEEX + PCGDP ~ iso3c, ~year) # L(wlddev, 1:5, LIFEEX + PCGDP ~ iso3c) also works
wlddev |> ftransformv(c(LIFEEX, PCGDP), flag, 1:5, iso3c, year, apply = FALSE) # omitting year also works
```
Of these options, the latter two are efficient - as they avoid creating grouped data structures. In general, the grouping arguments built into collapse's statistical functions allow grouped-mutation operations on the fly.
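A toy sketch of such an on-the-fly grouped operation (hypothetical data, not the benchmark dataset):

```r
library(collapse)

# A small panel: two groups, observations consecutive within groups
g <- c(1, 1, 1, 2, 2, 2)
x <- c(10, 20, 30, 40, 50, 60)

# Grouped lag computed on the fly: no grouped data structure is created
flag(x, 1, g = g)   # NA 10 20 NA 40 50

# The same g argument works for grouped statistics, e.g. group means
fmean(x, g = g)     # 20 50
```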
For your benchmark example, I would suggest something like

```r
data[, paste0("Lag Daily Liters ", 1L:5L) := flag(list(`Daily Liters`), 1:5, list(Customer, Brand, Category))]

# Same, pure collapse
settfm(data, flag(list(`Daily Liters`), 1:5, list(Customer, Brand, Category)) |> setNames(paste0("Lag Daily Liters ", 1:5)))

# Or simply
L(data, 1:5, `Daily Liters` ~ Customer + Brand + Category)
```
Note again the caveat that the observations need to be consecutive within groups for this to work without a time variable.
Thanks for the support! Also, there is a date variable named Date in the dataset. How would that affect which code set to utilize to make it as fast as possible?
So just using groups, e.g. ``L(data, 1:5, `Daily Liters` ~ Customer + Brand + Category)``, is as fast as possible, but again only correct if observations are indeed consecutive within groups. If the data is indeed daily, then ``L(data, 1:5, `Daily Liters` ~ Customer + Brand + Category, ~ Date)`` could be used to compute a fully identified lag, i.e., the order of rows in the data then does not matter at all for the correctness of the computation. This is of course doing more than other frameworks and is ideally avoided in a benchmark. Another side note is that the latter only makes sense if the data is indeed daily. If it is monthly, you get a lot of gaps in time and NAs; in that case coercion to an appropriate class such as `zoo::as.yearmon(Date)` is needed to run time-based computations on the data appropriately. For your case I guess the group lag should be fine if the data can be sorted beforehand (e.g. `setorder(data, group, Date)`). I'd say it's difficult to run a fair benchmark here because essentially collapse does something different than other frameworks. It will probably be bloody fast doing just the group lag, but that is because it is not internally 'shifting' the data like e.g. data.table.
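A toy sketch of the fully identified lag (hypothetical data; rows deliberately not ordered by group):

```r
library(collapse)

g <- c(1, 2, 1, 2)        # group ids, not consecutive
t <- c(1, 1, 2, 2)        # time variable fully identifying each row
x <- c(10, 40, 20, 50)

# Because t is supplied, the row order does not matter:
# each observation gets its group's value at time t - 1
flag(x, 1, g = g, t = t)  # NA NA 10 40
```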
The data is daily; however, not all series are of the same length. Will that affect the output?
Also, I will sort the data beforehand and not count it in the benchmark timings.
Lastly, as long as I end up with the same results, the fastest method should be utilized. If it's really fast then so be it.
No, that doesn't affect the output; series can have different lengths.
@AdrianAntico regarding the casting benchmark, you can replace

```r
x = collap(temp, value ~ Date + variable + Customer + Brand + Category, sum)
x = pivot(data = x, ids = c("Date","Customer","Brand","Category"), values = c("value"), how = "wider")
roworderv(X = x, cols = c("Date","Customer","Brand","Category"), decreasing = FALSE)
```

with

```r
pivot(data = temp, ids = c("Date","Customer","Brand","Category"), values = c("value"), how = "wider", FUN = "sum", sort = "ids")
# Or sort = TRUE if you also want the casted columns to be sorted alphabetically.
# The default sort = FALSE preserves first-appearance order.
```

after installing collapse 2.0.15 from r-universe using

```r
install.packages("collapse", repos = "https://fastverse.r-universe.dev")
```
Final comment: if you'd like to replace `sum` with `fsum` in the aggregations with `collap` (in AggSum_collapse.R), you can remove the `mask = "all"` setting from `set_collapse()`. The masking of base-R functions with collapse's optimized versions is done in the DuckDB benchmarks only for presentational purposes, to generate cleaner-looking code.
@SebKrantz I just updated the code and reran the benchmark. The total run time slowed down slightly but is still the fastest by quite a bit...
Thanks. I'll close the issue now. But yeah it's good to see collapse perform so well on a Windows machine. I might make a post about this soon.
@SebKrantz I'm curious why polars and duckdb aren't performing so well, at least compared to the DuckDB benchmark. I suspect there are additional installation steps I didn't take with DuckDB, but there isn't anything on their website explaining how to install on Windows with full optimization enabled (assuming that's the case).
Also, I am going to be adding additional operations, such as union, joins, filtering, and whatever else comes to mind or is requested.
@SebKrantz is there a union method in collapse? I looked through the docs but didn't see anything.
Ok, feel free to ask further questions here. There is no union method, but you could easily program one using the `fmatch()` function (which also supports data frames), together with `ss()` and `rowbind()`.
Regarding performance, yes I also suspect that perhaps polars and duckdb are not multithreaded, but perhaps it is also that Windows just slows down threaded execution. Of course you should try to max out those frameworks, but I guess if it doesn’t work for you ‘out of the box’, it won’t work for >80% of others using Windows. It is good to see that collapse, which is much less parallel but has very efficient algorithms, works out of the box and delivers great performance on Windows.
I've watched Polars in the Task Manager while executing and I see multithreading taking place, but not with DuckDB. I even tried setting multithreading a few different ways with DuckDB, but it just isn't happening. The only recommendation I received was to install Linux and modify some flags, like in the DuckDB benchmark. I want to keep this to a Windows/local environment, however, as a lot of users operate under the same conditions.
@AdrianAntico this is a union function that should work with both vectors and data frames with equal columns. It can also trace the origin of rows by passing `idcol = "id"` to `rowbind()`:
```r
funion <- function(x, y, ...) {
  # Positions in y of elements/rows that also appear in x
  ind <- na_rm(fmatch(x, y, overid = 2L))
  # No overlap: simply concatenate / row-bind
  if(length(ind) == 0L) return(if(is.atomic(x)) c(x, y) else rowbind(x = x, y = y, ...))
  # Overlap: drop the matched elements/rows from y before combining
  if(is.atomic(x)) return(c(x, y[-ind]))
  rowbind(x = x, y = ss(y, -ind), ...)
}
```
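A quick illustration of how the function behaves on atomic vectors (restated here so the snippet is self-contained; assumes collapse is attached for `fmatch`, `na_rm`, `ss`, and `rowbind`):

```r
library(collapse)

funion <- function(x, y, ...) {
  ind <- na_rm(fmatch(x, y, overid = 2L))
  if(length(ind) == 0L) return(if(is.atomic(x)) c(x, y) else rowbind(x = x, y = y, ...))
  if(is.atomic(x)) return(c(x, y[-ind]))
  rowbind(x = x, y = ss(y, -ind), ...)
}

# Values of y already in x are dropped before combining
funion(c(1, 2, 3), c(2, 3, 4))  # 1 2 3 4

# With no overlap it reduces to simple concatenation
funion(1:2, 4:5)                # 1 2 4 5
```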
And just one more note on the lag benchmark: `L()` by default keeps the id variables used to group/index the data. If you use `cbind(data, L(...))`, you should pass `keep.ids = FALSE` to `L()` or replace `cbind()` with `ftransform()`. The way I would do this in a single function call, though, is with `ftransformv()`:

```r
ftransformv(data, c("Daily Liters", "Daily Units", "Daily Margin"), flag, 1:5, list(Customer, Brand, Category), apply = FALSE)
```
Doesn't matter for performance, just syntax and having a reasonable result, I guess.
I just updated the collapse lag benchmark with your new code. Total runtime decreased to 1/3rd of the previous total run time... Not bad
Oh, that's interesting. I thought `L()` was quite efficient. I presume this was due to `cbind()`, which can have quite some overhead. collapse has an equivalent function for data frames called `add_vars()`, which has much less overhead.
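A minimal sketch of the `add_vars()` approach on a toy data frame (hypothetical data; `keep.ids = FALSE` avoids duplicating the id column):

```r
library(collapse)

d <- data.frame(g = c(1, 1, 2, 2), x = c(10, 20, 30, 40))

# Append the grouped lag as a new column; add_vars() has much less
# overhead than cbind() for data frames
add_vars(d) <- L(d, 1, x ~ g, keep.ids = FALSE)
d  # now contains an extra lag column: NA 10 NA 30
```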
I just added the Union benchmark. The collapse timings came in pretty slow, but it finished at least. DuckDB again can't finish on the 100M-row dataset. Let me know if you want me to keep the Union operation for collapse. Given that it's not a built-in function, I can remove it.
Ok, yeah I guess matching records in order to then remove them from one table is just doing too much... I haven't thought about that much, but obviously there are more optimized algorithms at play in other packages. In that case, better to remove it.
I'm just doing a Union All operation, so I'm not interested in removing duplicates, at least not for this version of the benchmark. I went ahead and removed it for now.
@AdrianAntico Are you going to add the join operation soon? Also let me know when you are confident for this benchmark to be shared.
@SebKrantz I will probably add a join operation pretty soon but it might not be as comprehensive as the other operations. I just started a new job so my free time has become more scarce. I should be able to get the basic version added in a week.
As for DuckDB, I am able to show that duckdb is aware of all threads when I request it to show me how many threads are in use after setting that value. However, I think there is still something not working properly, which is probably a Windows or R thing, but that's the nature of this benchmark so I believe it's useful to show. If one wants to use DuckDB, perhaps it's best to use a non-Windows or non-R version. I didn't try any of the Tidyverse versions of DuckDB. Maybe they figured out how to get it running properly and I should give them a try. That might take a bit longer than a week or two to test out, however. If you have any thoughts or advice on that let me know.
I also reran Polars from VSCode instead of RStudio to see if there was any difference in run times but there wasn't, so I'm comfortable with the Python results.
Ok, no rush. I guess a left join and an inner join should be sufficient for most practical purposes. Happy to share about it once you have them; it's good PR for collapse, obviously. Just anticipate that people will try to replicate things: the results may surprise some DuckDB users.
@SebKrantz Hey there, I just updated the benchmark. I added the left and inner join operations. Also, I updated the package versions for data.table, polars, and duckdb and reran.
Great work @AdrianAntico! I will submit collapse 2.0.15 to CRAN and then post a bit about this benchmark.
Is there a way to aggregate the data in a long-to-wide pivot using different stat aggregate functions? For example, with `data.table::dcast` there is a function parameter `fun.aggregate`. I went through the docs but didn't see a way to handle this. From what I can tell, the pivot is taking the last value of the "value" column vs something like mean or sum.
I'm working on benchmarking some dataframe packages, similar to the DuckDB benchmark, except for additional methods. https://github.com/AdrianAntico/Benchmarks