SebKrantz / collapse

Advanced and Fast Data Transformation in R
https://sebkrantz.github.io/collapse/

Pivoting Long to Wide with aggregation #582

Closed AdrianAntico closed 1 month ago

AdrianAntico commented 1 month ago

Is there a way to aggregate the data in a long-to-wide pivot using different statistical aggregation functions? For example, data.table::dcast has a fun.aggregate parameter for this. I went through the docs but didn't see a way to handle it. From what I can tell, pivot takes the last value of the "value" column rather than something like the mean or sum.

I'm working on benchmarking some dataframe packages, similar to the DuckDB benchmark, except for additional methods. https://github.com/AdrianAntico/Benchmarks

SebKrantz commented 1 month ago

Thanks, yeah this could be added, and with the fast statistical functions it would even be pretty fast. I’ve just not seen the need yet because you can always aggregate beforehand. But you are right that it is possible to save a grouping step and some syntax by adding it in pivot. Let me do this shortly.

SebKrantz commented 1 month ago

Also, in your benchmarks please add sort = FALSE to set_collapse(...) for better performance and comparability.

AdrianAntico commented 1 month ago

@SebKrantz The sort = FALSE made a big difference. I now show collapse as the fastest in aggregation overall. It was previously the fastest for the 1M and 10M row datasets, but now it's also the fastest on the 100M and 1B row datasets.

AdrianAntico commented 1 month ago

@SebKrantz for a dcast benchmark, I'll do the aggregation and then the pivot for comparison purposes. I'm looking to use "sum" as the aggregation function for this. If there's a faster way, let me know and I'll update it.

SebKrantz commented 1 month ago

@AdrianAntico I have implemented this now through new FUN and FUN.args arguments. Available in version 2.0.15 through r-universe (once PR #583 is merged). It supports both R-level functions and a selection of hard-coded internal versions. For your benchmarks the internal version FUN = "sum" is appropriate.
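
For reference, a minimal sketch of how the new arguments might be used (hypothetical data frame d and column names, not from the benchmark; requires collapse >= 2.0.15):

pivot(d, ids = "id", names = "variable", values = "value", how = "wider", FUN = "sum")  # hard-coded internal aggregation
pivot(d, ids = "id", names = "variable", values = "value", how = "wider", FUN = fsum, FUN.args = list(na.rm = FALSE))  # R-level function, arguments passed via FUN.args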

AdrianAntico commented 1 month ago

@SebKrantz That's great! I'll make the update and rerun

SebKrantz commented 1 month ago

Regarding the lag benchmark, a number of comments are in order. The first is that with panel data collapse does not provide a 'shifting' lag but a fully identified time-lag (e.g. supplying both a g and a t argument to flag). It is possible (and efficient) to only supply g, but this requires the data to be consecutive within groups, e.g., g = c(1,1,1,2,2,2) is ok, but g = c(1,2,1,1,2,2) is not ok without a corresponding time variable. This of course generalizes to multivariate groupings.
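
A tiny illustration of this point (made-up vectors, not from the benchmark data):

x <- 1:6
flag(x, 1, g = c(1,1,1,2,2,2))                      # consecutive groups: gives c(NA,1,2,NA,4,5)
flag(x, 1, g = c(1,2,1,1,2,2), t = c(1,1,2,3,2,3))  # non-consecutive groups: also supply a time variable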

Then, there are several ways to perform the panel lag: using a grouped or indexed data frame created with fgroup_by or findex_by (which creates a panel data structure where each observation is fully identified), or using the lag operator L or flag itself. Examples:

wlddev |> fgroup_by(iso3c) |> fselect(PCGDP, LIFEEX) |> flag(1:5)
wlddev |> findex_by(iso3c, year) |> fselect(PCGDP, LIFEEX) |> flag(1:5) # findex_by(iso3c) also works
L(wlddev, 1:5, LIFEEX + PCGDP ~ iso3c, ~year) # L(wlddev, 1:5, LIFEEX + PCGDP ~ iso3c) also works
wlddev |> ftransformv(c(LIFEEX, PCGDP), flag, 1:5, iso3c, year, apply = FALSE) # omitting year also works

Of these options, the latter two are the more efficient ones, as they avoid creating grouped data structures. In general, the grouping arguments built into collapse's statistical functions allow grouped mutation operations on the fly.

For your benchmark example, I would suggest something like

data[, paste0("Lag Daily Liters ", 1L:5L) := flag(list(`Daily Liters`), 1:5, list(Customer,Brand,Category))]
# Same, pure collapse
settfm(data, flag(list(`Daily Liters`), 1:5, list(Customer,Brand,Category)) |> setNames(paste0("Lag Daily Liters ", 1:5)))
# Or simply
L(data, 1:5, `Daily Liters` ~ Customer + Brand + Category)

Note again the caveat that the observations need to be consecutive within groups for this to work without a time variable.

AdrianAntico commented 1 month ago

Thanks for the support! Also, there is a date variable named Date in the dataset. How would that affect which version of the code to use to make it as fast as possible?

SebKrantz commented 1 month ago

So just using groups, e.g. L(data, 1:5, `Daily Liters` ~ Customer + Brand + Category), is as fast as possible, but again it is only correct if observations are indeed consecutive within groups. If the data is daily, then L(data, 1:5, `Daily Liters` ~ Customer + Brand + Category, ~ Date) could be used to compute a fully identified lag, i.e., the order of rows in the data no longer matters for the correctness of the computation. This is of course doing more than other frameworks and is ideally avoided in a benchmark.

Another side note is that the latter only makes sense if the data is indeed daily. If it is monthly, you get a lot of gaps in time and NAs. In that case coercion to an appropriate class such as zoo::as.yearmon(Date) is needed to run time computations on the data correctly.

For your case I guess the group lag should be fine if the data can be sorted (e.g. setorder(data, group, Date)) beforehand. I'd say it's difficult to run a fair benchmark here because collapse essentially does something different than other frameworks. It will probably be bloody fast doing just the group lag, but that is because it is not internally 'shifting' the data like e.g. data.table.
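
A sketch of that sorted group-lag workflow (grouping columns taken from the benchmark code above; the sort would be done outside the timed section):

setorder(data, Customer, Brand, Category, Date)             # data.table sort by groups, then Date
L(data, 1:5, `Daily Liters` ~ Customer + Brand + Category)  # group-only lag on consecutive data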

AdrianAntico commented 1 month ago

The data is daily, however, not all series are of the same length. Will that affect the output?

Also, I will sort the data beforehand and not count it in the benchmark timings.

Lastly, as long as I end up with the same results, the fastest method should be used. If it's really fast, then so be it.

SebKrantz commented 1 month ago

No, that won't affect the output; series can have different lengths.

SebKrantz commented 1 month ago

@AdrianAntico regarding the casting benchmark, you can replace

  x = collap(temp, value ~ Date + variable + Customer + Brand + Category, sum)
  x = pivot(data = x, ids = c("Date","Customer","Brand","Category"), values = c("value"), how = "wider")
  roworderv(X = x, cols = c("Date", "Customer","Brand","Category"), decreasing = FALSE)

with

pivot(data = temp, ids = c("Date","Customer","Brand","Category"), values = c("value"), how = "wider", FUN = "sum", sort = "ids")
# Or sort = TRUE if you also want the casted columns to be sorted alphabetically. The default sort = FALSE preserves first-appearance order

after installing collapse 2.0.15 from r-universe using

install.packages("collapse", repos = "https://fastverse.r-universe.dev")

Final comment: if you'd like to replace sum with fsum in the aggregations with collap (in AggSum_collapse.R), you can remove the mask = "all" setting from set_collapse(). The masking of base R functions with collapse's optimized versions is done in the DuckDB benchmarks only for presentational purposes, to generate cleaner-looking code.
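
For instance, a sketch of what that would look like (column names as in the benchmark code above; the exact set_collapse() options used in AggSum_collapse.R may differ):

set_collapse(sort = FALSE)  # no mask = "all" needed when calling fsum directly
x = collap(temp, value ~ Date + variable + Customer + Brand + Category, fsum)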

AdrianAntico commented 1 month ago

@SebKrantz I just updated the code and reran the benchmark. The total run time increased slightly, but collapse is still the fastest by quite a bit...

SebKrantz commented 1 month ago

Thanks. I'll close the issue now. But yeah it's good to see collapse perform so well on a Windows machine. I might make a post about this soon.

AdrianAntico commented 1 month ago

@SebKrantz I'm curious why polars and duckdb aren't performing so well, at least compared to the DuckDB benchmark. I suspect there are additional installation steps I didn't take with duckdb, but there isn't anything on their website explaining how to install on Windows with full optimization enabled (assuming that's the case).

Also, I am going to be adding additional operations, such as union, joins, filtering, and whatever else comes to mind or is requested.

AdrianAntico commented 1 month ago

@SebKrantz is there a union method in collapse? I looked through the docs but didn't see anything.

SebKrantz commented 1 month ago

Ok, feel free to ask further questions here. There is no union method, but you could easily program it using the fmatch() function (which also supports data frames), together with ss() and rowbind().

Regarding performance, yes I also suspect that perhaps polars and duckdb are not multithreaded, but perhaps it is also that Windows just slows down threaded execution. Of course you should try to max out those frameworks, but I guess if it doesn’t work for you ‘out of the box’, it won’t work for >80% of others using Windows. It is good to see that collapse, which is much less parallel but has very efficient algorithms, works out of the box and delivers great performance on Windows.

AdrianAntico commented 1 month ago

I've watched Polars in the Task Manager while executing and I see multithreading taking place, but not with DuckDB. I even tried setting multithreading a few different ways with DuckDB but it just isn't happening. The only recommendation I received was to install Linux and modify some flags, like in the DuckDB benchmark. I want to keep this to a Windows / local environment, however, as a lot of users operate under the same conditions.

SebKrantz commented 1 month ago

@AdrianAntico this is a union function that should work with both vectors and data frames with identical columns. It can also trace the origin of rows by passing idcol = "id" on to rowbind().

funion <- function(x, y, ...) {
  # positions in y that also occur in x (NAs for unmatched rows/elements of x are removed)
  ind <- na_rm(fmatch(x, y, overid = 2L))
  # no overlap: simply concatenate vectors or row-bind data frames
  if(length(ind) == 0L) return(if(is.atomic(x)) c(x, y) else rowbind(x = x, y = y, ...))
  # otherwise append only the elements/rows of y not already present in x
  if(is.atomic(x)) return(c(x, y[-ind]))
  rowbind(x = x, y = ss(y, -ind), ...)
}
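
A quick usage sketch (hypothetical data frames with identical columns):

d1 <- data.frame(a = 1:3, b = c("x","y","z"))
d2 <- data.frame(a = 2:4, b = c("y","z","w"))
funion(d1, d2)                # rows of d2 already present in d1 are dropped
funion(d1, d2, idcol = "id")  # additionally records whether each row came from x or y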

SebKrantz commented 1 month ago

And just one more note on the lag benchmark: L() by default keeps the id variables used to group/index the data. So if you use cbind(data, L(...)), you should pass keep.ids = FALSE to L() or replace cbind() with ftransform(). The way I would do this in a single function call, though, is with ftransformv():

ftransformv(data, c("Daily Liters", "Daily Units", "Daily Margin"), flag, 1:5, list(Customer, Brand, Category), apply = FALSE)

Doesn't matter for performance, just syntax and having a reasonable result I guess.

AdrianAntico commented 1 month ago

I just updated the collapse lag benchmark with your new code. Total runtime decreased to a third of the previous run time... Not bad.

SebKrantz commented 1 month ago

Oh, that's interesting. I thought L() was quite efficient. I presume this was due to cbind() which can have quite some overhead. collapse has an equivalent function for data frames called add_vars() which has much less overhead.
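
For illustration, a sketch of that substitution (same benchmark columns as above; keep.ids = FALSE avoids duplicating the grouping columns):

# instead of: data <- cbind(data, L(...))
add_vars(data) <- L(data, 1:5, `Daily Liters` ~ Customer + Brand + Category, keep.ids = FALSE)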

AdrianAntico commented 1 month ago

I just added the Union benchmark. The collapse timings came in pretty slow, but it finished at least. DuckDB again can't finish on the 100M-row dataset. Let me know if you want me to keep the Union operation for collapse. Given that it's not a built-in function, I can remove it.

SebKrantz commented 1 month ago

Ok, yeah I guess matching records in order to then remove them from one table is just doing too much... I haven't thought about that much, but obviously there are more optimized algorithms at play in other packages. In that case, better to remove it.

AdrianAntico commented 1 month ago

I'm just doing a Union All operation, so I'm not interested in removing duplicates, at least not for this version of the benchmark. I went ahead and removed it for now.

SebKrantz commented 1 month ago

@AdrianAntico Are you going to add the join operation soon? Also let me know when you are confident for this benchmark to be shared.

AdrianAntico commented 1 month ago

@SebKrantz I will probably add a join operation pretty soon but it might not be as comprehensive as the other operations. I just started a new job so my free time has become more scarce. I should be able to get the basic version added in a week.

As for DuckDB, I am able to show that duckdb is aware of all threads when I request it to show me how many threads are in use after setting that value. However, I think there is still something not working properly, which is probably a Windows or R thing, but that's the nature of this benchmark so I believe it's useful to show. If one wants to use DuckDB, perhaps it's best to use a non-Windows or non-R version. I didn't try any of the Tidyverse versions of DuckDB. Maybe they figured out how to get it running properly and I should give them a try. That might take a bit longer than a week or two to test out, however. If you have any thoughts or advice on that let me know.

I also reran Polars from VSCode instead of RStudio to see if there was any difference in run times but there wasn't, so I'm comfortable with the Python results.

SebKrantz commented 1 month ago

Ok, no rush. I guess a left join and inner join should be sufficient for most practical purposes. Happy to share about it once you have them; it's good PR for collapse, obviously. Just anticipate that people will try to replicate things, and the results may surprise some DuckDB users.

AdrianAntico commented 3 weeks ago

@SebKrantz Hey there, I just updated the benchmark. I added the left and inner join operations. Also, I updated the package versions for data.table, polars, and duckdb and reran.

SebKrantz commented 3 weeks ago

Great work @AdrianAntico! I will submit collapse 2.0.15 to CRAN and then post a bit about this benchmark.