SebKrantz / collapse

Advanced and Fast Data Transformation in R
https://sebkrantz.github.io/collapse/

feedback about JSS/arxiv paper #620

Open tdhock opened 3 months ago

tdhock commented 3 months ago

Hi @SebKrantz I read your paper about collapse, https://arxiv.org/pdf/2403.05038 and here are some comments/questions that are meant to help improve the paper (as in peer review).

Section 9 (Benchmark) reports comparative timings, each for a single data size: m <- matrix(rnorm(1e7), ncol = 1000). Whereas this approach is simple to implement, it has two drawbacks: (1) it is difficult or impossible for others to reproduce the exact timings on other CPUs, and (2) it may hide important details about the asymptotic properties of the approach (asymptotic meaning: when N is very large, does the code take time that is linear, log-linear, or quadratic in N?). I think it would be much more convincing if you did an asymptotic analysis, plotting time versus data size N, which resolves both issues. You could use my atime package, which makes it easy to do such analyses; for example, from https://tdhock.github.io/blog/2024/collapse-reshape/:

[figure: atime-wide-1]

The figure above shows that collapse is faster than the alternatives in terms of absolute computation time, but it also shows a difference in slope between the methods, which implies a different complexity class. That class is estimated in the figure below:

[figure: atime-wide-refs-1]

The figure above shows that, in the top panels (kilobytes), the black empirical timing lines are aligned with the violet N reference lines, suggesting O(N) memory complexity. In the bottom panels (seconds), the references suggest linear time, O(N), for data.table and log-linear time, O(N log N), for the others, which is a more complete description of the different approaches. Moreover, identifying these asymptotic complexity class differences suggests a clear avenue for improving the performance of the code (change the underlying algorithm to avoid a sort).
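For concreteness, here is a minimal sketch of what such an asymptotic comparison could look like with atime (the data layout and the pivot()/dcast() argument usage below are illustrative, not taken from the paper's benchmark):

library(data.table)
wide.result <- atime::atime(
  setup = {
    # long table whose size grows with N: N ids x 10 variables
    long <- data.table::CJ(id = seq_len(N), variable = letters[1:10])
    long[, value := rnorm(.N)]
  },
  "collapse::pivot" = collapse::pivot(
    long, ids = "id", names = "variable", values = "value", how = "wider"),
  "data.table::dcast" = data.table::dcast(
    long, id ~ variable, value.var = "value"))
plot(wide.result)                                # time and memory versus data size N
wide.refs <- atime::references_best(wide.result)
plot(wide.refs)                                  # closest O(N), O(N log N), ... reference curves

Plotting both the raw result and the best-fitting references shows absolute speed and complexity class in one analysis.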

Also, I see that your paper discusses reshaping in section 5.2 (Pivots) but does not cite my nc paper, which discusses similar approaches to wide-to-long data reshaping: https://journal.r-project.org/archive/2021/RJ-2021-029/index.html. Its Table 1, reproduced below, shows which features are implemented in the different reshaping functions.

[figure: Table 1 from the nc paper]

I think your paper would benefit from an analogous table, which would highlight the new features you propose in collapse and explain which of those features are present or absent in previous work.

For example, one of the advantages that I see in collapse is a single function (pivot) for both longer and wider operations (as well as recast). It would be great to see "recast" as one of the features in your Table 1, where you could highlight that it is present in collapse but not in tidyr/data.table. Also, I believe you should add a citation and comparison with cdata, which is another package implementing recast; see https://cloud.r-project.org/web/packages/cdata/vignettes/general_transform.html and the other vignettes.
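To make the "single function" point concrete, here is roughly what I mean (argument usage sketched from my reading of the pivot() documentation, not copied from the paper):

library(collapse)
df_wide <- data.frame(id = 1:3, a = rnorm(3), b = rnorm(3))
df_long <- pivot(df_wide, ids = "id", how = "longer")          # wide -> long (columns id, variable, value)
pivot(df_long, ids = "id", names = "variable", values = "value",
      how = "wider")                                            # long -> wide, back to the original shape
# how = "recast" does both steps in one call, which tidyr/data.table need two functions for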

SebKrantz commented 3 months ago

@tdhock thanks a lot for your feedback and these comparisons. I have gotten similar comments from JSS reviewers regarding the benchmark section, i.e., that it should accommodate different data sizes. Overall, though, this section should not become too large, as collapse is already part of two fair benchmarks.

Thanks also for alerting me to your paper!

I just read through your example. Very interesting. Though I have to note that the fastest way to do aggregation pivots with collapse is to use the hard-coded internal functions ("mean" in quotes) instead of mean.
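Something like this, for a long frame DT with id/variable/value columns (illustrative call only, not your blog's exact data):

collapse::pivot(DT, how = "w", FUN = "mean")  # internal: one-pass running mean at the C level
collapse::pivot(DT, how = "w", FUN = mean)    # applies base::mean to each group separately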

I will see if such a detailed exposition can make it into the article.

tdhock commented 2 months ago

Thanks for the info about pivot(FUN = "mean"), which I confirm is about 2x faster than FUN = mean; that is really impressive performance.

[figure: atime-agg-1]

By the way, I am leading a project to expand the group of contributors to R data.table, and I would like to invite you to participate. Even if you don't have any knowledge of the internal workings of data.table, I believe your C coding experience and your focus on efficiency would be a valuable asset to the data.table community. If you have any time to contribute, a good place to start would be to read and comment on open issues: https://github.com/Rdatatable/data.table/issues/

Also, there is travel funding available, which could pay for you to give a talk related to data.table at any relevant meeting (R/stats-related conference, meetup, etc.). For more info, please see the application guidelines: https://rdatatable-community.github.io/The-Raft/posts/2023-11-01-travel_grant_announcement-community_team/

SebKrantz commented 2 months ago

@tdhock thanks. 2x? The internal function can be up to 50 times faster, depending on the problem size.

options(fastverse.styling = FALSE)
library(fastverse)
#> -- Attaching packages --------------------------------------- fastverse 0.3.3 --
#> Warning: package 'data.table' was built under R version 4.3.1
#> v data.table 1.15.4     v kit        0.0.17
#> v magrittr   2.0.3      v collapse   2.0.16

DT <- data.table(id = sample.int(1e6, 5e7, TRUE), 
                 variable = sample(letters, 5e7, TRUE), 
                 value = rnorm(5e7))

system.time(pivot(DT, how = "w", FUN = mean))   # Split-apply-combine using base::mean
#>    user  system elapsed 
#>  83.900  10.430  98.158
system.time(pivot(DT, how = "w", FUN = fmean))  # Vectorized across groups with fmean + deep copy
#>    user  system elapsed 
#>   3.843   0.773   5.061
system.time(pivot(DT, how = "w", FUN = "mean")) # Internal: on the fly (1-pass, running mean)
#>    user  system elapsed 
#>   2.089   0.121   2.227
system.time(pivot(DT, how = "w", FUN = "mean")) # To confirm
#>    user  system elapsed 
#>   2.049   0.117   2.220

Created on 2024-09-03 with reprex v2.0.2

A main difference between collapse and data.table is that vectorizations in collapse are always explicit, i.e., there is no internal GForce-style mechanism that substitutes base R functions with optimized ones; instead you call the fast statistical functions explicitly and, in the case of pivot(), internal functions which avoid a further deep copy (the inspiration here comes from terra::aggregate).
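As a minimal sketch of what "explicit vectorization" means in practice (made-up data; the point is that fmean is called once with a grouping vector rather than once per group):

x <- rnorm(1e6)
g <- sample.int(1e3, 1e6, TRUE)    # 1000 groups
collapse::fmean(x, g = g)          # one C-level pass computing all group means
sapply(split(x, g), mean)          # split-apply-combine: one base::mean call per group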

And thanks a lot for the invite to contribute to data.table. I would say I have a fair understanding of parts of the data.table source code and could of course get engaged, but I don't think I'll be able to afford a lot of time there as collapse is also a very large software project and I have many other commitments.

tdhock commented 2 months ago

Hi Seb, thanks for your response. I can confirm the difference between mean / fmean / "mean" on my computer: "mean" is actually about 100x faster than mean for ids=idN, as in your example (where the number of output id values scales with the number of input rows N).

[figure: atime comparison of mean, fmean, and "mean" for ids=idN and ids=id2]

My example/blog corresponds to the ids=id2 results in the figure above (where the number of output id values is constant, always 2 ids). In this case the speedup is more modest, 2x.

FUN.list <- list(                 # aggregation functions to compare
  mean=mean,                      # base R mean, applied per group
  fmean=collapse::fmean,          # collapse's fast mean, vectorized across groups
  "'mean'"='mean')                # string dispatches to pivot()'s internal C-level mean
expr.list <- atime::atime_grid(   # one timed expression per (ids, FUN) combination
  list(
    ids=c("idN","id2"),
    FUN=names(FUN.list)),
  pivot={
    collapse::pivot(DT, how = "w", FUN = FUN.list[[FUN]], ids=ids)
  },
  collapse="\n", 
  expr.param.sep="\n")
library(data.table)
a.result <- atime::atime(         # measure time/memory for increasing data sizes N
  setup={
    set.seed(1)
    DT <- data.table(
      idN = sample.int(N/50, N, TRUE),  # number of distinct ids grows with N
      id2 = sample.int(2, N, TRUE),     # number of distinct ids fixed at 2
      variable = sample(letters, N, TRUE), 
      value = rnorm(N))
  }, 
  expr.list=expr.list)
a.refs <- atime::references_best(a.result)  # best-fitting asymptotic reference curves
a.pred <- predict(a.refs)                   # predicted N reachable within the time limit
plot(a.pred)

If you have a fair understanding of parts of the data.table source code, it would be really impactful if you could participate, even a little bit, from time to time. data.table is very widely used (1500+ direct reverse dependencies on CRAN), and there are lots of issues/feature requests but only a small handful of developers, so we need all the help we can get.

SebKrantz commented 2 months ago

Thanks @tdhock, also for showcasing your atime package. This is useful! And I'll see what I can do regarding data.table contribution.

And just as a side note regarding CRAN dependencies: I think the community, and particularly package developers, would do well to also give some more attention to the fastverse project - to which I am also happy to invite some of you if interested (and I am always happy to feature new packages). Tabular data is not the most efficient structure for many statistical operations - one reason I opted to create a class-agnostic statistical computing architecture in collapse that also supports vectors and matrices. There are many other, smaller and lightweight packages that are really efficient at doing certain things. In my own smaller packages such as osmclass and dfms I use data.table and collapse (and others), but not for the tabular stuff; for example, from data.table I find chmatch / %chin% / set / setnafill / frollmean really useful. Developers who want highly efficient and robust code need, in my view, to 'think in C', i.e., ask what is actually going on under the hood and what they really need. Very often I find the solution is not tabular data manipulation.
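To make the "non-tabular" usage pattern concrete, a small illustration with made-up vectors (these are the data.table functions mentioned above; nafill is the copy-returning sibling of setnafill, used here so the example works on a plain vector):

library(data.table)
iso3 <- sample(c("AGO", "KEN", "NGA", "ZAF"), 10, TRUE)
iso3 %chin% c("KEN", "NGA")              # fast character matching without a table
frollmean(rnorm(20), n = 5)              # rolling mean on a plain numeric vector
nafill(c(1, NA, 3, NA), type = "locf")   # last observation carried forward on a vector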

tdhock commented 2 months ago

To give more attention to the fastverse/collapse project, you may consider submitting it for the data.table Seal of Approval: https://github.com/Rdatatable/data.table/blob/master/Seal_of_Approval.md