A benchmark comparing parallelization gains in `data.table` routines for inputs with more rows vs. more columns

GitHub Action:

jobs:
  comment:
    runs-on: ubuntu-latest

    steps:
    - name: Checkout
      uses: actions/checkout@v4

    - name: Set up R
      uses: r-lib/actions/setup-r@v2
      with:
        use-public-rspm: true

    - name: Install dependencies
      run: |
        Rscript -e 'install.packages(c("microbenchmark", "data.table"))'

    - name: Run the benchmark
      run: |
        R -e 'source("script.r")'

R script:

library(data.table)
library(microbenchmark)

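# Benchmark a set of data.table routines for a given input shape and thread count.
run_benchmarks <- function(rowCount, colCount, threadCount) {

  setDTthreads(threadCount)
  # Build a rowCount x colCount table of uniform random doubles (columns V1, V2, ...).
  dt <- data.table(matrix(runif(rowCount * colCount), nrow = rowCount, ncol = colCount))
  threadLabel <- ifelse(threadCount == 1, "thread", "threads")
  # Report getDTthreads() rather than threadCount, since the requested value may be capped.
  cat(sprintf("\nRunning benchmarks with %d %s, %d rows, and %d columns:\n", getDTthreads(), threadLabel, rowCount, colCount))

  benchmarks <- microbenchmark(
    between = dt[dt[[1]] %between% c(0.4, 0.6)],
    fcoalesce = fcoalesce(dt[[1]], dt[[2]]),
    fifelse = fifelse(dt[[1]] > 0.5, dt[[1]], 0),
    # Note: setorder() sorts dt by reference, so dt stays sorted after its first evaluation.
    forder = setorder(dt, V1),
    frollmean = frollmean(dt[[1]], 10),
    GForce_sum = dt[, .(sum(V1))],
    nafill = nafill(dt[[1]], type = "const", fill = 0),
    subsetting = dt[dt[[1]] > 0.5, ],
    times = 10
  )

  # Print the mean time per expression, in the unit that summary() picks automatically.
  benchmark_summary <- summary(benchmarks)
  meanTime <- benchmark_summary$mean
  names(meanTime) <- benchmark_summary$expr
  print(meanTime)
}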

for(threadCount in c(1, 2)) {
  run_benchmarks(1000000, 10, threadCount)
  run_benchmarks(10, 1000000, threadCount)
}
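One caveat about the `forder` entry in the script above: `setorder()` reorders its input by reference, so once it has run, `dt` stays sorted for every later evaluation. A minimal sketch of that behaviour and of a copy-based alternative follows; the tiny tables and the variable names are made up for illustration and are not part of the benchmark.

```r
library(data.table)

dt <- data.table(V1 = c(3, 1, 2), V2 = c("c", "a", "b"))

# setorder() reorders dt in place and returns it invisibly.
setorder(dt, V1)
sortedInPlace <- !is.unsorted(dt$V1)

# Alternative that leaves the input untouched: sort a copy (dt[order(V1)] is another option).
original <- data.table(V1 = c(3, 1, 2), V2 = c("c", "a", "b"))
sortedCopy <- setorder(copy(original), V1)

print(original$V1)    # input order is preserved
print(sortedCopy$V1)  # the copy is sorted
```

Note that sorting a copy inside the timed expression also times the copy, so it is a trade-off rather than a drop-in replacement.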

Notes:

GHA virtual environments provide at most two cores. (I tried using more than two threads, but getDTthreads() always returned 2 whenever threadCount was greater than 1.)
I tried smaller sizes (fewer than a million rows/columns, i.e. 1e-1, 1e-2, and 1e-3 of 1000000), but the results were consistent, so I'm sharing only the run with the largest size I could go for.
10000000 rows work, but that many columns don't (the run exceeds the memory limit).
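To confirm the thread cap on a given runner, something like the following could be run before the benchmark. This is only a sketch: the request for 8 threads is an arbitrary value chosen to exceed the runner's core count, and the variable names are mine.

```r
library(data.table)

# Logical cores visible to R on this machine (can be NA on some platforms).
availableCores <- parallel::detectCores()

# Ask for more threads than the runner is expected to have.
# setDTthreads() invisibly returns the previous setting, so it can be restored later.
previousThreads <- setDTthreads(8)
effectiveThreads <- getDTthreads()

cat(sprintf("Logical cores: %d, requested: 8, effective data.table threads: %d\n",
            availableCores, effectiveThreads))

# Restore the earlier setting.
setDTthreads(previousThreads)
```

If the effective value is lower than the requested one, the benchmark is effectively comparing 1 thread against whatever the runner allows.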

Output:

> source("script.r")

Running benchmarks with 1 thread, 1000000 rows, and 10 columns:
   between  fcoalesce    fifelse     forder  frollmean GForce_sum     nafill 
 11613.454   1746.225   2970.750  43385.108   5263.755   2525.294   1452.997 
subsetting 
 10040.670 

Running benchmarks with 1 thread, 10 rows, and 1000000 columns:
     between    fcoalesce      fifelse       forder    frollmean   GForce_sum 
1466559.1608      40.8461      43.6460   67840.8282     117.6300   76577.4494 
      nafill   subsetting 
     81.2152 1411502.1129 

Running benchmarks with 2 threads, 1000000 rows, and 10 columns:
   between  fcoalesce    fifelse     forder  frollmean GForce_sum     nafill 
 16441.934   1213.507   4450.650  39749.267   4420.730   2921.514   1003.130 
subsetting 
 18015.475 

Running benchmarks with 2 threads, 10 rows, and 1000000 columns:
     between    fcoalesce      fifelse       forder    frollmean   GForce_sum 
1451703.7321      44.0603      34.5324   69205.4992     107.4240   94646.1580 
      nafill   subsetting 
     76.2702 1480119.1212

Output screenshot
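For the 1000000-row case, the gain from the second thread can be read off as a ratio of the means above. A small sketch with the figures copied from that output; it assumes both runs were reported in the same unit, which the printed means alone do not confirm.

```r
# Mean timings copied from the output above (1000000 rows, 10 columns).
oneThread <- c(between = 11613.454, fcoalesce = 1746.225, fifelse = 2970.750,
               forder = 43385.108, frollmean = 5263.755, GForce_sum = 2525.294,
               nafill = 1452.997, subsetting = 10040.670)
twoThreads <- c(between = 16441.934, fcoalesce = 1213.507, fifelse = 4450.650,
                forder = 39749.267, frollmean = 4420.730, GForce_sum = 2921.514,
                nafill = 1003.130, subsetting = 18015.475)

# A ratio above 1 means the two-thread run was faster for that routine.
speedup <- oneThread / twoThreads[names(oneThread)]
print(round(sort(speedup, decreasing = TRUE), 2))
```

With only 10 repetitions per expression, ratios close to 1 are probably within run-to-run noise.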

Added the remaining one (cross join) to my script:

library(data.table)
library(microbenchmark)

run_benchmarks <- function(rowCount, colCount, threadCount) {

  setDTthreads(threadCount)
  dt <- data.table(matrix(runif(rowCount * colCount), nrow = rowCount, ncol = colCount))
  threadLabel <- ifelse(threadCount == 1, "thread", "threads")
  cat(sprintf("\nRunning benchmarks with %d %s, %d rows, and %d columns:\n", getDTthreads(), threadLabel, rowCount, colCount))

  benchmarks <- microbenchmark(
    forder = setorder(dt, V1),
    GForce_sum = dt[, .(sum(V1))],
    subsetting = dt[dt[[1]] > 0.5, ],
    frollmean = frollmean(dt[[1]], 10),
    fcoalesce = fcoalesce(dt[[1]], dt[[2]]),
    fifelse = fifelse(dt[[1]] > 0.5, dt[[1]], 0),
    between = dt[dt[[1]] %between% c(0.4, 0.6)],
    nafill = nafill(dt[[1]], type = "const", fill = 0),
    # Cross join of up to 5 sampled row indices and up to 5 sampled column indices.
    CJ = CJ(sample(rowCount, size = min(rowCount, 5)), sample(colCount, size = min(colCount, 5))),
    times = 10
  )

  benchmark_summary <- summary(benchmarks)
  meanTime <- benchmark_summary$mean
  names(meanTime) <- benchmark_summary$expr
  print(meanTime)
}

for(threadCount in c(1, 2)) {
  run_benchmarks(1000000, 10, threadCount)
  run_benchmarks(10, 1000000, threadCount)
}

Output:

> source("script.r")

Running benchmarks with 1 thread, 1000000 rows, and 10 columns:
    forder GForce_sum subsetting  frollmean  fcoalesce    fifelse    between 
43531.4184  2456.6052 10675.1655  4447.9154  1287.4682  2910.9895  5561.7506 
    nafill         CJ 
  933.5532   688.1890 

Running benchmarks with 1 thread, 10 rows, and 1000000 columns:
      forder   GForce_sum   subsetting    frollmean    fcoalesce      fifelse 
  72665.2754   76368.2697 1492410.9045      77.5700      42.7188      38.0600 
     between       nafill           CJ 
1399204.1769      55.4876     910.6031 

Running benchmarks with 2 threads, 1000000 rows, and 10 columns:
    forder GForce_sum subsetting  frollmean  fcoalesce    fifelse    between 
40045.5511  2851.9353  7059.6484  4389.8029  7384.8115  3230.8931  6061.1689 
    nafill         CJ 
  967.9503 13262.2634 

Running benchmarks with 2 threads, 10 rows, and 1000000 columns:
      forder   GForce_sum   subsetting    frollmean    fcoalesce      fifelse 
  74896.4874   86001.0177 1400344.0136     102.1589      36.1362      40.1211 
     between       nafill           CJ 
1411831.0238      76.1031    1367.5597

Output screenshot

Based on the results I'm observing, it seems that better speedups can be expected when the input data has a larger number of:

Rows, when using forder(), GForce functions (such as sum() here, or mean()), subset(), between() (and also fread() and fwrite(), which I did not test here since that has already been done)
Columns, when using frollmean(), fcoalesce(), fifelse(), nafill(), CJ()
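One way to make the row vs. column comparison more direct is to compute the 2-thread speedup per routine for each input shape, with the time unit fixed explicitly. The sketch below is an illustration only: `mean_ms` and `speedup_for` are hypothetical helpers, and the sizes are much smaller than in the benchmark so that it runs quickly.

```r
library(data.table)
library(microbenchmark)

# Hypothetical helper: mean time in milliseconds of fun(dt) at a given thread count.
mean_ms <- function(fun, dt, threadCount, times = 5) {
  previous <- setDTthreads(threadCount)
  on.exit(setDTthreads(previous), add = TRUE)
  bm <- microbenchmark(fun(dt), times = times)
  # Fixing the unit keeps the values comparable across runs.
  summary(bm, unit = "ms")$mean
}

# Hypothetical helper: speedup of 2 threads over 1 thread for one routine and one shape.
speedup_for <- function(fun, rowCount, colCount) {
  dt <- data.table(matrix(runif(rowCount * colCount), nrow = rowCount, ncol = colCount))
  mean_ms(fun, dt, 1) / mean_ms(fun, dt, 2)
}

# Example routine; any of the benchmarked expressions could be wrapped the same way.
routine <- function(dt) frollmean(dt[[1]], 10)

shapes <- c(
  row_heavy    = speedup_for(routine, 100000, 10),
  column_heavy = speedup_for(routine, 10, 100000)
)
print(round(shapes, 2))
```

A value above 1 means the second thread helped for that shape; on a machine where data.table is limited to one thread, both values should sit near 1.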

Since the test code I wrote for fifelse and subset was based on row conditions and geared toward row-intensive operations, I also tried some column-intensive ops.

fifelse was still significantly faster for a large number of rows vs a large number of columns.
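The column-intensive fifelse variant is not shown in this thread. One possible form, assuming it means applying fifelse() to every column rather than only to the first one, could look like this (the 10 x 1000 size is an arbitrary small example):

```r
library(data.table)

dt <- data.table(matrix(runif(10 * 1000), nrow = 10, ncol = 1000))

# Assumed column-intensive variant: apply fifelse() to each column through .SD
# instead of only to dt[[1]].
result <- dt[, lapply(.SD, function(x) fifelse(x > 0.5, x, 0))]

print(dim(result))
```

Here the per-column calls are dispatched one after another by lapply(), so the loop over columns itself runs sequentially in R.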

subset, though, produced nearly the same results in both cases, although having more rows was still slightly faster than having more columns. For reference:

library(data.table)
library(microbenchmark)

run_benchmarks <- function(rowCount, colCount, threadCount) {

  setDTthreads(threadCount)
  dt <- data.table(matrix(runif(rowCount * colCount), nrow = rowCount, ncol = colCount))
  threadLabel <- ifelse(threadCount == 1, "thread", "threads")
  cat(sprintf("\nRunning benchmarks with %d %s, %d rows, and %d columns:\n", getDTthreads(), threadLabel, rowCount, colCount))

  benchmarks <- microbenchmark(
    forder = setorder(dt, V1),
    GForce_sum = dt[, .(sum(V1))],
    subsetting = dt[dt[[1]] > 0.5, ],
    frollmean = frollmean(dt[[1]], 10),
    fcoalesce = fcoalesce(dt[[1]], dt[[2]]),
    fifelse = fifelse(dt[[1]] > 0.5, dt[[1]], 0),
    between = dt[dt[[1]] %between% c(0.4, 0.6)],
    nafill = nafill(dt[[1]], type = "const", fill = 0),
    # Column-intensive subset: select up to the first 1000 columns through .SDcols.
    subsetting_column_intensive = dt[, .SD, .SDcols = 1:min(1000, colCount)],
    CJ = CJ(sample(rowCount, size = min(rowCount, 5)), sample(colCount, size = min(colCount, 5))),
    times = 10
  )

  benchmark_summary <- summary(benchmarks)
  meanTime <- benchmark_summary$mean
  names(meanTime) <- benchmark_summary$expr
  print(meanTime)
}

for(threadCount in c(1, 2)) {
  run_benchmarks(1000000, 10, threadCount)
  run_benchmarks(10, 1000000, threadCount)
}

Output:

> source("script.r")

Running benchmarks with 1 thread, 1000000 rows, and 10 columns:
                     forder                  GForce_sum 
                 46112.2772                   2625.1087 
                 subsetting                   frollmean 
                 12797.6722                   5135.8454 
                  fcoalesce                     fifelse 
                  1863.6185                   4253.7439 
                    between                      nafill 
                 14006.8108                   1623.8215 
subsetting_column_intensive                          CJ 
                 12330.3476                    830.7561 

Running benchmarks with 1 thread, 10 rows, and 1000000 columns:
                     forder                  GForce_sum 
                 68828.3094                  71770.7628 
                 subsetting                   frollmean 
               1544882.5921                     99.4936 
                  fcoalesce                     fifelse 
                    29.5189                     32.2261 
                    between                      nafill 
               1585512.0459                     68.5196 
subsetting_column_intensive                          CJ 
                 39634.0007                    919.7032 

Running benchmarks with 2 threads, 1000000 rows, and 10 columns:
                     forder                  GForce_sum 
                  39905.245                    3047.815 
                 subsetting                   frollmean 
                   6981.574                   10511.764 
                  fcoalesce                     fifelse 
                  13535.083                    2902.409 
                    between                      nafill 
                   5513.657                    1069.926 
subsetting_column_intensive                          CJ 
                  32410.043                    1396.480 

Running benchmarks with 2 threads, 10 rows, and 1000000 columns:
                     forder                  GForce_sum 
                 69553.1633                  76141.1931 
                 subsetting                   frollmean 
               1421410.7526                    108.5046 
                  fcoalesce                     fifelse 
                    31.3867                     50.6045 
                    between                      nafill 
               1400176.1321                     99.5726 
subsetting_column_intensive                          CJ 
                 38302.6997                    770.0899

Anirban166 / Autocomment-atime-results

A benchmark comparing parallelization gains in `data.table` routines for inputs with more rows vs. more columns #32