Anirban166 / data.table.threads

Other
2 stars 1 forks source link

Creating a speedup plot for different data.table functions that are parallelizable #2

Closed Anirban166 closed 4 months ago

Anirban166 commented 6 months ago

My code so far:

library(data.table)
library(microbenchmark)

run_benchmarks <- function(rowCount, colCount, threadCount) {

  setDTthreads(threadCount)
  dt <- data.table(matrix(runif(rowCount * colCount), nrow = rowCount, ncol = colCount))
  threadLabel <- ifelse(threadCount == 1, "thread", "threads")
  cat(sprintf("Running benchmarks with %d %s, %d rows, and %d columns.\n", getDTthreads(), threadLabel, rowCount, colCount))

  benchmarks <- microbenchmark(
    forder = setorder(dt, V1),
    GForce_sum = dt[, .(sum(V1))],
    subsetting = dt[dt[[1]] > 0.5, ],
    frollmean = frollmean(dt[[1]], 10),
    fcoalesce = fcoalesce(dt[[1]], dt[[2]]),
    between = dt[dt[[1]] %between% c(0.4, 0.6)],    
    fifelse = fifelse(dt[[1]] > 0.5, dt[[1]], 0),
    nafill = nafill(dt[[1]], type = "const", fill = 0),
    CJ = CJ(sample(rowCount, size = min(rowCount, 5)), sample(colCount, size = min(colCount, 5))),
    times = 100
  )

  benchmark_summary <- summary(benchmarks)
  meanTime <- benchmark_summary$mean
  names(meanTime) <- benchmark_summary$expr

  return(data.frame(threadCount = threadCount, expr = names(meanTime), meanTime = meanTime))
}

find_optimal_threads <- function(rowCount, colCount) {

  setDTthreads(0)
  maxThreads <- getDTthreads()
  results <- list()

  for (threadCount in 1:maxThreads) {
    results[[threadCount]] <- run_benchmarks(rowCount, colCount, threadCount)
  }

  return(do.call(rbind, results))
}

benchmarkData <- find_optimal_threads(10000000, 10)
rownames(benchmarkData) <- NULL
benchmarkData$speedup <- benchmarkData$meanTime[benchmarkData$threadCount == 1] / benchmarkData$meanTime
idealSpeedup <- seq(1, getDTthreads())

library(ggplot2)
library(gridExtra)

plots <- lapply(unique(benchmarkData$expr), function(func) {
  data <- benchmarkData[benchmarkData$expr == func, ]
  ggplot(data, aes(x = threadCount, y = speedup)) +
    geom_line() +
    geom_line(aes(x = idealSpeedup, y = idealSpeedup), linetype = "dashed", color = "red") +
    labs(title = paste(func), x = "Threads", y = "Speedup") +
    theme(plot.title = element_text(hjust = 0.5)) +
    scale_x_continuous(breaks = 1:getDTthreads(), labels = 1:getDTthreads())  # To avoid the default numbering that includes 2.5 and 7.5 (half a thread doesn't make sense)
})

grid.arrange(grobs = plots)
image
Benchmarked data used to generate the above ```r > benchmarkData threadCount expr meanTime speedup 1 1 forder 240.357339 1.0000000 2 1 GForce_sum 15.796816 1.0000000 3 1 subsetting 79.248770 1.0000000 4 1 frollmean 24.982181 1.0000000 5 1 fcoalesce 14.946403 1.0000000 6 1 between 47.963492 1.0000000 7 1 fifelse 33.290711 1.0000000 8 1 nafill 8.625862 1.0000000 9 1 CJ 4.210141 1.0000000 10 2 forder 149.965999 1.6027456 11 2 GForce_sum 15.837577 0.9974263 12 2 subsetting 72.494060 1.0931760 13 2 frollmean 25.058196 0.9969665 14 2 fcoalesce 9.449926 1.5816424 15 2 between 37.051850 1.2944966 16 2 fifelse 26.154752 1.2728361 17 2 nafill 9.650479 0.8938274 18 2 CJ 5.201596 0.8093941 19 3 forder 123.404270 1.9477230 20 3 GForce_sum 15.772685 1.0015299 21 3 subsetting 63.440701 1.2491787 22 3 frollmean 25.129612 0.9941332 23 3 fcoalesce 10.366038 1.4418626 24 3 between 31.575813 1.5189947 25 3 fifelse 23.750762 1.4016691 26 3 nafill 8.520967 1.0123103 27 3 CJ 4.402384 0.9563319 28 4 forder 106.497652 2.2569262 29 4 GForce_sum 15.738023 1.0037357 30 4 subsetting 60.926726 1.3007226 31 4 frollmean 24.438927 1.0222290 32 4 fcoalesce 9.059327 1.6498359 33 4 between 30.030850 1.5971407 34 4 fifelse 23.168890 1.4368712 35 4 nafill 8.271142 1.0428866 36 4 CJ 3.902546 1.0788191 37 5 forder 97.492574 2.4653912 38 5 GForce_sum 15.749254 1.0030200 39 5 subsetting 59.008015 1.3430171 40 5 frollmean 24.428953 1.0226464 41 5 fcoalesce 9.449219 1.5817606 42 5 between 26.733613 1.7941268 43 5 fifelse 22.250689 1.4961654 44 5 nafill 8.682627 0.9934623 45 5 CJ 4.796245 0.8777994 46 6 forder 94.746236 2.5368537 47 6 GForce_sum 15.791068 1.0003640 48 6 subsetting 56.711824 1.3973941 49 6 frollmean 24.748496 1.0094424 50 6 fcoalesce 9.625439 1.5528022 51 6 between 27.277986 1.7583224 52 6 fifelse 22.193039 1.5000519 53 6 nafill 9.374746 0.9201169 54 6 CJ 4.361114 0.9653819 55 7 forder 91.377914 2.6303658 56 7 GForce_sum 15.774722 1.0014006 57 7 subsetting 58.802314 1.3477152 58 7 frollmean 24.836142 1.0058801 59 7 fcoalesce 9.519909 1.5700153 60 7 between 25.251394 1.8994394 61 7 fifelse 21.636933 1.5386058 62 7 nafill 9.810404 0.8792566 63 7 CJ 4.564388 0.9223889 64 8 forder 91.423351 2.6290585 65 8 GForce_sum 15.797678 0.9999454 66 8 subsetting 57.267878 1.3838258 67 8 frollmean 24.998616 0.9993426 68 8 fcoalesce 9.373948 1.5944619 69 8 between 27.136203 1.7675093 70 8 fifelse 21.363143 1.5583246 71 8 nafill 10.077464 0.8559557 72 8 CJ 5.699883 0.7386363 73 9 forder 87.352598 2.7515763 74 9 GForce_sum 15.815176 0.9988391 75 9 subsetting 62.297943 1.2720929 76 9 frollmean 24.898455 1.0033627 77 9 fcoalesce 9.577203 1.5606230 78 9 between 26.234959 1.8282282 79 9 fifelse 22.128786 1.5044075 80 9 nafill 9.492020 0.9087489 81 9 CJ 4.052693 1.0388501 82 10 forder 91.192193 2.6357228 83 10 GForce_sum 15.878428 0.9948602 84 10 subsetting 64.146062 1.2354425 85 10 frollmean 25.933211 0.9633277 86 10 fcoalesce 9.496062 1.5739580 87 10 between 27.534979 1.7419113 88 10 fifelse 22.214027 1.4986347 89 10 nafill 9.303581 0.9271551 90 10 CJ 4.239453 0.9930859 ```
tdhock commented 6 months ago

looks good can you please add dot(s) for the max speedup / min time? (is that the same?)

tdhock commented 6 months ago

also please use facet_grid isntead of grid.arrange (which repeats axes, pontentially confusing)

tdhock commented 6 months ago

also please commit this code instead of just pasting it in the issue

Anirban166 commented 6 months ago

looks good can you please add dot(s) for the max speedup / min time? (is that the same?)

also please use facet_grid isntead of grid.arrange (which repeats axes, pontentially confusing)

Done! (please check and yup, the points representing maximum speedup and minimum runtime would be the same here)

library(ggplot2)
library(data.table)
library(microbenchmark)

run_benchmarks <- function(rowCount, colCount, threadCount) {

  setDTthreads(threadCount)
  dt <- data.table(matrix(runif(rowCount * colCount), nrow = rowCount, ncol = colCount))
  threadLabel <- ifelse(threadCount == 1, "thread", "threads")
  cat(sprintf("Running benchmarks with %d %s, %d rows, and %d columns.\n", getDTthreads(), threadLabel, rowCount, colCount))

  benchmarks <- microbenchmark(
    forder = setorder(dt, V1),
    GForce_sum = dt[, .(sum(V1))],
    subsetting = dt[dt[[1]] > 0.5, ],
    frollmean = frollmean(dt[[1]], 10),
    fcoalesce = fcoalesce(dt[[1]], dt[[2]]),
    between = dt[dt[[1]] %between% c(0.4, 0.6)],    
    fifelse = fifelse(dt[[1]] > 0.5, dt[[1]], 0),
    nafill = nafill(dt[[1]], type = "const", fill = 0),
    CJ = CJ(sample(rowCount, size = min(rowCount, 5)), sample(colCount, size = min(colCount, 5))),
    times = 100
  )

  benchmark_summary <- summary(benchmarks)
  meanTime <- benchmark_summary$mean
  names(meanTime) <- benchmark_summary$expr

  return(data.frame(threadCount = threadCount, expr = names(meanTime), meanTime = meanTime))
}

find_optimal_threads <- function(rowCount, colCount) {

  setDTthreads(0)
  maxThreads <- getDTthreads()
  results <- list()

  for (threadCount in 1:maxThreads) {
    results[[threadCount]] <- run_benchmarks(rowCount, colCount, threadCount)
  }

  return(do.call(rbind, results))
}

benchmarkData <- find_optimal_threads(10000000, 10)

rownames(benchmarkData) <- NULL
benchmarkData$speedup <- benchmarkData$meanTime[benchmarkData$threadCount == 1] / benchmarkData$meanTime

idealSpeedup <- seq(1, getDTthreads())
setDT(benchmarkData)
maxSpeedup <- benchmarkData[, .(threadCount = threadCount[which.max(speedup)], speedup = max(speedup)), by = expr]

# Alternatively, I could also calculate the minimum runtime (mean value across the hundred runs) for each routine:
# minMeanTime <- benchmarkData[, .(threadCount = threadCount[which.min(meanTime)], speedup = min(meanTime)), by = expr]
# (Using this instead of maxSpeedup below gives me the same points)

ggplot(benchmarkData, aes(x = threadCount, y = speedup, color = expr)) +
  geom_line() +
  geom_line(data = data.frame(threadCount = 1:getDTthreads(), speedup = idealSpeedup), aes(x = threadCount, y = speedup), linetype = "dashed", color = "red") +
  geom_point(data = maxSpeedup, aes(x = threadCount, y = speedup), color = "black", size = 2) +
  facet_grid(. ~ expr, scales = "free_y") +
  labs(x = "Threads", y = "Speedup", title = "data.table functions") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_x_continuous(breaks = 1:getDTthreads(), labels = 1:getDTthreads())
image
Data ```r > benchmarkData threadCount expr meanTime speedup 1: 1 forder 239.354839 1.0000000 2: 1 GForce_sum 15.737729 1.0000000 3: 1 subsetting 78.606425 1.0000000 4: 1 frollmean 24.798036 1.0000000 5: 1 fcoalesce 11.695457 1.0000000 6: 1 between 45.800603 1.0000000 7: 1 fifelse 34.283160 1.0000000 8: 1 nafill 8.230944 1.0000000 9: 1 CJ 4.473289 1.0000000 10: 2 forder 149.333112 1.6028250 11: 2 GForce_sum 15.724584 1.0008360 12: 2 subsetting 68.551312 1.1466801 13: 2 frollmean 26.394795 0.9395048 14: 2 fcoalesce 9.929182 1.1778873 15: 2 between 38.596730 1.1866446 16: 2 fifelse 26.295190 1.3037807 17: 2 nafill 9.160232 0.8985518 18: 2 CJ 4.410508 1.0142344 19: 3 forder 121.860596 1.9641693 20: 3 GForce_sum 15.706530 1.0019864 21: 3 subsetting 66.665312 1.1791203 22: 3 frollmean 25.381558 0.9770100 23: 3 fcoalesce 8.849993 1.3215216 24: 3 between 31.919748 1.4348673 25: 3 fifelse 23.576877 1.4541010 26: 3 nafill 8.974718 0.9171256 27: 3 CJ 5.169210 0.8653718 28: 4 forder 107.480455 2.2269615 29: 4 GForce_sum 15.745223 0.9995240 30: 4 subsetting 59.856504 1.3132478 31: 4 frollmean 25.548604 0.9706220 32: 4 fcoalesce 8.969025 1.3039831 33: 4 between 28.477646 1.6083002 34: 4 fifelse 23.177296 1.4791699 35: 4 nafill 8.412125 0.9784619 36: 4 CJ 4.163007 1.0745330 37: 5 forder 97.774584 2.4480272 38: 5 GForce_sum 15.723424 1.0009098 39: 5 subsetting 61.913897 1.2696087 40: 5 frollmean 24.628558 1.0068814 41: 5 fcoalesce 8.747967 1.3369342 42: 5 between 28.814913 1.5894757 43: 5 fifelse 21.952430 1.5617023 44: 5 nafill 9.890113 0.8322395 45: 5 CJ 4.020077 1.1127369 46: 6 forder 92.455106 2.5888764 47: 6 GForce_sum 15.723352 1.0009144 48: 6 subsetting 61.632646 1.2754024 49: 6 frollmean 25.524956 0.9715212 50: 6 fcoalesce 9.488571 1.2325836 51: 6 between 24.167005 1.8951708 52: 6 fifelse 22.566640 1.5191965 53: 6 nafill 8.950767 0.9195796 54: 6 CJ 5.056241 0.8847063 55: 7 forder 90.728102 2.6381555 56: 7 GForce_sum 15.732582 1.0003272 57: 7 subsetting 59.908737 1.3121028 58: 7 frollmean 25.045884 0.9901042 59: 7 fcoalesce 8.278645 1.4127260 60: 7 between 26.287552 1.7422924 61: 7 fifelse 22.492661 1.5241931 62: 7 nafill 9.668969 0.8512741 63: 7 CJ 4.182907 1.0694210 64: 8 forder 90.386785 2.6481176 65: 8 GForce_sum 15.734707 1.0001921 66: 8 subsetting 59.812373 1.3142168 67: 8 frollmean 25.372794 0.9773475 68: 8 fcoalesce 9.480982 1.2335702 69: 8 between 25.410581 1.8024225 70: 8 fifelse 21.459026 1.5976103 71: 8 nafill 8.941098 0.9205742 72: 8 CJ 3.896587 1.1480018 73: 9 forder 92.025344 2.6009665 74: 9 GForce_sum 15.805527 0.9957105 75: 9 subsetting 61.956504 1.2687356 76: 9 frollmean 26.530506 0.9346989 77: 9 fcoalesce 8.899893 1.3141120 78: 9 between 26.942583 1.6999336 79: 9 fifelse 21.213104 1.6161313 80: 9 nafill 9.877380 0.8333125 81: 9 CJ 4.671396 0.9575915 82: 10 forder 89.450967 2.6758217 83: 10 GForce_sum 15.736375 1.0000861 84: 10 subsetting 60.694726 1.2951113 85: 10 frollmean 24.753337 1.0018058 86: 10 fcoalesce 9.931642 1.1775955 87: 10 between 28.335002 1.6163967 88: 10 fifelse 21.519467 1.5931231 89: 10 nafill 8.520536 0.9660124 90: 10 CJ 4.379882 1.0213263 ```

also please commit this code instead of just pasting it in the issue

Done as well but please note that it's not ready yet as a package - I was holding off committing till I fix the issues that are popping up when using devtools::load_all(), but for now I've updated the code (which works fine standalone) and uploaded some R package files/essentials.

tdhock commented 6 months ago

great improvement. color legend is redudunant with panel, right? please remove because redundantly encoded information (data variable with more than one different visual property) is potentially confusing. please add linetype legend so that there are two values, ideal and measured. also please try coord_equal maybe try facet_wrap

looks like speedups are far from ideal so picking the max speedup (black dot) is not very efficient. in addition to black dot, can you please plot a geom_text next to it which tells us how many threads achieved the max?

maybe take a line of slope 0.5 (or user-defined), add that to the plot as a third linetype, and then add another point/text which is the max speedup that is above that line, which we could recommend as a number of threads to use for efficient speedups

Anirban166 commented 6 months ago

great improvement. color legend is redudunant with panel, right? please remove because redundantly encoded information (data variable with more than one different visual property) is potentially confusing. please add linetype legend so that there are two values, ideal and measured. also please try coord_equal maybe try facet_wrap

looks like speedups are far from ideal so picking the max speedup (black dot) is not very efficient. in addition to black dot, can you please plot a geom_text next to it which tells us how many threads achieved the max?

You're right, and done!

maybe take a line of slope 0.5 (or user-defined), add that to the plot as a third linetype, and then add another point/text which is the max speedup that is above that line, which we could recommend as a number of threads to use for efficient speedups

I'm a bit confused here so let's follow up on this tomorrow morning

tdhock commented 5 months ago

can you please have the dot be the max subject to the constraint that it is above the 0.5 slope line?

Anirban166 commented 5 months ago

can you please have the dot be the max subject to the constraint that it is above the 0.5 slope line?

Done! (please check)

It took some time to accurately code the logic for that but I went for extracting the points where 'Measured Speedup' and 'Sub-optimal speedup' lines are visually closest (least deviation on the y-axis/speedup) or intersect (previously used their absolute values since it's not a perfect number match), and then I got the ones where speedup is maximized among them.

Note that the 'Sub-optimal Speedup' line that I'm using is indeed a 0.5 slope line, but it stretches from (1, 1) to (threadCount, threadCount/2) times to match with the plot (just like how thread values always start from 1 and not 0). It is still 0.5 in the sense that for every 1 unit increase in the x-axis the y-axis increases by 0.5 units (got the values through interpolation after extending the geometry to have threadCount number of points instead of just drawing a line between two points).

Here's how the plot looks like from an example run:

benchmarkData <- findOptimalThreadCount(1e7, 10)
plot(benchmarkData)
image

and then add another point/text which is the max speedup that is above that line, which we could recommend as a number of threads to use for efficient speedups

And I'm keeping two dots for what you said above (added a legend to distinguish between them)

Anirban166 commented 5 months ago

One thing I've been pondering since yesterday is if it would be practical to have the user enter the total size of their data or the product of the number of rows and columns instead of separate arguments for both - It would be less ideal in terms of input flexibility, but the benefit in doing so could be that I can then allocate more number of columns for the functions that perform better in terms of the parallel scaling across more columns, and likewise do the same for ones that benefit with more rows in the data.

For instance, functions such as forder do better with increased thread count when the data has more rows (as we are observing), but functions such as frollmean would do better with higher thread count when there are more number of columns in the data (an observation stemming from my past benchmarks).

I'm thinking of a simple manual allocation like I'm currently testing with - like 10 of one parameter, and the other one being the total data size divided by 10 (e.g.: For a data size of 1e7, we can perform benchmarks on 10 rows with 1e6 columns and 1e6 rows with 10 columns for corresponding functions that benefit accordingly from this distribution).

@tdhock what do you think? (also, any thoughts on the allocation of rows and columns if this sounds good? I'm wondering if it would make sense for the user to enter a set of row and column values too)

tdhock commented 5 months ago

does ribbon show mean+/-SD? need a better name for "sub-optimal speedup" maybe "recommended speedup"