csgillespie / efficientR

Efficient R programming: a book
https://csgillespie.github.io/efficientR/

1.6.2. - microbenchmark's multcomp reference missing #264

Open engineerchange opened 4 years ago

engineerchange commented 4 years ago

The code snippet in 1.6.2 for microbenchmark shows cld as a column in the output. cld only shows up as a column if multcomp is already installed (see the microbenchmark documentation).

Earlier in the chapter, it is suggested to install microbenchmark, but not this companion package.

Snippet from 1.6.2:

library("microbenchmark")
df = data.frame(v = 1:4, name = letters[1:4])
microbenchmark(df[3, 2], df[3, "name"], df$name[3])
# Unit: microseconds
#          expr     min    lq  mean median    uq   max neval cld
#      df[3, 2]   17.99 18.96 20.16  19.38 19.77 35.14   100   b
# df[3, "name"]   17.97 19.13 21.45  19.64 20.15 74.00   100   b
#    df$name[3]   12.48 13.81 15.81  14.48 15.14 67.24   100   a 
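
A minimal sketch of the extra install that would make the snippet print as shown, assuming multcomp is the only missing piece:

# Install microbenchmark together with the companion package that its print
# method uses to compute the cld (statistical ranking) column.
install.packages(c("microbenchmark", "multcomp"))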
engineerchange commented 4 years ago

Note that the code snippet in 1.6.3.1 doesn't show the microbenchmark output with the cld column.

Robinlovelace commented 4 years ago

I suggest we switch to using the bench package. Sound reasonable, @engineerchange and @csgillespie?

csgillespie commented 4 years ago

@Robinlovelace Switching to bench would involve a fair bit of work, as microbenchmark is used throughout the book. Also, I'm not entirely convinced that bench is "better". Thoughts?

Robinlovelace commented 4 years ago

> Thoughts?

Bench tells you about memory use, which is important, if not vital, when fixing critical performance issues. I'm a fan. No need to rush it; I'm interested to hear what others have to say. I know @symbolixAU, for example, has not switched to bench.

I think the pluses outweigh the minuses and that this is an important but not a priority issue, so I suggest keeping it open for now.

dcooley commented 4 years ago

I (speaking as @SymbolixAU and myself) dislike bench because the default implementation mixes units within columns:

library(bench)

foo <- function() Sys.sleep(5)
bar <- function() Sys.sleep(0.1)

bench::mark(foo(), bar())

# # A tibble: 2 x 13
# expression          min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory           time     gc              
# <bch:expr>     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>           <list>   <list>          
#   1 foo()            5s       5s     0.200        0B        0     1     0         5s <NULL> <df[,3] [0 × 3]> <bch:tm> <tibble [1 × 3]>
#   2 bar()         100ms    100ms     9.92         0B        0     5     0      504ms <NULL> <df[,3] [0 × 3]> <bch:tm> <tibble [5 × 3]>

Another example is here

engineerchange commented 4 years ago

I'm but a practitioner, but I agree with @dcooley on the above. Mixing units by default is odd behaviour, and given that my issue about it was closed immediately, it doesn't appear that the maintainer(s) will change this behaviour in the near future.

engineerchange commented 4 years ago

Perhaps the compromise is to suggest bench as a benchmarking alternative in Chapter 1 when microbenchmark is introduced; I do think memory use is an important feature that shouldn't be overlooked.

Robinlovelace commented 4 years ago

That would also be quicker than changing every instance of benchmarking. It is just a question of setting an argument, so if we did switch to bench in the longer term we could set time_unit every time, e.g. with:

library(bench)

foo <- function() Sys.sleep(5)
bar <- function() Sys.sleep(0.1)

bench::mark(foo(), bar(), time_unit = "s")
#> # A tibble: 2 x 6
#>   expression   min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <dbl>  <dbl>     <dbl> <bch:byt>    <dbl>
#> 1 foo()      5.01   5.01      0.199     2.1KB        0
#> 2 bar()      0.100  0.100     9.97         0B        0

Created on 2020-05-11 by the reprex package (v0.3.0)

It's great to be discussing this; it has already taught me something new, and I will use this argument in my own use of bench going forward.

jangorecki commented 4 years ago

This doesn't seem to have been mentioned so far, but memory tracking in bench is not reliable. It only tracks memory allocated with R's functions (including R's C functions, of course). Code in R packages doesn't have to use those; in fact, if you need to allocate within a parallel region in your C code you must not use them. That means it is not possible to measure memory reliably with Rprof, which is what bench uses. AFAIR it now uses something more (or else?); I'm not sure how reliable that is. When I had to measure memory usage I used cgmemtime, which seems pretty accurate but is not user friendly. As for mixing time units by default, it can lead to cases that are really not easy to spot:

  expression                  min
  <bch:expr>             <bch:tm>       
1 CJ.dt(DT1, DT2, DT3)      1.52m
2 CJ.dt_1(DT1, DT2, DT3)    1.53s
3 CJ.dt_2(DT1, DT2, DT3)    1.27s
4 CJDT(DT1, DT2, DT3)       1.07s

Other pluses for microbenchmark are that it is very lightweight and that its API and codebase are mature and stable.

Robinlovelace commented 4 years ago

Thanks @jangorecki, those are good reasons for sticking with the status quo!

Robinlovelace commented 4 years ago

On a related note, I think with dplyr 1.0.0 and recent improvements in data.table it's time to refactor https://csgillespie.github.io/efficientR/data-carpentry.html. Any suggestions on that are very welcome @jangorecki !

jangorecki commented 4 years ago

@Robinlovelace all looks to be up to date. I would only mention how base R-style changing with the square bracket can be used with data.table due to its local scope.
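
A minimal sketch (hypothetical data, and just one reading of the point above) of how columns are in scope as ordinary variables inside data.table's square bracket:

library(data.table)
DT = data.table(v = 1:4, name = letters[1:4])
# Columns can be referenced directly inside DT[...], so familiar
# square-bracket subsetting and updating works within that local scope.
DT[v > 2, name]                    # filter rows and select a column
DT[v > 2, name := toupper(name)]   # update by reference
DT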

Robinlovelace commented 4 years ago

Thanks @jangorecki. Going back to the original issue, the solutions would seem to be either to keep the cld column or to remove it from every benchmark. Which do you recommend?

> The code snippet in 1.6.2 for microbenchmark shows cld as a column in the output.

jangorecki commented 4 years ago

I would remove it, to keep things simpler.
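
If it helps, a minimal sketch (an assumption about the simplest fix, not the book's actual code) of printing machine-independent output by dropping any cld column:

library("microbenchmark")
df = data.frame(v = 1:4, name = letters[1:4])
res = microbenchmark(df[3, 2], df[3, "name"], df$name[3])
# Summarise to a plain data frame and drop the cld column if present, so the
# displayed columns no longer depend on whether multcomp is installed.
out = summary(res)
out$cld = NULL   # a no-op when the column is absent
out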