Open engineerchange opened 4 years ago
Note the code snippet in 1.6.3.1 doesn't show the microbenchmark output with the cld
column.
I suggest we switch to using the bench package. Sound reasonable @engineerchange and @csgillespie ?
@Robinlovelace Switching to bench does involve a fair bit of work, as it's used throughout the book. Also, not entirely convinced that it's "better". Thoughts?
Thoughts?
Bench tells you about memory use which is important if not vital when fixing critical performance issues. I'm a fan. No need to rush it, interested to hear what others have to say. I know @symbolixAU, for example, has not switched to bench.
I think the +s outweigh the -s, that this is an important but not priority issue. So suggest keeping it open for now.
I (speaking as @SymbolixAU and myself) dislike bench because the default implementation mixes units in columns
library(bench)
foo <- function() Sys.sleep(5)
bar <- function() Sys.sleep(0.1)
bench::mark(foo(), bar())
# # A tibble: 2 x 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
# 1 foo() 5s 5s 0.200 0B 0 1 0 5s <NULL> <df[,3] [0 × 3]> <bch:tm> <tibble [1 × 3]>
# 2 bar() 100ms 100ms 9.92 0B 0 5 0 504ms <NULL> <df[,3] [0 × 3]> <bch:tm> <tibble [5 × 3]>
#
Another example is here
I'm but a practitioner, but I agree with @dcooley on the above. Mixing units by default is an odd behaviour, and given that my issue got closed immediately, it doesn't appear like the maintainer(s) will change this behavior in the near future.
Perhaps the compromise is to suggest bench as a benchmarking alternative in Chapter 1 when microbenchmark is introduced; I do think memory use is an important feature that shouldn't be overlooked.
That would be quicker also than changing every instance of benchmarking. It is just a question of setting an argument so if we did switch to the bench
in the longer term we could set the time_unit
every time, e.g. with:
library(bench)
foo <- function() Sys.sleep(5)
bar <- function() Sys.sleep(0.1)
bench::mark(foo(), bar(), time_unit = "s")
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <dbl> <dbl> <dbl> <bch:byt> <dbl>
#> 1 foo() 5.01 5.01 0.199 2.1KB 0
#> 2 bar() 0.100 0.100 9.97 0B 0
Created on 2020-05-11 by the reprex package (v0.3.0)
It's great to be discussing this, has already made me learn something new, I will use this argument in my use of bench
going forward.
This doesn't seems to be mentioned so far, but memory tracking in bench is not reliable. It only tracks memory allocated with R's functions (including R's C functions of course). Code is R packages don't have to use those, in fact if you need to allocate within parallel region in your C code you must not use those. That means it is not possible to measure memory with Rprof reliably, which bench use. AFAIR it uses now something more(/else?), not sure how reliable it is. When I had to measure memory usage I used cgmemtime which seems pretty accurate, but is not user friendly. As for mixing time units by default, it can lead to really not easy to spot cases:
expression min
<bch:expr> <bch:tm>
1 CJ.dt(DT1, DT2, DT3) 1.52m
2 CJ.dt_1(DT1, DT2, DT3) 1.53s
3 CJ.dt_2(DT1, DT2, DT3) 1.27s
4 CJDT(DT1, DT2, DT3) 1.07s
another pluses for microbenchmark are that it is very lightweight and its API and codebase is mature and stable.
Thanks @jangorecki those are good reasons for sticking with the status quo!
On a related note, I think with dplyr
1.0.0 and recent improvements in data.table
it's time to refactor https://csgillespie.github.io/efficientR/data-carpentry.html. Any suggestions on that are very welcome @jangorecki !
@Robinlovelace all looks to be up to date. I would only mention how base R changing with square bracket can be used with data.table due to its local scope.
Thanks @jangorecki, going back to the original issue, the solutions would seem to be keep the cld column or remove it in every benchmark. Which do you recommend?
The code snippet in 1.6.2 for microbenchmark shows
cld
as a column in the output.
I would remove it, to keep things more simple.
The code snippet in 1.6.2 for microbenchmark shows
cld
as a column in the output. cld only shows up as a column if multcomp is already installed (see microbenchmark doc).Earlier in the chapter, it is suggested to install microbenchmark, but not this companion package.
Snippet from 1.6.2: