Investigate performance of scalar functions

The C++ API provides both scalar and vector functions, e.g. dqrng::rexp and dqrng::dqrexp. Using the scalar function within a loop is quite a bit slower than using the vector function:


#include <Rcpp.h>
// [[Rcpp::depends(dqrng)]]
#include <dqrng.h>

// [[Rcpp::export]]
Rcpp::NumericVector rvector(int n) {
  Rcpp::NumericVector result = dqrng::dqrexp(n);
  return result;
}

// [[Rcpp::export]]
Rcpp::NumericVector rscalar(int n) {
  Rcpp::NumericVector result(Rcpp::no_init(n));
  for (int i = 0; i < n; ++i)
    result[i] = dqrng::rexp();
  return result;
}

/***R
bench::mark(rvector(1e5),
            rscalar(1e5),
            check = FALSE)
*/

# A tibble: 2 x 13
  expression         min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result
  <bch:expr>     <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>
1 rvector(1e+05) 924.8µs  1.09ms     677.      784KB     11.8   287     5      424ms <NULL>
2 rscalar(1e+05)  13.9ms 14.47ms      65.0     784KB     39.0    15     9      231ms <NULL>
# … with 3 more variables: memory <list>, time <list>, gc <list>

Is this a consequence of many calls via the R API, or is there some inefficiency in the scalar function?

Couldn't help but get interested in this.

From reading the code, it seems that the scalar version does calls the param() method of the Boost distribution class in every iteration. Maybe this is doing some work?

I would have thought that it would be optimized away by the compiler because everything is known at compile time, but perhaps it's not possible given that the objects are global.

In any case, it doesn't seem like R has any involvement here, dqrng::rexp() is a pure C++ function AFAICT. The exponential.param() call seems like the only difference.

I have looked into this a little bit myself, before you opened the issue and have done this. This was I naive approach, but maybe it helps.

#include <Rcpp.h>
// [[Rcpp::depends(dqrng)]]
// [[Rcpp::depends(BH)]]
// [[Rcpp::depends(sitmo)]]
#include <mystdint.h>
#include <Rcpp.h>
#include <dqrng_generator.h>
#include <dqrng_distribution.h>
#include <xoshiro.h>
#include <pcg_random.hpp>
#include <threefry.h>
#include <convert_seed.h>
#include <R_randgen.h>
#include <minimal_int_set.h>
#include <dqrng.h>

// [[Rcpp::export]]
Rcpp::NumericVector rvector(int n) {
  Rcpp::NumericVector result = dqrng::dqrexp(n);
  return result;
}

// [[Rcpp::export]]
Rcpp::NumericVector rscalar(int n) {
  Rcpp::NumericVector result(Rcpp::no_init(n));
  for (int i = 0; i < n; ++i)
    result[i] = dqrng::rexp();
  return result;
}

namespace {
dqrng::rng64_t init() {
  Rcpp::RNGScope rngScope;
  Rcpp::IntegerVector seed(2, dqrng::R_random_int);
  return dqrng::generator(dqrng::convert_seed<uint64_t>(seed));
}
dqrng::rng64_t rng = init();

using generator = double(*)();
dqrng::uniform_distribution uniform{};
generator runif_impl = [] () {return uniform(*rng);};
dqrng::normal_distribution normal{};
generator rnorm_impl = [] () {return normal(*rng);};
dqrng::exponential_distribution exponential{};
generator rexp_impl = [] () {return exponential(*rng);};
}

// [[Rcpp::export]]
Rcpp::NumericVector rscalar_direct(int n) {
  Rcpp::NumericVector result(Rcpp::no_init(n));
  using parm_t = decltype(exponential)::param_type;
  exponential.param(parm_t(1.));
  for (int i = 0; i < n; ++i)
    result[i] = rexp_impl();
  return result;
}

/***R
bench::mark(rvector(1e5),
            rscalar(1e5),
            rscalar_direct(1e5),
            check = FALSE)
*/

# A tibble: 3 x 13
  expression                 min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory        time    gc            
  <bch:expr>            <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>        <list>  <list>        
1 rvector(1e+05)         653.5µs   1.18ms     918.      784KB     20.9   352     8      384ms <NULL> <df[,3] [2 ×… <bch:t… <tibble [360 …
2 rscalar(1e+05)          16.5ms  16.69ms      54.8     784KB    102.      7    13      128ms <NULL> <df[,3] [2 ×… <bch:t… <tibble [20 ×…
3 rscalar_direct(1e+05)  691.9µs 730.36µs    1334.      784KB     32.7   490    12      367ms <NULL> <df[,3] [2 ×… <bch:t… <tibble [502 …

Somewhere in my naive approach, I must have missed something, because the last function does not react to dqset.seed, but I am not entirely sure what it is.

OK, I have found my error. The rng object is not exposed and therefore not reachable by R. I have written a second function dqset_seed2 with which everything works as expected.

#include <Rcpp.h>
// [[Rcpp::depends(dqrng)]]
// [[Rcpp::depends(BH)]]
// [[Rcpp::depends(sitmo)]]
#include <mystdint.h>
#include <Rcpp.h>
#include <dqrng_generator.h>
#include <dqrng_distribution.h>
#include <xoshiro.h>
#include <pcg_random.hpp>
#include <threefry.h>
#include <convert_seed.h>
#include <R_randgen.h>
#include <minimal_int_set.h>
#include <dqrng.h>

// [[Rcpp::export(rng = false)]]
Rcpp::NumericVector rvector(int n) {
  Rcpp::NumericVector result = dqrng::dqrexp(n);
  return result;
}

// [[Rcpp::export(rng = false)]]
Rcpp::NumericVector rscalar(int n) {
  Rcpp::NumericVector result(Rcpp::no_init(n));
  for (int i = 0; i < n; ++i)
    result[i] = dqrng::rexp();
  return result;
}

namespace {
  dqrng::rng64_t init() {
  Rcpp::RNGScope rngScope;
  Rcpp::IntegerVector seed(2, dqrng::R_random_int);
  return dqrng::generator(dqrng::convert_seed<uint64_t>(seed));
}
  dqrng::rng64_t rng = init();

  using generator = double(*)();
  dqrng::uniform_distribution uniform{};
  generator runif_impl = [] () {return uniform(*rng);};
  dqrng::normal_distribution normal{};
  generator rnorm_impl = [] () {return normal(*rng);};
  dqrng::exponential_distribution exponential{};
  generator rexp_impl = [] () {return exponential(*rng);};
}

// [[Rcpp::export(rng = false)]]
void dqset_seed2(Rcpp::IntegerVector seed, Rcpp::Nullable<Rcpp::IntegerVector> stream = R_NilValue) {
  uint64_t _seed = dqrng::convert_seed<uint64_t>(seed);
  if (stream.isNotNull()) {
    uint64_t _stream = dqrng::convert_seed<uint64_t>(stream.as());
    rng->seed(_seed, _stream);
  } else {
    rng->seed(_seed);
  }
}

// [[Rcpp::export]]
Rcpp::NumericVector rscalar_direct(int n) {
  Rcpp::NumericVector result(Rcpp::no_init(n));
  using parm_t = decltype(exponential)::param_type;
  exponential.param(parm_t(1.));
  for (int i = 0; i < n; ++i)
    result[i] = rexp_impl();
  return result;
}

// [[Rcpp::export]]
Rcpp::NumericVector rscalar_direct2(int n) {
  Rcpp::NumericVector result(Rcpp::no_init(n));
  for (int i = 0; i < n; ++i) {
    using parm_t = decltype(exponential)::param_type;
    exponential.param(parm_t(1.));
    result[i] = rexp_impl();
  }
  return result;
}

/***R
bench::mark(rvector(1e5),
            rscalar(1e5),
            rscalar_direct(1e5),
            rscalar_direct2(1e5),
            check = FALSE)
*/

# A tibble: 4 x 13
  expression                 min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result
  <bch:expr>             <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>
1 rvector(1e+05)         645.6µs 691.6µs    1335.      781KB     29.2   503    11      377ms <NULL>
2 rscalar(1e+05)          16.1ms  16.3ms      58.4     781KB     32.9    16     9      274ms <NULL>
3 rscalar_direct(1e+05)  690.1µs   738µs    1338.      784KB     38.2   315     9      235ms <NULL>
4 rscalar_direct2(1e+05)   697µs 744.3µs    1334.      784KB     28.2   521    11      391ms <NULL>
# … with 3 more variables: memory <list>, time <list>, gc <list>

@LTLA I also included a version which calls exponential.param() in every iteration of the loop, but it does not seem to make a real difference.

Here some profiling experiments using https://github.com/nimble-dev/gprofiler.

gprofiler::profile(for (i in 1:100) rvector(1e7))

results in gp27f93c70d213.pdf. Here most of the time is actually spend in the RNG and the distribution function. I am not sure why it is spending so much time in the lambda function, though.

gprofiler::profile(for (i in 1:100) rscalar(1e7))

results in gp27f9992293e.pdf. The time spend in the RNG, the distribution function and the lambda function is similar (even shorter!) than before. However, a lot of time is spent in converting int and double to and from Rcpp::IntegerVector and Rcpp::NumericVector via Rcpp::as and Rcpp::wrap. It might help to use Rcpp::IntegerVector and Rcpp::NumericVector to begin with, even if they are only length-1 vectors.

Now that we have dqrng::random_64bit_accessor (thanks @hsloot!), there is now no real need for these scalar functions:

#include <Rcpp.h>
// [[Rcpp::depends(dqrng)]]
#include <dqrng.h>
#include <dqrng_distribution.h>

// [[Rcpp::export]]
Rcpp::NumericVector rvector(int n) {
  Rcpp::NumericVector result = dqrng::dqrexp(n);
  return result;
}

// [[Rcpp::export]]
Rcpp::NumericVector rscalar(int n) {
  Rcpp::NumericVector result(Rcpp::no_init(n));
  for (int i = 0; i < n; ++i)
    result[i] = dqrng::rexp();
  return result;
}

// [[Rcpp::export]]
Rcpp::NumericVector rscalar2(int n) {
  Rcpp::NumericVector result(Rcpp::no_init(n));
  dqrng::random_64bit_accessor rng{};
  for (int i = 0; i < n; ++i)
    result[i] = rng.variate<dqrng::exponential_distribution>();
  return result;
}

/***R
bench::mark(rvector(1),
            rscalar(1),
            rscalar2(1),
            check = FALSE)[,1:6]
*/

Result:

# A tibble: 3 × 6
  expression       min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 rvector(1)     1.3µs   1.97µs   451552.    2.49KB      0  
2 rscalar(1)     1.1µs   1.67µs   561314.    2.49KB     56.1
3 rscalar2(1)   1.09µs   1.21µs   699482.    2.49KB      0

And for longer vectors:

# A tibble: 3 × 6
  expression           min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 rvector(1e+05)    3.75ms   4.14ms     243.      784KB     4.11
2 rscalar(1e+05)   25.24ms  26.01ms      38.1     784KB    22.2 
3 rscalar2(1e+05)   2.11ms   2.48ms     408.      784KB     6.21

So I am contemplating to deprecate these scalar functions.

daqana / dqrng

Investigate performance of scalar functions #40