Closed rstub closed 5 months ago
Couldn't help but get interested in this.
From reading the code, it seems that the scalar version does calls the param()
method of the Boost distribution class in every iteration. Maybe this is doing some work?
I would have thought that it would be optimized away by the compiler because everything is known at compile time, but perhaps it's not possible given that the objects are global.
In any case, it doesn't seem like R has any involvement here, dqrng::rexp()
is a pure C++ function AFAICT. The exponential.param()
call seems like the only difference.
I have looked into this a little bit myself, before you opened the issue and have done this. This was I naive approach, but maybe it helps.
#include <Rcpp.h>
// [[Rcpp::depends(dqrng)]]
// [[Rcpp::depends(BH)]]
// [[Rcpp::depends(sitmo)]]
#include <mystdint.h>
#include <Rcpp.h>
#include <dqrng_generator.h>
#include <dqrng_distribution.h>
#include <xoshiro.h>
#include <pcg_random.hpp>
#include <threefry.h>
#include <convert_seed.h>
#include <R_randgen.h>
#include <minimal_int_set.h>
#include <dqrng.h>
// [[Rcpp::export]]
Rcpp::NumericVector rvector(int n) {
Rcpp::NumericVector result = dqrng::dqrexp(n);
return result;
}
// [[Rcpp::export]]
Rcpp::NumericVector rscalar(int n) {
Rcpp::NumericVector result(Rcpp::no_init(n));
for (int i = 0; i < n; ++i)
result[i] = dqrng::rexp();
return result;
}
namespace {
dqrng::rng64_t init() {
Rcpp::RNGScope rngScope;
Rcpp::IntegerVector seed(2, dqrng::R_random_int);
return dqrng::generator(dqrng::convert_seed<uint64_t>(seed));
}
dqrng::rng64_t rng = init();
using generator = double(*)();
dqrng::uniform_distribution uniform{};
generator runif_impl = [] () {return uniform(*rng);};
dqrng::normal_distribution normal{};
generator rnorm_impl = [] () {return normal(*rng);};
dqrng::exponential_distribution exponential{};
generator rexp_impl = [] () {return exponential(*rng);};
}
// [[Rcpp::export]]
Rcpp::NumericVector rscalar_direct(int n) {
Rcpp::NumericVector result(Rcpp::no_init(n));
using parm_t = decltype(exponential)::param_type;
exponential.param(parm_t(1.));
for (int i = 0; i < n; ++i)
result[i] = rexp_impl();
return result;
}
/***R
bench::mark(rvector(1e5),
rscalar(1e5),
rscalar_direct(1e5),
check = FALSE)
*/
# A tibble: 3 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 rvector(1e+05) 653.5µs 1.18ms 918. 784KB 20.9 352 8 384ms <NULL> <df[,3] [2 ×… <bch:t… <tibble [360 …
2 rscalar(1e+05) 16.5ms 16.69ms 54.8 784KB 102. 7 13 128ms <NULL> <df[,3] [2 ×… <bch:t… <tibble [20 ×…
3 rscalar_direct(1e+05) 691.9µs 730.36µs 1334. 784KB 32.7 490 12 367ms <NULL> <df[,3] [2 ×… <bch:t… <tibble [502 …
Somewhere in my naive approach, I must have missed something, because the last function does not react to dqset.seed
, but I am not entirely sure what it is.
OK, I have found my error. The rng
object is not exposed and therefore not reachable by R
. I have written a second function dqset_seed2
with which everything works as expected.
#include <Rcpp.h>
// [[Rcpp::depends(dqrng)]]
// [[Rcpp::depends(BH)]]
// [[Rcpp::depends(sitmo)]]
#include <mystdint.h>
#include <Rcpp.h>
#include <dqrng_generator.h>
#include <dqrng_distribution.h>
#include <xoshiro.h>
#include <pcg_random.hpp>
#include <threefry.h>
#include <convert_seed.h>
#include <R_randgen.h>
#include <minimal_int_set.h>
#include <dqrng.h>
// [[Rcpp::export(rng = false)]]
Rcpp::NumericVector rvector(int n) {
Rcpp::NumericVector result = dqrng::dqrexp(n);
return result;
}
// [[Rcpp::export(rng = false)]]
Rcpp::NumericVector rscalar(int n) {
Rcpp::NumericVector result(Rcpp::no_init(n));
for (int i = 0; i < n; ++i)
result[i] = dqrng::rexp();
return result;
}
namespace {
dqrng::rng64_t init() {
Rcpp::RNGScope rngScope;
Rcpp::IntegerVector seed(2, dqrng::R_random_int);
return dqrng::generator(dqrng::convert_seed<uint64_t>(seed));
}
dqrng::rng64_t rng = init();
using generator = double(*)();
dqrng::uniform_distribution uniform{};
generator runif_impl = [] () {return uniform(*rng);};
dqrng::normal_distribution normal{};
generator rnorm_impl = [] () {return normal(*rng);};
dqrng::exponential_distribution exponential{};
generator rexp_impl = [] () {return exponential(*rng);};
}
// [[Rcpp::export(rng = false)]]
void dqset_seed2(Rcpp::IntegerVector seed, Rcpp::Nullable<Rcpp::IntegerVector> stream = R_NilValue) {
uint64_t _seed = dqrng::convert_seed<uint64_t>(seed);
if (stream.isNotNull()) {
uint64_t _stream = dqrng::convert_seed<uint64_t>(stream.as());
rng->seed(_seed, _stream);
} else {
rng->seed(_seed);
}
}
// [[Rcpp::export]]
Rcpp::NumericVector rscalar_direct(int n) {
Rcpp::NumericVector result(Rcpp::no_init(n));
using parm_t = decltype(exponential)::param_type;
exponential.param(parm_t(1.));
for (int i = 0; i < n; ++i)
result[i] = rexp_impl();
return result;
}
// [[Rcpp::export]]
Rcpp::NumericVector rscalar_direct2(int n) {
Rcpp::NumericVector result(Rcpp::no_init(n));
for (int i = 0; i < n; ++i) {
using parm_t = decltype(exponential)::param_type;
exponential.param(parm_t(1.));
result[i] = rexp_impl();
}
return result;
}
/***R
bench::mark(rvector(1e5),
rscalar(1e5),
rscalar_direct(1e5),
rscalar_direct2(1e5),
check = FALSE)
*/
# A tibble: 4 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result
<bch:expr> <bch:t> <bch:t> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list>
1 rvector(1e+05) 645.6µs 691.6µs 1335. 781KB 29.2 503 11 377ms <NULL>
2 rscalar(1e+05) 16.1ms 16.3ms 58.4 781KB 32.9 16 9 274ms <NULL>
3 rscalar_direct(1e+05) 690.1µs 738µs 1338. 784KB 38.2 315 9 235ms <NULL>
4 rscalar_direct2(1e+05) 697µs 744.3µs 1334. 784KB 28.2 521 11 391ms <NULL>
# … with 3 more variables: memory <list>, time <list>, gc <list>
@LTLA I also included a version which calls exponential.param()
in every iteration of the loop, but it does not seem to make a real difference.
Here some profiling experiments using https://github.com/nimble-dev/gprofiler.
gprofiler::profile(for (i in 1:100) rvector(1e7))
results in gp27f93c70d213.pdf. Here most of the time is actually spend in the RNG and the distribution function. I am not sure why it is spending so much time in the lambda function, though.
gprofiler::profile(for (i in 1:100) rscalar(1e7))
results in gp27f9992293e.pdf. The time spend in the RNG, the distribution function and the lambda function is similar (even shorter!) than before. However, a lot of time is spent in converting int
and double
to and from Rcpp::IntegerVector
and Rcpp::NumericVector
via Rcpp::as
and Rcpp::wrap
. It might help to use Rcpp::IntegerVector
and Rcpp::NumericVector
to begin with, even if they are only length-1 vectors.
Now that we have dqrng::random_64bit_accessor
(thanks @hsloot!), there is now no real need for these scalar functions:
#include <Rcpp.h>
// [[Rcpp::depends(dqrng)]]
#include <dqrng.h>
#include <dqrng_distribution.h>
// [[Rcpp::export]]
Rcpp::NumericVector rvector(int n) {
Rcpp::NumericVector result = dqrng::dqrexp(n);
return result;
}
// [[Rcpp::export]]
Rcpp::NumericVector rscalar(int n) {
Rcpp::NumericVector result(Rcpp::no_init(n));
for (int i = 0; i < n; ++i)
result[i] = dqrng::rexp();
return result;
}
// [[Rcpp::export]]
Rcpp::NumericVector rscalar2(int n) {
Rcpp::NumericVector result(Rcpp::no_init(n));
dqrng::random_64bit_accessor rng{};
for (int i = 0; i < n; ++i)
result[i] = rng.variate<dqrng::exponential_distribution>();
return result;
}
/***R
bench::mark(rvector(1),
rscalar(1),
rscalar2(1),
check = FALSE)[,1:6]
*/
Result:
# A tibble: 3 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
1 rvector(1) 1.3µs 1.97µs 451552. 2.49KB 0
2 rscalar(1) 1.1µs 1.67µs 561314. 2.49KB 56.1
3 rscalar2(1) 1.09µs 1.21µs 699482. 2.49KB 0
And for longer vectors:
# A tibble: 3 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
1 rvector(1e+05) 3.75ms 4.14ms 243. 784KB 4.11
2 rscalar(1e+05) 25.24ms 26.01ms 38.1 784KB 22.2
3 rscalar2(1e+05) 2.11ms 2.48ms 408. 784KB 6.21
So I am contemplating to deprecate these scalar functions.
The C++ API provides both scalar and vector functions, e.g.
dqrng::rexp
anddqrng::dqrexp
. Using the scalar function within a loop is quite a bit slower than using the vector function:Is this a consequence of many calls via the R API, or is there some inefficiency in the scalar function?