extendr / rextendr

An R package that helps scaffold extendr-enabled packages or compile Rust code dynamically
https://extendr.github.io/rextendr/

Benchmark Against Rcpp #11

Closed waynelapierre closed 2 years ago

waynelapierre commented 3 years ago

It would be great if there could be a comparison between extendr and Rcpp in terms of performance.

clauswilke commented 3 years ago

I'm not sure that's very useful. The results will depend hugely on what exactly you do. In particular, are you calling back from the compiled language to R or not?

In any case, extendr is following the design principle of safety and convenience over performance, so in many applications Rcpp code may be faster, in particular if you're going back-and-forth frequently between R and Rust. On the other hand, for numerically intensive code that does many operations in compiled code, the results are comparable. (In my example below, Rust is slightly faster, but it's basically a wash.)

library("Rcpp")
library("rextendr")

cppFunction("
  double computation_rcpp(unsigned int n = 1000) {
    double x1 = 0.1;
    double x2 = 1.5;
    double z = 1;
    for (unsigned int i = 1; i <= n; i++) {
      z = x2*z - x1;
      if (z > 10) {
        z = z - 9;
      }
    }
    return z;
  }")

rust_function("fn computation_rust(n:u32) -> f64 {
    let x1 = 0.1;
    let x2 = 1.5;
    let mut z = 1.0;
    let mut i = 1;
    loop {
      z = x2*z - x1;
      if z > 10. {
        z = z - 9.;
      }
      i = i + 1;
      if i > n {
        break;
      }
    }
    z
  }", profile = "release")
#> build directory: /var/folders/b1/13gn4j655jddkfhxmtk5tsfm0000gn/T//RtmpdxqKfc/file17cfd6a138c73

computation_rcpp(100000)
#> [1] 5.552521
computation_rust(100000)
#> [1] 5.552521

microbenchmark::microbenchmark(
  computation_rcpp(10),
  computation_rust(10),
  computation_rcpp(1000),
  computation_rust(1000),
  computation_rcpp(100000),
  computation_rust(100000)
)
#> Unit: nanoseconds
#>                     expr    min       lq      mean   median       uq      max
#>     computation_rcpp(10)   1187   1682.5   2886.35   2509.5   3425.5    10110
#>     computation_rust(10)    926   1092.5 191324.87   1332.5   1645.5 18937742
#>   computation_rcpp(1000)   5005   6550.0  20419.67   7069.5   7758.5  1307997
#>   computation_rust(1000)   4874   5552.0   6780.06   6163.0   6737.5    32385
#>  computation_rcpp(1e+05) 392206 433195.0 476588.15 465732.0 527082.0   561250
#>  computation_rust(1e+05) 394237 425857.5 472266.51 463193.5 525485.5   571115
#>  neval cld
#>    100  a 
#>    100  ab
#>    100  a 
#>    100  a 
#>    100   b
#>    100   b

Created on 2021-01-06 by the reprex package (v0.3.0)

clauswilke commented 3 years ago

However, when calling the function many times but with a small computational workload, Rust is almost twice as fast as Rcpp. I hadn't quite expected this. @andy-thomason Looks like starting a new thread for each Rust function is not a big deal.

> microbenchmark::microbenchmark(
+   for (i in 1:10000) { computation_rcpp(1) },
+   for (i in 1:10000) { computation_rust(1) }
+ )
Unit: milliseconds
                                           expr      min       lq     mean   median
 for (i in 1:10000) {     computation_rcpp(1) } 14.04551 20.24895 24.38897 23.81935
 for (i in 1:10000) {     computation_rust(1) } 10.32471 11.97144 13.01768 12.76082
       uq       max neval cld
 25.79039 158.66692   100   b
 13.75396  17.49498   100  a 

andy-thomason commented 3 years ago

This is good news.

It all depends on the target architecture and how vectorisable the loop is.

Godbolt gives:

example::computation_rust:
        vmovsd  xmm0, qword ptr [rip + .LCPI0_0]
        mov     eax, 1
        vmovsd  xmm1, qword ptr [rip + .LCPI0_1]
        vmovsd  xmm2, qword ptr [rip + .LCPI0_2]
        vmovsd  xmm3, qword ptr [rip + .LCPI0_3]
        vmovsd  xmm4, qword ptr [rip + .LCPI0_4]
.LBB0_1:
        vmulsd  xmm0, xmm0, xmm1
        vaddsd  xmm0, xmm0, xmm2
        vaddsd  xmm5, xmm0, xmm3
        vcmpltsd        xmm6, xmm4, xmm0
        vblendvpd       xmm0, xmm0, xmm5, xmm6
        inc     eax
        cmp     eax, edi
        jbe     .LBB0_1
        ret

Which is not vectorised, but at least the condition has been folded into a vblendvpd.

Most of the problem is the running product which, by IEEE rules, prevents the re-ordering of the accumulation of z.
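That constraint is just IEEE-754 arithmetic being non-associative, so the optimiser may not regroup the serial updates of z; a minimal self-contained demonstration:

```rust
fn main() {
    // IEEE-754 addition is not associative: regrouping changes the
    // rounding, so the compiler cannot legally reorder a dependent
    // chain of updates (absent fast-math-style flags).
    let left = (0.1_f64 + 0.2) + 0.3;
    let right = 0.1_f64 + (0.2 + 0.3);
    assert_ne!(left, right);
    println!("left = {left:.17}, right = {right:.17}");
}
```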

Certainly, LLVM's code generation is streets ahead of GCC's.

I was planning to add fast summation to RobjItertools, which would allow such loops to vectorise. We may have to consider a rayon feature as time goes by.

andy-thomason commented 3 years ago

This is a slightly better version of the loop:

pub fn computation_rust(n:u32) -> f64 {
    let x1 = 0.1;
    let x2 = 1.5;
    let mut z = 1.0;
    for i in 1..n+1 {
      z = x2*z - x1;
      if z > 10. {
        z = z - 9.;
      }
    }
    z
}

It allows the loop to unroll as the trip count is more predictable.
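For the record, the same recurrence can also be written as an iterator fold; this standalone sketch only checks that the two forms compute identical results, and says nothing about codegen:

```rust
// The counted for loop, as in the version above (an underscore avoids
// the unused-variable warning).
fn computation_loop(n: u32) -> f64 {
    let (x1, x2) = (0.1, 1.5);
    let mut z = 1.0;
    for _ in 1..=n {
        z = x2 * z - x1;
        if z > 10.0 {
            z -= 9.0;
        }
    }
    z
}

// The same recurrence expressed as an iterator fold.
fn computation_fold(n: u32) -> f64 {
    let (x1, x2) = (0.1, 1.5);
    (1..=n).fold(1.0_f64, |z, _| {
        let z = x2 * z - x1;
        if z > 10.0 { z - 9.0 } else { z }
    })
}

fn main() {
    // Both forms apply the exact same sequence of float operations,
    // so the results are bitwise identical.
    for n in [0u32, 1, 10, 1000, 100_000] {
        assert_eq!(computation_loop(n), computation_fold(n));
    }
    println!("loop and fold agree");
}
```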

clauswilke commented 3 years ago

I had tried the for loop but on my machine it was slower than the C++ code. :-)

Thell commented 2 years ago

It's been a long while since this post was updated (sorry to necropost), but I'd like to share some findings on the overhead of using Rcpp vs. Rust for returning a basic zero-filled vector and matrix from each...

This gist of an R script compiles the Rust and Rcpp versions as similarly as I could make them, iterating N from 10 to 640 for the vector lengths and from 10x10 to 640x640 for the matrix dimensions.

For example, the simple vector...

IntegerVector rcpp_zeros_intvec(int n) {
    IntegerVector my_vec(n);
    return my_vec;
}
fn rust_zeros_intvec(n: i32) -> Robj {
    let my_vec = vec!(0; n as usize);
    r!(my_vec)
}

It shows a significant negative impact from the extendr handling... The impact is only slightly noticeable at N=10

N =  10 
Unit: microseconds
                        expr   min    lq     mean median    uq      max neval cld
        rcpp_zeros_intvec(N) 1.713 1.923 3.097295  2.104 2.405  739.202  1000   a
        rust_zeros_intvec(N) 1.673 1.913 3.294399  2.054 2.225 1149.976  1000   a
        rcpp_zeros_intmat(N) 1.994 2.254 3.541405  2.455 2.815  752.757  1000   a
        rust_zeros_intmat(N) 3.177 3.697 5.001572  3.988 4.288  811.630  1000   a
 rcpp_zeros_intvec_dimmed(N) 1.984 2.254 3.596928  2.454 2.825  763.548  1000   a

but it doesn't take long for the impact to be very noticeable

N =  640 
Unit: microseconds
                        expr      min        lq       mean    median        uq        max neval cld
        rcpp_zeros_intvec(N)    2.044    4.3385   11.40082    7.3290   16.0310    326.103  1000 a  
        rust_zeros_intvec(N)    2.925    6.6530   21.90912   21.7670   29.2960    171.127  1000 a  
        rcpp_zeros_intmat(N)   42.030  174.7990  749.19897  409.9275  513.1500 196941.654  1000  b 
        rust_zeros_intmat(N) 1227.463 1593.6380 2183.07436 1851.1095 2254.8875  21488.223  1000   c
 rcpp_zeros_intvec_dimmed(N)   39.495  178.1705  534.83684  412.4355  530.6185  15211.649  1000  b 

Whether that's from copies, from handling the owners, or something else, I don't have the knowledge to say, so the folks on Discord recommended posting it here.

clauswilke commented 2 years ago

Which version of extendr is this using? The one currently released to crates.io is completely out of date. Unfortunately the next one is not quite ready yet for release.

In any case, it would be useful to try again with the latest version and using the new wrappers that have been written. In particular in the vector example, I'm quite certain that the rust code makes an additional copy that the C++ code doesn't make, and I think this copy can be avoided with the development version of extendr.

Matrix code probably needs more work, but it's low priority until the basic features are settled.

Thell commented 2 years ago

Which version of extendr is this using?

The latest via GitHub: extendr-api = list(git = "https://github.com/extendr/extendr")

I can see where an additional copy might need to be made, considering the 'safety' aspect of ownership, but would one expect a copy to incur that much expense? Granted, Rcpp is much more mature and has had many iterations of performance improvements over time. I don't yet have the Rust experience to grok what I read without a glossary, so all I can offer is the results of this exceedingly simple test.

clauswilke commented 2 years ago

Ok. In this case, for the numeric vector, you should try something like the following:

fn rust_zeros_intvec(n: i32) -> Doubles {
    let my_vec : Doubles = (0..n).map(|_| (0.).into()).collect();
    my_vec
}

I haven't tested this, so there may be some problem with this code. Example from here: https://github.com/extendr/extendr/blob/1264ad5ae3c491247878470aceed584ba9a87844/extendr-api/src/wrapper/doubles.rs#L40

clauswilke commented 2 years ago

Sorry, that's making a vector of doubles, not integers. But there's comparable code for integers also. See e.g. here: https://github.com/extendr/extendr/blob/1264ad5ae3c491247878470aceed584ba9a87844/extendr-api/src/wrapper/integers.rs#L40

Ilia-Kosenkov commented 2 years ago

Here is my attempt at {cpp11}. Our results are not impressive, to be honest. I believe .into() for every operation creates significant overhead. @andy-thomason ?

cpp11::cpp_function(
    "writable::doubles make_cpp(int n) { 
        writable::doubles x(n);
        for(int i = 0; i < n; i++)
            x[i] = i; 
        return x;
    }"
)

rextendr::rust_function(
    "fn make_rust(n : i32) -> Doubles {
        (0..n).map(|x| (x as f64).into()).collect::<Doubles>()
    }",
    extendr_deps = list(
        `extendr-api` = list(git = "https://github.com/extendr/extendr")
    )
)
#> i build directory: 'C:/Users/[redacted]/AppData/Local/Temp/Rtmpu4wuuD/file613c76cd2e14'
#> v Writing 'C:/Users/[redacted]/AppData/Local/Temp/Rtmpu4wuuD/file613c76cd2e14/target/extendr_wrappers.R'.

c(10, 100, 1000, 10000) |>
    purrr::map_dfr(
        \(n) bench::mark(make_rust(n), make_cpp(n)) |> 
            dplyr::mutate(N = n)
    ) |>
    dplyr::select(expression, N, median, `itr/sec`, dplyr::everything())
#> # A tibble: 8 x 7
#>   expression       N   median `itr/sec`      min mem_alloc `gc/sec`
#>   <bch:expr>   <dbl> <bch:tm>     <dbl> <bch:tm> <bch:byt>    <dbl>
#> 1 make_rust(n)    10    4.2us   227781.    3.9us        0B     0   
#> 2 make_cpp(n)     10    500ns  1541414.    400ns        0B   154.  
#> 3 make_rust(n)   100    9.6us   101023.      9us      848B     0   
#> 4 make_cpp(n)    100    700ns  1191838.    500ns      848B     0   
#> 5 make_rust(n)  1000   61.7us    15951.   57.8us    7.86KB     2.02
#> 6 make_cpp(n)   1000    1.1us   601655.    900ns    7.86KB   120.  
#> 7 make_rust(n) 10000  570.2us     1743.  551.9us   78.17KB     2.01
#> 8 make_cpp(n)  10000    5.5us   172149.      5us   78.17KB   259.

Created on 2021-12-21 by the reprex package (v2.0.1)

Ilia-Kosenkov commented 2 years ago

Ok, so this is a typical case of not paying attention to the docs ('RTFI' was a bad choice of words directed at myself). By default {rextendr} compiles the dev profile, so in my previous message I quoted incorrect results. Here is release.

cpp11::cpp_function(
    "writable::doubles make_cpp(int n) { 
        writable::doubles x(n);
        for(int i = 0; i < n; i++)
            x[i] = i; 
        return x;
    }"
)

rextendr::rust_function(
    "fn make_rust(n : i32) -> Doubles {
        (0..n).map(|x| (x as f64).into()).collect::<Doubles>()
    }",
    extendr_deps = list(
        `extendr-api` = list(git = "https://github.com/extendr/extendr")
    ),
    profile = "release"
)
#> i build directory: 'C:/Users/[redacted]/AppData/Local/Temp/RtmpUH51JB/file4278337c5e69'
#> v Writing 'C:/Users/[redacted]/AppData/Local/Temp/RtmpUH51JB/file4278337c5e69/target/extendr_wrappers.R'.

c(10, 100, 1000, 10000, 100000) |>
    purrr::map_dfr(
        \(n) bench::mark(make_rust(n), make_cpp(n)) |> 
            dplyr::mutate(N = n)
    ) |>
    dplyr::select(expression, N, median, `itr/sec`, dplyr::everything())
#> # A tibble: 10 x 7
#>    expression        N   median `itr/sec`      min mem_alloc `gc/sec`
#>    <bch:expr>    <dbl> <bch:tm>     <dbl> <bch:tm> <bch:byt>    <dbl>
#>  1 make_rust(n)     10    700ns  1248003.    500ns        0B      0  
#>  2 make_cpp(n)      10    500ns  1527708.    400ns        0B    153. 
#>  3 make_rust(n)    100    900ns   973179.    700ns      848B      0  
#>  4 make_cpp(n)     100    700ns  1122990.    500ns      848B      0  
#>  5 make_rust(n)   1000    1.8us   409342.    1.5us    7.86KB     40.9
#>  6 make_cpp(n)    1000      1us   853239.    900ns    7.86KB    171. 
#>  7 make_rust(n)  10000   17.9us    49989.   10.9us   78.17KB     75.1
#>  8 make_cpp(n)   10000    5.6us   158774.      5us   78.17KB    254. 
#>  9 make_rust(n) 100000  112.8us     7039.  106.4us   781.3KB    116. 
#> 10 make_cpp(n)  100000   52.6us    16082.   48.1us   781.3KB    263.

Created on 2021-12-21 by the reprex package (v2.0.1)

Thell commented 2 years ago

Ok so, this is a typical case of RTFI. By default {rextendr} compiles dev version. Here is release.

I added that make_rust to my local copy of the linked gist and it doesn't fare any better or worse than the intvec. And, yes, as shown in that gist, it's set to the release profile. I figured I'd save the reposting... but since that link seems to have been overlooked...

rextendr::rust_source(
  profile = "release",
  extendr_deps = list(
    `extendr-api` = list(git = "https://github.com/extendr/extendr")
  ),
N =  640 
Unit: microseconds
                        expr      min        lq       mean    median        uq       max neval cld
        rcpp_zeros_intvec(N)    2.084    3.9675   10.22898    6.4270   12.4890   501.043  1000 a  
        rust_zeros_intvec(N)    2.845    5.6905   16.50350   12.7440   24.6870   238.985  1000 a  
        rcpp_zeros_intmat(N)   42.460  123.5980  391.89065  364.3910  430.3995 10510.080  1000  b 
        rust_zeros_intmat(N) 1019.979 1570.3000 2054.20790 1779.4945 2081.5370 12317.982  1000   c
 rcpp_zeros_intvec_dimmed(N)   38.522  145.0150  457.99592  377.4905  441.6065 11752.041  1000  b 
                make_rust(N)    3.096    4.8940   15.63082   11.3865   23.9800   113.514  1000 a

Ilia-Kosenkov commented 2 years ago

@Thell, I responded to my (now hidden) message where I ran the dev build of the Rust code, got terrible results, and started to complain about performance before realizing it was my fault all along. I used 'RTFI' towards myself because I was so frustrated that I had made this childish mistake. It was a poor choice of words anyway. I apologize.

Speaking of your gist, here are things that may affect the result

  1. Does IntegerVector my_vec(n) initialize the underlying buffer? When you do vec!(0; n as usize), I suspect you have an extra zeromem() call, at least. Even if R does it itself, in Rust you have an additional operation.
  2. Using r!(...) is not the best approach performance-wise. We have a new transforming API, try_from, which can currently be enabled with #[extendr(use_try_from = true)]. It will be enabled by default in a future release.
  3. As I mentioned above, matrices are sub-optimal right now, and we have not really touched them lately. Could you also try using ndarray to see if there is any difference?

It is possible that we may never reach raw C++ performance simply because we have additional overhead and type-safety, idiomatic Rust NA and NULL handling, and other features. For matrix initialization, I guess this is the cost of calling the initializer function for every element. Perhaps we need to have a default constructor that takes only dimensions.
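A dims-only constructor would sidestep the per-element closure calls; here is a hypothetical standalone sketch (plain Rust, not the extendr API) contrasting the two construction paths:

```rust
// Stand-in matrix type (column-major, like R). Not the extendr API;
// all names here are made up for illustration.
struct Mat {
    data: Vec<f64>,
    nrow: usize,
    ncol: usize,
}

impl Mat {
    // Dims-only constructor: one zeroed allocation, no closure calls.
    fn zeros(nrow: usize, ncol: usize) -> Self {
        Mat { data: vec![0.0; nrow * ncol], nrow, ncol }
    }

    // Closure-per-element constructor, mirroring the |i, j| -> elem style.
    fn from_fn(nrow: usize, ncol: usize, f: impl Fn(usize, usize) -> f64) -> Self {
        let mut data = Vec::with_capacity(nrow * ncol);
        for j in 0..ncol {
            for i in 0..nrow {
                data.push(f(i, j));
            }
        }
        Mat { data, nrow, ncol }
    }
}

fn main() {
    let a = Mat::zeros(3, 4);
    let b = Mat::from_fn(3, 4, |_, _| 0.0);
    assert_eq!(a.data, b.data);
    assert_eq!((a.nrow, a.ncol), (3, 4));
    println!("both paths produce the same 3x4 zero matrix");
}
```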

Ilia-Kosenkov commented 2 years ago

Also, I think you are missing this function https://gist.github.com/Thell/861d464c7c85ffd4c7285a894b7715f4#file-rcpp_vs_rust_zeros-r-L53-L57. It would be interesting to see if it is faster than matrix construction from dimensions and initializer function.

Thell commented 2 years ago

  • Does IntegerVector my_vec(n) initialize the underlying buffer? When you do vec!(0; n as usize), I suspect you have an extra zeromem() call, at least. Even if R does it itself, in Rust you have an additional operation.

Yes, it initializes the buffer in the R memory pool. Earlier I mentioned that Rcpp has gone through performance improvements over time, and this looks like one of the things that has improved: I just ran a comparison initializing and passing back a std::vector, and there was no extra overhead... there used to be...

  • Using r!(...) is not the best approach performance-wise. We have a new transforming API, try_from, which can currently be enabled with #[extendr(use_try_from = true)]. It will be enabled by default in a future release.
  • As I mentioned above, matrices are sub-optimal right now, and we have not really touched them lately. Could you also try using ndarray to see if there is any difference?

I will look into that and add those to the gist.

Regarding

I think you are missing this function

I was hoping to find an equivalent to the Rcpp vector with dim attributes set, and that missing function was a false start whose only purpose is to remind me that I still need to find an equivalent extendr function.

Ilia-Kosenkov commented 2 years ago

Ok, I see. Still, very useful results -- we clearly need to improve matrix performance, and likely before the next release.

Thell commented 2 years ago

So... here's what I think the C++ and Rust versions of the dimmed vector -> matrix are (even though I'm not sure about returning the Result<> like that)

IntegerVector rcpp_0s_intmat_dimmedvec(int n) {
    IntegerVector my_vec(n * n);
    my_vec.attr("dim") = Dimension(n, n);
    return my_vec;
}
fn rust_0s_intmat_dimmedvec(n: i32) -> Result<Robj> {
    let my_vec = vec!(0; n as usize * n as usize);
    let my_mat = r!(my_vec).set_attrib(dim_symbol(), [n as usize, n as usize]);
    my_mat
}

Given the vector results from the previous benchmarks, I would anticipate that, since we are only setting a dim attribute, this Rust function should perform competitively with the C++ function. (This benchmark contains two tests for these functions: one at the size of the vector tests, denoted with _v, and one at the size of the matrix tests, denoted with _m.)

N =  640 
Unit: microseconds
                       expr      min        lq        mean    median        uq       max neval  cld
       rcpp_zeros_stdvec(N)    1.913    4.4435    9.489698    6.3425   11.5020   274.539  1000 a   
       rust_zeros_stdvec(N)    2.675    5.3005   13.750623    8.6115   20.0980   378.368  1000 a   
       rcpp_zeros_intvec(N)    1.723    3.7170    8.420855    5.7010   10.6000    79.643  1000 a   
       rust_zeros_intvec(N)    2.655    4.9140   13.566919    8.2805   20.5245   410.924  1000 a   
     rust_zeros_integers(N)    3.066    6.0010   15.235590    8.9725   21.1305   362.564  1000 a   
 rcpp_0s_dimmed_intvec_v(N)    2.164    4.2480    9.673049    6.6425   12.1430   303.762  1000 a   <-----
 rust_0s_dimmed_intvec_v(N)    3.446    7.2140   17.250222   13.1300   21.9475   527.780  1000 a   <-----
       rcpp_zeros_intmat(N)   33.314  129.7135  301.950395  165.3915  389.0295 11359.324  1000  b  
       rust_zeros_intmat(N) 1068.576 1387.3960 1878.637816 1722.7925 2050.6155 18784.302  1000    d
 rcpp_0s_dimmed_intvec_m(N)   32.533  136.5365  290.744776  170.4560  391.6250  8901.480  1000  b  <-----
 rust_0s_dimmed_intvec_m(N)  581.153  738.2130 1029.551299  889.0520 1074.6720 11822.131  1000   c <-----

As we can see from the <----- marked results, the Rcpp versions are right in line with their intvec and intmat counterparts, yet the Rust versions aren't, with the intvec tests being quicker and the intmat tests being dramatically quicker. Note that all we've done is add the dim attribute and return Result<Robj> instead of an unwrapped Robj. Again, my expectation is that adding the dim attribute should have no noticeable impact.

As a sidebar test I made an R function to call rust_zeros_intvec and add the dims in R

r_dimmed_rust_intvec_m <- function(n) {
  res <- .Call("wrap__rust_zeros_intvec", n*n, PACKAGE = "librextendr27")
  dim(res) <- c(n,n)
  return(res)
}

and

N =  640 
Unit: microseconds
                          expr     min       lq     mean   median       uq      max neval cld
      rcpp_zeros_intvec(N * N)  40.427 147.2065 238.1183 162.857 182.1690   6458.753  1000  a 
      rust_zeros_intvec(N * N) 573.718 636.6740 987.8065 728.209 840.6595 154891.654  1000   b
 rcpp_zeros_dimmed_intvec_m(N)  42.261 149.3860 256.9951 162.561 183.3860   7796.264  1000  a 
 rust_zeros_dimmed_intvec_m(N) 570.191 646.4225 825.5226 740.067 842.3170   9759.256  1000   b
     r_dimmed_rust_intvec_m(N) 573.408 653.4255 873.8778 744.060 860.4870   8743.313  1000   b

Given these results I'm inclined to think the issue isn't pointing directly at matrix handling.

Thell commented 2 years ago

  • Using r!(...) is not the best approach performance-wise. We have a new transforming API, try_from, which can currently be enabled with #[extendr(use_try_from = true)]. It will be enabled by default in a future release.

Is this the way to use use_try_from? I'm not sure how to verify that it's enabled and being used... Looking at the results definitely didn't help... 😄

rextendr::rust_source(
  profile = "release",
  extendr_deps = list(
    `extendr-api` = list(git = "https://github.com/extendr/extendr")
  ),
  code = '
/// @export
#[extendr(use_try_from = true)]
fn rust_zeros_stdvec_tf(n: i32) -> Vec<i32> {
    let my_vec = vec!(0; n as usize);
    my_vec
}

extendr_module! {
    mod rust_wrap;
    fn rust_zeros_stdvec_tf;
}
')
> N<-640; microbenchmark(rcpp_zeros_stdvec(N*N), rust_zeros_stdvec(N*N), rust_zeros_stdvec_tf(N*N), times=1000)
Unit: microseconds
                        expr     min       lq     mean   median       uq       max neval cld
    rcpp_zeros_stdvec(N * N) 147.833 190.8055 516.7946 218.6940 279.9210 160522.41  1000  a 
    rust_zeros_stdvec(N * N) 575.251 629.3245 881.4189 737.0965 864.4950  12784.64  1000   b
 rust_zeros_stdvec_tf(N * N) 575.893 628.4230 900.2462 730.0025 857.8025  11044.42  1000   b

[edit] Interesting... cargo rust release build on cargo_zeros_intvec <- rust_fn(n,... = 'let (n, xn) = Rval::new_vector_integer(n.as_usize(), &mut pc); for xni in xn.iter_mut() { *xni = 0}; n')

 N<-640; microbenchmark(cargo_zeros_intvec(N*N), times = 1000)                                                                                         
Unit: microseconds
                      expr     min      lq     mean  median      uq      max neval
 cargo_zeros_intvec(N * N) 336.964 382.865 577.2551 416.468 473.595 7546.172  1000

andy-thomason commented 2 years ago

Bear in mind that converting from a Vec<i32> to an integer vector has some overhead. This is similar to returning a std::vector<int> in Rcpp - not the fastest thing.

Using Integers and collect should match the C++ performance, provided the compiler flags are set to sensible values. This essentially creates an R array with Rf_allocVector and fills it with values in an autovectorised loop. rextendr protects the R call with a thread guard, which Rcpp does not, so it will be slightly slower.

For better results - and I haven't tried it - try passing a vector to the function to fill. The new Integers type supports DerefMut to give a mutable slice which can be filled with a loop or rayon for very large vectors.
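The fill-a-preallocated-buffer pattern can be sketched standalone; here a plain mutable slice stands in for the DerefMut view of Integers, and the helper name is made up:

```rust
// Fill an existing buffer in place. With a real Integers value the
// slice would come from DerefMut, so no intermediate Vec is allocated
// and no copy into R memory is needed.
fn fill_squares(buf: &mut [i32]) {
    for (i, x) in buf.iter_mut().enumerate() {
        *x = (i as i32) * (i as i32);
    }
}

fn main() {
    let mut buf = vec![0i32; 5];
    fill_squares(&mut buf);
    assert_eq!(buf, vec![0, 1, 4, 9, 16]);
    println!("{:?}", buf);
}
```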

An alternative is also to use ALTVEC and convert an iterator to a lazy vector. The cost of doing this is very small, although using the vector in R will almost certainly instantiate the entire vector on first use. Integers supports this behaviour through from_values.

Note that Rust's default build settings are very "safe", i.e. very slow. To get the best result on a platform, use RUSTFLAGS="-C target-cpu=native -C opt-level=3" or similar. We should default rextendr::rust_source to "release" and provide some way to set RUSTFLAGS.
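For instance, one way to set those flags for the current shell session before invoking rextendr, assuming cargo's standard behaviour of reading RUSTFLAGS from the environment:

```shell
# Opt into native-CPU codegen and full optimisation for subsequent
# cargo builds in this session.
export RUSTFLAGS="-C target-cpu=native -C opt-level=3"
echo "$RUSTFLAGS"
```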

Ilia-Kosenkov commented 2 years ago

@andy-thomason, can you comment on the matrix performance? While 1.5 performance decrease in vector creation is a 'reasonable' tradeoff, the numbers we get when creating matrices are... discouraging.

andy-thomason commented 2 years ago

I would have to look at the generated code to comment.

The current Matrix implementation has constructors that take a closure of |i, j| -> elem

Much depends on whether the loop will vectorise. There is usually about a 10:1 difference in performance if it does.

The version:

fn rust_0s_intmat_dimmedvec(n: i32) -> Result<Robj> {
    let my_vec = vec!(0; n as usize * n as usize);
    let my_mat = r!(my_vec).set_attrib(dim_symbol(), [n as usize, n as usize]);
    my_mat
}

Should ideally be done using Integers::from_values, as using Vec always involves a complete duplication of the vector.

I might make an integers! macro along the lines of:

integers![0; 1024]
integers![1, 2, 3, 4]
integers![|i| i+1; 1024]
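Those three forms could desugar roughly like this; the sketch below is hypothetical and uses Vec<i32> in place of the real Integers type:

```rust
// Hypothetical desugaring of an integers!-style macro; Vec<i32> stands
// in for the real Integers wrapper. The closure arm must come first so
// it is tried before the repeat-value arm.
macro_rules! integers_sketch {
    // integers![|i| i + 1; 1024]: build from an index closure.
    (|$i:ident| $body:expr; $n:expr) => {
        (0..$n).map(|$i| $body).collect::<Vec<_>>()
    };
    // integers![0; 1024]: repeat a single value.
    ($val:expr; $n:expr) => {
        vec![$val; $n]
    };
    // integers![1, 2, 3, 4]: explicit elements.
    ($($x:expr),+ $(,)?) => {
        vec![$($x),+]
    };
}

fn main() {
    assert_eq!(integers_sketch![0; 4], vec![0, 0, 0, 0]);
    assert_eq!(integers_sketch![1, 2, 3, 4], vec![1, 2, 3, 4]);
    assert_eq!(integers_sketch![|i| i + 1; 4], vec![1, 2, 3, 4]);
    println!("all three forms agree with their Vec equivalents");
}
```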

For short vectors (<1k) the R calls will dominate and we add extra overhead by making them thread safe.

clauswilke commented 2 years ago

We should default rextendr::rust_source to "release"

It's unclear that this is the right choice, as it makes compilation noticeably slower. For a function that is most likely going to be used interactively, it's not clear to me that this is the right tradeoff.

Thell commented 2 years ago

Bear in mind that converting from a Vec<i32> to an integer vector has some overhead. This is similar to returning a std::vector<int> in Rcpp - not the fastest thing.

That used to be the case but not anymore. I know I've posted quite a few results from quite a few different functions so let's cut this down to size with what I believe to be the most direct comparisons.

# Very basic Native -> R

# The idea behind each test is to have in-hand a native integer vector
# obtained via the result of some native operations on `n` that should be
# returned to R as an integer vector.

# our base line
r_fn <- function(n) {
  vector(mode="integer", length=n)
}

Rcpp::cppFunction('
std::vector<int> rcpp_fn(int n) {
    std::vector<int> my_vec(n, 0);
    return my_vec;
}')

rextendr::rust_function('
fn rextendr_defaults(n: i32) -> Vec<i32> {
    let my_vec = vec!(0; n as usize);
    my_vec
}
')

rextendr::rust_function('
fn rextendr_release(n: i32) -> Vec<i32> {
    let my_vec = vec!(0; n as usize);
    my_vec
}
',
profile = "release")

rextendr::rust_function('
fn rextendr_release_devapi(n: i32) -> Vec<i32> {
    let my_vec = vec!(0; n as usize);
    my_vec
}
',
profile = "release",
extendr_deps = list(
  `extendr-api` = list(git = "https://github.com/extendr/extendr")
))

# cargo with modification of the function required.
#  Afaik it doesn't have any automatic type conversion handling.
cargo_fn <- cargo::rust_fn(n, '
  let rust_n = n.as_usize();
  let my_vec = vec!(0; rust_n);
  let (r_vec, xr_vec) = Rval::new_vector_integer(rust_n, &mut pc);
  for (relem, rustelem) in xr_vec.iter_mut().zip(my_vec) {
   *relem = rustelem;
  };
  r_vec
')


Unit: relative
                       expr        min        lq      mean    median        uq        max neval cld
                    r_fn(N)   1.000000  1.000000  1.000000  1.000000  1.000000  1.0000000  1000 a  
                 rcpp_fn(N)   3.583614  1.279764  1.333190  1.326804  1.453437  0.9885287  1000 a  
                cargo_fn(N)   3.509943  1.264357  1.281811  1.309269  1.406735  1.2350064  1000 a  
       rextendr_defaults(N) 307.610555 70.658629 51.260960 71.826732 71.052254  4.7738947  1000   c
        rextendr_release(N)   9.555352  2.674739  2.742079  2.670020  2.785710 16.6881294  1000  b 
 rextendr_release_devapi(N)  10.923281  2.986696  2.674138  3.011260  3.150232  1.1583600  1000  b 

Perhaps it's just me not understanding what is required for 'safety' and what a reasonable trade-off in time is, but when I used PyO3 I was so impressed with Rust's improvement over my Python solution that I had to try it with R, and... well, ouch. I think a big part of that 'ouch' is that I am familiar with Rcpp, which is my go-to when I need more performance out of R, so that is what I compare against. The cargo benchmark shows that Rust can indeed be just as performant at getting values back to R, but I don't know which safety checks are in place that account for the gap between that cargo benchmark and the rextendr times.

This also supports, in my opinion, having release be the default because that is going to be the first impression.

Thell commented 2 years ago

So, I've been doing some reading to further understand what @andy-thomason stated in his earlier post along with what the overhead of safety is.

From looking at cargo's new_vector_integer, it allocates, increments the protect count, and returns (Self, &'static mut [i32]) (doc, src), which, if I understand what I've been reading, is inherently unsafe in multiple ways; a primary goal of extendr is to ensure safety by protecting the SEXP behind the Robj interface and providing safe methods of interaction.

From the rust docs...

Note that all interaction with a static mut is unsafe, both reading and writing. Dealing with global mutable state requires a great deal of care.

So, if that very basic explanation is correct, the overhead between cargo's static mut reference methodology and extendr's Robj methodology is the 'cost' of safety, extensibility, and ease of use (i.e. drop-in Rust function usage), rather than being the cost of a particular method of wrapping values.

Does that sound right?

andy-thomason commented 2 years ago

Also: vectors over a certain size are currently built as ALTREP. We should probably make this optional until we can do it at zero cost.


CGMossa commented 2 years ago

While this is an important issue, I believe this particular conversation has gone stale. We should explore a benchmark suite, if only internally, to keep track of potential performance regressions. These things tend to grow into their own projects/repos, so for now we'll leave it at that.

andy-thomason commented 2 years ago

Note that this PR https://github.com/extendr/extendr/pull/401 is likely to fix things.