RcppCore / RcppArmadillo

Rcpp integration for the Armadillo templated linear algebra library
193 stars 55 forks source link

Issue on call by reference with a NA matrix #414

Closed qianchd closed 1 year ago

qianchd commented 1 year ago

For large matrix, it is better to pass the address than the value. However, when using RcppArmadillo, I found a inconsistent behavior such that if all the values in the original R matrix A are NA, the Call by Reference function below can not modify the value directly. But after an identical operation, it works.

Consider the following C++ functions.

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>

// [[Rcpp::export]]
Rcpp::List mod_mat(arma::mat & A, int & i, int & j, double & x) {
    Rcpp::List out;
    A(i-1, j-1) = x;
    out["A"] = A;
    out["x"] = x;
    return out;
}

// [[Rcpp::export]]
arma::mat return_mat(arma::mat & A) {
  return A;
}

In R,

> A <- matrix(NA, 10, 10)
> print(A[1,1])
[1] NA
> 
> res <- mod_mat(A, 1, 1, 2)
> print(A[1,1])
[1] NA
> 
> A <- return_mat(A)
> res <- mod_mat(A, 1, 1, 2)
> print(A[1,1])
[1] 2

And If A is not a full NA matrix,

> A <- matrix(NA, 10, 10)
> A[2, 2] <- 1
> print(A[1, 1])
[1] NA
> 
> res <- mod_mat(A, 1, 1, 2)
> print(A[1, 1])
[1] 2
> 
> A <- return_mat(A)
> res <- mod_mat(A, 1, 1, 2)
> print(A[1, 1])
[1] 2

Also, I know that RcppArmadillo will make a copy from the SEXP to arma::mat. And in the above code, it seems that such a copy operation only take the effect at the first time? After doing A <- return_mat(A) such that A is returned from a arma::mat object, when A is called, it is directly passed to the C++ function as a pointer that like the Rcpp::NumericMatrix does?

After some further reading, I find it is related to #368. As commented by eddelbuettel, the resulted matrix need to be returned as sugguested. However, in practice, A could be a rather large one, e,g, the $n \times n$ kernel gram matrix, the call by reference method is still a nice feature. I am wondering in the above, the copy only happens once, or it happens every time when mod_mat is called?

Overall, one lesson I learned here is to treat NAs more seriously in Rcpp tools. And the thing that I actually want to figure out is that: What is the workflow of RcppArmadillo when the call-by-reference function like mod_mat is involved? Does it really get some boosting comparing to the call-by-value function, like the pure C++ code?

eddelbuettel commented 1 year ago

I find your example a little hard to follow -- the key is to ensure you have the consistent storage format, which you do not:

> A <- matrix(NA, 2, 2)
> str(A)
 logi [1:2, 1:2] NA NA NA NA
> 

As a logical matrix, this guarantees you get copy as required the interface requesting a numeric (aka real aka double) matrix, breaking any (implicit) call by reference for which you'd need a 'numeric' aka 'real' aka 'double' matrix as the strong typing matters. A lot.

The speed implications have be discussed and examined a lot too. See for example this file.

But most people think the call by reference is an anti-patterm because of edge cases like this. Your mileage may vary.

I think we can close this. Open-ended discussions may be better on the mailing list.