SixiangHu / DataMan

R package for data cleaning, preliminary data analysis and modeling assessing with visualisation.
3 stars 0 forks source link

correlation calculation on numerical dataset #24

Closed SixiangHu closed 8 years ago

SixiangHu commented 8 years ago

Tried to have an improvement about the speed of cor function.

Currently, the following code works (with inline and RcppArmadillo)

src <- '
  mat m1 = as<mat>(x);
  return(wrap(cor(m1)));
  '

  fx <- inline::cxxfunction( signature(x = "numeric") , 
                     body = src, 
                     plugin='RcppArmadillo', 
                     includes = 'using namespace Rcpp;
                                 using namespace arma;')

It is slightly quicker than cor:

library(microbenchmark)
x <- as.matrix(mtcars)
microbenchmark(
  fx(x),
  cor(x)
)
Unit: milliseconds
   expr      min       lq     mean   median       uq      max neval cld
  fx(x) 11.50953 11.67405 11.96615 11.82167 12.21154 13.20968   100  a 
 cor(x) 13.92392 14.22428 14.46255 14.37597 14.72041 15.27963   100   b

But on a 300 x 300 matrix, the time spending is the same. In this case, there is no benefit to continue on this?

SixiangHu commented 8 years ago

According to Dirk:

Lastly, to come back to the OP's initial question, you are of course right. Armadillo calls the LAPACK / BLAS routines; so it ends up making the same call as R does. (Rcpp)Armadillo makes a lot of other transformations faster, but the core multiplication is about the same as both R and Armadillo (and everybody else, apart from Eigen) "just" call the specialised BLAS. As one should.

http://finzi.psych.upenn.edu/R/library/RcppArmadillo/html/RcppArmadillo-package.html

Hence this code doesn't have much benefit.

It may be useful for other people, hence I've pasted the code here:

C++:

#include <RcppArmadillo.h>
using namespace Rcpp;
using namespace arma;

// [[Rcpp::export]]
NumericVector Corr(NumericVector x) {
  arma::mat m1 = as<arma::mat>(x);
  return wrap(cor(m1));
}

The important steps are

Include the RcppArmadillo.h header file, which also includes armadillo.h.

Import Rcpp, and LinkingTo Rcpp and RcppArmadillo by adding these lines to the DESCRIPTION file:

      Imports: Rcpp (>= 0.11.0)
      LinkingTo: Rcpp, RcppArmadillo

Link against the BLAS and LAPACK libraries, by adding this line in the Makevars and Makevars.win files:

PKG_LIBS = $(LAPACK_LIBS) $(BLAS_LIBS) $(FLIBS) 
  • from ?RcppArmadillo in R
SixiangHu commented 6 years ago
#### Prepare Data ####
m <- 40
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))

#### Method 1: Hmisc::rcorr ####
# rcorr calling fortran language without parallel calculation
library(Hmisc)
correlations2 <- rcorr(as.matrix(the_data))

#### Method 2: rcpp ####
src <- '
  mat m1 = as<mat>(x);
  return wrap(arma::cor(m1));
'

fx <- inline::cxxfunction(signature(x = "matrix") , 
                           body = src, 
                           plugin='RcppArmadillo', 
                           includes = 'using namespace Rcpp;
                           using namespace arma;')

#### results ####
library(microbenchmark)
microbenchmark(
  c1 <- rcorr(as.matrix(the_data)),
  c2 <- fx(as.matrix(the_data)),
  c3 <- cor(as.matrix(the_data))
)
                             expr     min       lq     mean   median       uq      max neval cld
 c1 <- rcorr(as.matrix(the_data)) 771.129 777.1020 821.3307 791.0405 830.5785 1126.116   100   c
    c2 <- fx(as.matrix(the_data)) 246.899 254.5785 295.9165 260.6940 304.7820  566.898   100 a  
   c3 <- cor(as.matrix(the_data)) 287.574 294.6850 338.5660 302.5075 328.8175  733.583   100  b