Open privefl opened 7 years ago
Ah, that is noteworthy regarding the subsets. These benchmarks are useful though. You should create a repo to reproduce the benchmarks so that, as I make improvements here, they can be tested.
Also, I was unaware you were using Microsoft R. That is a beast in itself. Again, these packages are designed for all R and maximum flexibility instead of expecting people to run to Microsoft. That said, I would like to see if there is a way to leverage the MKL with Armadillo as well.
I have lots of benchmarks and tests everywhere; I'm lost in the middle of them. Do you want to make a repo to compare all the possible ways of implementing multiplication?
By the way, you can see my newest post where I present some benchmarks of matrix-vector multiplication: https://privefl.github.io/blog/Tip-Optimize-your-Rcpp-loops/. This is weird because I don't have the problem of slow subsetting anymore. Yet Arma is shown to be slower than Eigen and Rcpp for this type of multiplication (no change when using the MKL).
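For reference, here is a minimal sketch (my reconstruction, not the exact code from the post; `prod_naive` is a name I made up) of the kind of naive Rcpp matrix-vector product such benchmarks typically compare against:

```cpp
#include <Rcpp.h>
using namespace Rcpp;

// Naive y = A * x, traversing A column by column (column-major friendly).
// [[Rcpp::export]]
NumericVector prod_naive(const NumericMatrix &A, const NumericVector &x) {
  int n = A.nrow(), m = A.ncol();
  NumericVector y(n);  // zero-initialized
  for (int j = 0; j < m; j++) {
    double xj = x[j];
    for (int i = 0; i < n; i++)
      y[i] += A(i, j) * xj;
  }
  return y;
}
```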
Alright, a few follow-up questions.
I use `char` to store my data, which contains only 0s, 1s, and 2s. I thought it would be more efficient to read from `char` than from `double`, but apparently not. I'm thinking now of storing my 3 different values on 64 bits... `char`, but really weird with type `double`. Moreover (data not shown), Arma is really bad on a filebacked `char` `big.matrix` (800+ ms).

@privefl I have poked around a little bit with your code and benchmarks. I suspect the `char` functions are having issues because of the conversion between `char` and `double` for R. I'm not sure how to really get around this at the moment. I suspect this because creating a few void functions to test the different types shows `char` performing at least as well as, if not better than, `float`, and consistently faster than `double`. Curiously though, even passing an initialized matrix to the function to fill doesn't improve performance. Will need to explore further.
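For what it's worth, here is a minimal sketch (an assumption of what such void functions could look like, not cdeterman's actual test code; `read_all` is a hypothetical name) that isolates the per-type read cost:

```cpp
#include <cstddef>

// Walk a raw buffer of type T and accumulate into a double, so the benchmark
// measures only the per-type load/convert cost, not any R-level conversion.
template <typename T>
void read_all(const T *data, std::size_t n, double &sink) {
  double acc = 0.0;
  for (std::size_t k = 0; k < n; ++k)
    acc += static_cast<double>(data[k]);  // char/float promote to double here
  sink = acc;  // write the result out so the loop is not optimized away
}

// Explicit instantiations for the three element types under discussion.
template void read_all<char>(const char *, std::size_t, double &);
template void read_all<float>(const float *, std::size_t, double &);
template void read_all<double>(const double *, std::size_t, double &);
```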
@privefl Also, how are you comparing subset operations when you say Armadillo slows down so much? I see the benchmarks for Armadillo but none for the others. When creating a subset from non-contiguous elements, it is difficult to avoid a copy. That is likely what is causing the decrease in performance with Armadillo. In many machine learning applications it is actually common to have the program sort the matrix order and then grab the subsets as contiguous elements instead. This is sometimes faster, at least when many subsets take place (see the sketch below). Food for thought.
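A hedged sketch of that "sort, then grab" idea (names and layout are my own, not from bigmemory): sorting the index vector lets each column be gathered in one forward pass over memory, which is much kinder to the cache than scattered jumps:

```cpp
#include <algorithm>
#include <vector>

// Gather the rows listed in ind from one column of a column-major matrix.
// Sorting ind first means memory is visited in increasing address order.
std::vector<double> gather_rows(const double *col,      // pointer to one column
                                std::vector<int> ind) { // row indices, any order
  std::sort(ind.begin(), ind.end());  // visit elements in memory order
  std::vector<double> out;
  out.reserve(ind.size());
  for (int i : ind)
    out.push_back(col[i]);            // one forward pass per column
  return out;
}
```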
For subsetting, I'm just using `macc[ind.col[j]][ind.row[i]]` instead of `macc[j][i]`. A 50% subsetting should typically give a 50% reduction in execution time. Moreover, Eigen or Arma views of a `sub.big.matrix` on rows won't work because the elements are no longer contiguous.

I will try to implement two views of a `big.matrix` in `bigmemory` (`sub.big.matrix` and `big.matrix`) so that you can just access elements with `macc(i, j)` rather than `(macc[ind.col[j]][ind.row[i]] - mean[j]) / sd[j]`, which is tedious.
In the near future, I may need some help in finalizing some of my pull requests, so don't hesitate to edit them.
@privefl this would be great. I actually was looking for something like this previously but didn't have the time to invest in building it myself. Generally these are referred to as 'gather-scatter' methods. Perhaps something to look into as you develop this.
While waiting for a more general implementation (a `sub.big.matrix`, not just a `SubMatrixAccessor`) in the `bigmemory` package, I use this in one of my packages:
```cpp
// Accessor for a subset of a bigmemory BigMatrix, without copying the data.
// Assumes the bigmemory headers are available
// (e.g. via // [[Rcpp::depends(bigmemory, BH)]]).
#include <Rcpp.h>
#include <bigmemory/BigMatrix.h>
using Rcpp::IntegerVector;

template <typename T>
class SubMatrixAccessor {
public:
  typedef T value_type;

  SubMatrixAccessor(BigMatrix &bm,
                    const IntegerVector &row_ind,
                    const IntegerVector &col_ind) {
    _pMat = reinterpret_cast<T*>(bm.matrix());
    _totalRows = bm.total_rows();
    _row_ind = row_ind;
    _col_ind = col_ind;
    _nrow = row_ind.size();
    _ncol = col_ind.size();
  }

  // Element (i, j) of the subset; the underlying matrix is column-major.
  inline T operator()(int i, int j) {
    return *(_pMat + _totalRows * _col_ind[j] + _row_ind[i]);
  }

  int nrow() const { return _nrow; }
  int ncol() const { return _ncol; }

protected:
  T *_pMat;
  int _totalRows;
  int _nrow;
  int _ncol;
  IntegerVector _row_ind;
  IntegerVector _col_ind;
};
```
It enables me to write only `macc(i, j)` in my Rcpp code.
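For instance, a hypothetical use of this accessor (my illustration, not code from privefl's package; `subset_sum` and the 0-based index convention are assumptions) would be an exported function that sums over a subset of a `double` `big.matrix` without copying it:

```cpp
// Builds on the SubMatrixAccessor class above; indices are assumed 0-based
// (i.e. pass ind - 1 from R). subset_sum is a hypothetical name.
// [[Rcpp::export]]
double subset_sum(SEXP pBigMat,
                  const Rcpp::IntegerVector &row_ind,
                  const Rcpp::IntegerVector &col_ind) {
  Rcpp::XPtr<BigMatrix> xpMat(pBigMat);  // external pointer from bigmemory
  SubMatrixAccessor<double> macc(*xpMat, row_ind, col_ind);
  double s = 0;
  for (int j = 0; j < macc.ncol(); j++)
    for (int i = 0; i < macc.nrow(); i++)
      s += macc(i, j);  // no index arithmetic at the call site
  return s;
}
```

From R, one would pass the `big.matrix`'s `@address` slot as `pBigMat`.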
Were you also thinking of this type of implementation?
Now I remember why I couldn't use this package. In most machine learning applications, you need to split your data into a training set and a test set. It is much easier if your algorithm takes care of the subsetting for you (just read some of the data, no copy).
The other point is that it doesn't do scaling either. For example, for principal component analysis I need to scale my matrix, but as it is of type `char`, I don't want to have to make a new `double` `big.matrix`; I want my algorithm to take care of the scaling online.
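As an illustration of what "online scaling" could look like, here is a hedged sketch (my own, not part of bigmemory or any existing package) of a scaling wrapper around the accessor above; `mean` and `sd` are assumed to be precomputed column statistics:

```cpp
// Hypothetical wrapper: returns centered and standardized values on the fly,
// so a char big.matrix is never copied into a double one.
template <typename T>
class ScaledAccessor : public SubMatrixAccessor<T> {
public:
  ScaledAccessor(BigMatrix &bm,
                 const Rcpp::IntegerVector &row_ind,
                 const Rcpp::IntegerVector &col_ind,
                 const Rcpp::NumericVector &mean,   // column means of the subset
                 const Rcpp::NumericVector &sd)     // column sds of the subset
    : SubMatrixAccessor<T>(bm, row_ind, col_ind), _mean(mean), _sd(sd) {}

  inline double operator()(int i, int j) {
    // conversion to double and scaling happen only when the element is read
    return (SubMatrixAccessor<T>::operator()(i, j) - _mean[j]) / _sd[j];
  }

protected:
  Rcpp::NumericVector _mean, _sd;
};
```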
To come back to our Armadillo backend discussion, I tested multiplying two `big.matrix` objects (5,000 rows and respectively 5,000 and 50,000 columns) with corresponding vectors. Arma is 1.5 times faster than the naive implementation in Rcpp (which can be optimized to be up to 3 times faster, so that Arma is then 2 times slower). But the real problem is when you try to use subsetting like `Am.rows(ind) * x`, where `Am` is the Arma representation of your `big.matrix` and `x` is the vector: the computation time is multiplied by 4, whereas subsetting has almost no cost in the Rcpp implementation.

For the optimized library part, I'm currently using the MKL through Microsoft R Open. The optimizations that are available for Arma are also available for R, so I can't see any difference in performance with R, where I can more easily implement subsetting and scaling.
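To make the contrast concrete, here is a minimal sketch (my assumption of what the naive Rcpp version looks like; `prod_subset` is a hypothetical name) of a subsetted matrix-vector product. The rows are gathered on the fly inside the loop, so no submatrix is ever materialized, unlike with `Am.rows(ind) * x`:

```cpp
#include <Rcpp.h>
using namespace Rcpp;

// y = A[ind, ] * x, with ind assumed 0-based; no copy of A is made.
// [[Rcpp::export]]
NumericVector prod_subset(const NumericMatrix &A,
                          const NumericVector &x,
                          const IntegerVector &ind) {
  int n = ind.size(), m = A.ncol();
  NumericVector y(n);  // zero-initialized
  for (int j = 0; j < m; j++) {
    double xj = x[j];
    for (int i = 0; i < n; i++)
      y[i] += A(ind[i], j) * xj;  // gather the needed rows on the fly
  }
  return y;
}
```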
This is why I have chosen not to use the Arma backend for `big.matrix`, why I can't use this particular package, and why I am making another package for statistics based on `bigmemory`.

Anyway, thanks for your great contributions to the `bigmemory` package; it is really central to the work I'm doing in my thesis.