Open privefl opened 7 years ago
Ah, that is noteworthy regarding the subsets. These benchmarks are useful though. You should create a repo to reproduce the benchmarks so that, as I make improvements here, they can be tested.
Also, I was unaware you were using Microsoft R. That is a beast in itself. Again, these packages are designed for all R and maximum flexibility instead of expecting people to run to Microsoft. That said, I would like to see if there is a way to leverage the MKL with Armadillo as well.
I have lots of benchmarks and tests everywhere; I'm lost in the middle of them. Do you want to make a repo to compare all the possible ways of implementing multiplication?
By the way, you can see my newest post where I present some benchmarks of matrix-vector multiplication: https://privefl.github.io/blog/Tip-Optimize-your-Rcpp-loops/. This is weird because I don't have the problem of slow subsetting anymore. Yet Arma is shown to be slower than Eigen and Rcpp for this type of multiplication (no change when using the MKL).
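For reference, here is a minimal sketch (my reconstruction, not the exact code from the post; `prod_naive` is a name I made up) of the kind of naive Rcpp matrix-vector product such benchmarks typically compare against:

```cpp
#include <Rcpp.h>
using namespace Rcpp;

// Naive y = A * x, traversing A column by column (column-major friendly).
// [[Rcpp::export]]
NumericVector prod_naive(const NumericMatrix &A, const NumericVector &x) {
  int n = A.nrow(), m = A.ncol();
  NumericVector y(n);  // zero-initialized
  for (int j = 0; j < m; j++) {
    double xj = x[j];
    for (int i = 0; i < n; i++)
      y[i] += A(i, j) * xj;
  }
  return y;
}
```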
Alright, a few follow-up questions.
I use `char` to store my data, which contains only 0s, 1s, and 2s. I thought it would be more efficient to read from `char` than from `double`, but apparently not. I'm thinking now of storing my 3 different values on 64 bits... `char`, but really weird with type `double`. Moreover (data not shown), Arma is really bad on a filebacked `char` `big.matrix` (800+ ms).

@privefl I have poked around a little bit with your code and benchmarks. I suspect the `char` functions are having issues because of the conversion between `char` and `double` for R. I'm not sure how to really get around this at the moment. I suspect this because creating a few void functions to test the different types shows `char` performing at least as well as, if not better than, `float`, and consistently faster than `double`. Curiously though, even passing an initialized matrix to the function to fill doesn't improve performance. Will need to explore further.
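For what it's worth, here is a minimal sketch (an assumption of what such void functions could look like, not cdeterman's actual test code; `read_all` is a hypothetical name) that isolates the per-type read cost:

```cpp
#include <cstddef>

// Walk a raw buffer of type T and accumulate into a double, so the benchmark
// measures only the per-type load/convert cost, not any R-level conversion.
template <typename T>
void read_all(const T *data, std::size_t n, double &sink) {
  double acc = 0.0;
  for (std::size_t k = 0; k < n; ++k)
    acc += static_cast<double>(data[k]);  // char/float promote to double here
  sink = acc;  // write the result out so the loop is not optimized away
}

// Explicit instantiations for the three element types under discussion.
template void read_all<char>(const char *, std::size_t, double &);
template void read_all<float>(const float *, std::size_t, double &);
template void read_all<double>(const double *, std::size_t, double &);
```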
@privefl Also, how are you comparing subset operations when you say Armadillo slows down so much? I see the benchmarks for Armadillo but none for the others. When creating a subset from non-contiguous elements, it is difficult to avoid a copy. That is likely what is causing the decrease in performance with Armadillo. In many machine learning applications it is actually common to have the program sort the matrix order and then grab the subsets as contiguous elements instead. This is sometimes faster, at least when many subsets take place (see the sketch below). Food for thought.
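A hedged sketch of that "sort, then grab" idea (names and layout are my own, not from bigmemory): sorting the index vector lets each column be gathered in one forward pass over memory, which is much kinder to the cache than scattered jumps:

```cpp
#include <algorithm>
#include <vector>

// Gather the rows listed in ind from one column of a column-major matrix.
// Sorting ind first means memory is visited in increasing address order.
std::vector<double> gather_rows(const double *col,      // pointer to one column
                                std::vector<int> ind) { // row indices, any order
  std::sort(ind.begin(), ind.end());  // visit elements in memory order
  std::vector<double> out;
  out.reserve(ind.size());
  for (int i : ind)
    out.push_back(col[i]);            // one forward pass per column
  return out;
}
```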
For subsetting, I'm just using `macc[ind.col[j]][ind.row[i]]` instead of `macc[j][i]`. A 50% subsetting should typically give a 50% reduction in execution time. Moreover, Eigen or Arma views of a `sub.big.matrix` on rows won't work because the elements are no longer contiguous.

I will try to implement two views of a `big.matrix` in `bigmemory` (`sub.big.matrix` and `big.matrix`) so that you can just access elements with `macc(i, j)` rather than `(macc[ind.col[j]][ind.row[i]] - mean[j]) / sd[j]`, which is tedious.
In the near future, I may need some help in finalizing some of my pull requests, so don't hesitate to edit them.
@privefl this would be great. I actually was looking for something like this previously but didn't have the time to invest in building it myself. Generally these are referred to as 'gather-scatter' methods. Perhaps something to look into as you develop this.
While waiting for a more general implementation (a `sub.big.matrix`, not just a `SubMatrixAccessor`) in the `bigmemory` package, I use this in one of my packages:
```cpp
// Accessor for a subset of a bigmemory BigMatrix, without copying the data.
// Assumes the bigmemory headers are available
// (e.g. via // [[Rcpp::depends(bigmemory, BH)]]).
#include <Rcpp.h>
#include <bigmemory/BigMatrix.h>
using Rcpp::IntegerVector;

template <typename T>
class SubMatrixAccessor {
public:
  typedef T value_type;

  SubMatrixAccessor(BigMatrix &bm,
                    const IntegerVector &row_ind,
                    const IntegerVector &col_ind) {
    _pMat = reinterpret_cast<T*>(bm.matrix());
    _totalRows = bm.total_rows();
    _row_ind = row_ind;
    _col_ind = col_ind;
    _nrow = row_ind.size();
    _ncol = col_ind.size();
  }

  // Element (i, j) of the subset; the underlying matrix is column-major.
  inline T operator()(int i, int j) {
    return *(_pMat + _totalRows * _col_ind[j] + _row_ind[i]);
  }

  int nrow() const { return _nrow; }
  int ncol() const { return _ncol; }

protected:
  T *_pMat;
  int _totalRows;
  int _nrow;
  int _ncol;
  IntegerVector _row_ind;
  IntegerVector _col_ind;
};
```
It enables me to write only `macc(i, j)` in my Rcpp code.
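For instance, a hypothetical use of this accessor (my illustration, not code from privefl's package; `subset_sum` and the 0-based index convention are assumptions) would be an exported function that sums over a subset of a `double` `big.matrix` without copying it:

```cpp
// Builds on the SubMatrixAccessor class above; indices are assumed 0-based
// (i.e. pass ind - 1 from R). subset_sum is a hypothetical name.
// [[Rcpp::export]]
double subset_sum(SEXP pBigMat,
                  const Rcpp::IntegerVector &row_ind,
                  const Rcpp::IntegerVector &col_ind) {
  Rcpp::XPtr<BigMatrix> xpMat(pBigMat);  // external pointer from bigmemory
  SubMatrixAccessor<double> macc(*xpMat, row_ind, col_ind);
  double s = 0;
  for (int j = 0; j < macc.ncol(); j++)
    for (int i = 0; i < macc.nrow(); i++)
      s += macc(i, j);  // no index arithmetic at the call site
  return s;
}
```

From R, one would pass the `big.matrix`'s `@address` slot as `pBigMat`.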
Were you also thinking of this type of implementation?
Now I remember why I couldn't use this package. In most machine learning applications, you need to split your data into a training set and a test set. It is much easier if your algorithm takes care of the subsetting for you (just read some of the data, no copy).
The other point is that it doesn't do scaling either. For example, for principal component analysis I need to scale my matrix, but as it is of type `char`, I don't want to have to make a new `double` `big.matrix`; I want my algorithm to take care of the scaling online.
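As an illustration of what "online scaling" could look like, here is a hedged sketch (my own, not part of bigmemory or any existing package) of a scaling wrapper around the accessor above; `mean` and `sd` are assumed to be precomputed column statistics:

```cpp
// Hypothetical wrapper: returns centered and standardized values on the fly,
// so a char big.matrix is never copied into a double one.
template <typename T>
class ScaledAccessor : public SubMatrixAccessor<T> {
public:
  ScaledAccessor(BigMatrix &bm,
                 const Rcpp::IntegerVector &row_ind,
                 const Rcpp::IntegerVector &col_ind,
                 const Rcpp::NumericVector &mean,   // column means of the subset
                 const Rcpp::NumericVector &sd)     // column sds of the subset
    : SubMatrixAccessor<T>(bm, row_ind, col_ind), _mean(mean), _sd(sd) {}

  inline double operator()(int i, int j) {
    // conversion to double and scaling happen only when the element is read
    return (SubMatrixAccessor<T>::operator()(i, j) - _mean[j]) / _sd[j];
  }

protected:
  Rcpp::NumericVector _mean, _sd;
};
```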
To come back to our Armadillo backend discussion, I tested multiplying two `big.matrix` objects (5,000 rows and respectively 5,000 and 50,000 columns) with corresponding vectors. Arma is 1.5 times faster than the naive implementation in Rcpp (which can be optimized to be up to 3 times faster, so that Arma is then 2 times slower). But the real problem is when you try to use subsetting like `Am.rows(ind) * x`, where `Am` is the Arma representation of your `big.matrix` and `x` is the vector: the computation time is multiplied by 4, whereas subsetting has almost no cost in the Rcpp implementation.

For the optimized library part, I'm currently using the MKL through Microsoft R Open. The optimizations that are available for Arma are also available for R, so I can't see any difference in performance with R, where I can more easily implement subsetting and scaling.
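To make the contrast concrete, here is a minimal sketch (my assumption of what the naive Rcpp version looks like; `prod_subset` is a hypothetical name) of a subsetted matrix-vector product. The rows are gathered on the fly inside the loop, so no submatrix is ever materialized, unlike with `Am.rows(ind) * x`:

```cpp
#include <Rcpp.h>
using namespace Rcpp;

// y = A[ind, ] * x, with ind assumed 0-based; no copy of A is made.
// [[Rcpp::export]]
NumericVector prod_subset(const NumericMatrix &A,
                          const NumericVector &x,
                          const IntegerVector &ind) {
  int n = ind.size(), m = A.ncol();
  NumericVector y(n);  // zero-initialized
  for (int j = 0; j < m; j++) {
    double xj = x[j];
    for (int i = 0; i < n; i++)
      y[i] += A(ind[i], j) * xj;  // gather the needed rows on the fly
  }
  return y;
}
```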
This is why I have chosen not to use the Arma backend for `big.matrix`, why I can't use this particular package, and why I am making another package for statistics based on `bigmemory`.

Anyway, thanks for your great contributions to the `bigmemory` package; it is really central to the work I'm doing in my thesis.