Performance Improvement in Random Forest Implementation

I was profiling the random forest implementation and found a bottleneck which can be possibly improved.

I found the following portion of code, which involves just transposing a matrix, comprises about 14% of the time of a classify.shared command's run:

for (int i = 0; i < numSamples; i++) { 
        if (m->control_pressed) { break; }
        for (int j = 0; j < numFeatures; j++) { bootstrappedFeatureVectors[j][i] = bootstrappedTrainingSamples[i][j]; }
}

It is located at line 45 of the class rftreenode.cpp. The main reason of this bottleneck is, here we are accessing the bootstrappedFeatureVectors in column major order, which involves a lot of cache misses (details can be found here: http://stackoverflow.com/questions/4716125/accessing-elements-of-a-matrix-row-wise-versus-column-wise). May be the class can be designed in a way so that we don't need to transpose the matrix at all? (I've already tried, but couldn't come up with a possible solution). Here is the call tree I've got by running the following command on mothur: classify.shared(shared=dataset.shared, design=dataset.design, numtrees=10) calltree It is generated using valgrind together with kcachegrind.

azmfaridee / mothur

Performance Improvement in Random Forest Implementation #41