Closed bentsherman closed 6 years ago
I think our matrix library now has full GPU support, but it is not optimized. Currently, each matrix function handles GPU communication itself to ensure that the CPU data and GPU data remain synced. I don't want to put GPU read/write calls all over the place in the database and algorithms code, so maybe we could add a flag to `matrix_t` that specifies whether the CPU/GPU data are synced. If the matrix functions use this flag to control GPU communication, we might be able to minimize communication without cluttering the code.
On a related note, I have tried several times now to refactor the distance functions (COS, L1, L2) with BLAS routines, but neither the CPU nor the GPU implementation provides any speedup. I think the main reason is that the data matrices `X` and `X_test` must be copied into column vectors beforehand, and the L1 and L2 functions have to copy `x` each time. Here are some numbers from the MNIST dataset on a Palmetto node with 8 cores:
```
# default implementation
./scripts/pbs/hyperparameter.sh -d mnist -a pca -p knn_dist > logs/mnist-knn-dist-1.log
COS  0.045  6.163  47.233
L1   0.090  6.337  28.867
L2   0.051  5.847  28.433

# CBLAS implementation
./scripts/pbs/hyperparameter.sh -d mnist -a pca -p knn_dist > logs/mnist-knn-dist-2.log
COS  0.050  6.063  48.733
L1   0.085  5.677  37.567
L2   0.051  6.433  47.400
```
Moving to mlearn.
I finally got the matrix tests to run on a GPU without crashing. All of the tests passed except for `m_eigen2` and `m_sqrtm`, and those functions aren't being used currently, so we can go ahead and test the entire system.

The main issue with running our system on the GPU is communication: copying memory between the GPU and the host. For matrix functions that can currently use the GPU, the general workflow is to copy the matrix to the GPU, call the underlying MAGMA function, and copy the output matrix back to the host. We want to make sure that GPU memory is read when it needs to be read so that the rest of the system can use the result, but we also want to minimize the amount of communication. For example, if we multiply a matrix by three matrices in a row, we only need to copy the result back from the GPU at the end, not between every step.
So the priority now with GPU code is first to verify our results and second to minimize the amount of communication during matrix operations.