Closed bentsherman closed 6 years ago
I think our matrix library now has full GPU support, but it is not optimized. Currently, each matrix function handles GPU communication itself to ensure that the CPU data and GPU data remain synced. I don't want to put GPU read/write calls all over the place in the database and algorithms code, so maybe we could add a flag to `matrix_t` that specifies whether the CPU/GPU data are synced. If the matrix functions use this flag to control GPU communication, we might be able to minimize communication without cluttering the code.
On a related note, I have tried several times now to refactor the distance functions (COS, L1, L2) with BLAS routines, but neither the CPU nor the GPU implementation provides any speedup. I think the main reason is that the data matrices `X` and `X_test` must be copied into column vectors beforehand, and the L1 and L2 functions have to copy `x` each time. Here are some numbers from the MNIST dataset on a Palmetto node with 8 cores:
```
# default implementation
./scripts/pbs/hyperparameter.sh -d mnist -a pca -p knn_dist > logs/mnist-knn-dist-1.log
COS  0.045  6.163  47.233
L1   0.090  6.337  28.867
L2   0.051  5.847  28.433

# CBLAS implementation
./scripts/pbs/hyperparameter.sh -d mnist -a pca -p knn_dist > logs/mnist-knn-dist-2.log
COS  0.050  6.063  48.733
L1   0.085  5.677  37.567
L2   0.051  6.433  47.400
```
Moving to mlearn.
I finally got the matrix tests to run on a GPU without crashing. All of the tests passed except for `m_eigen2` and `m_sqrtm`, and those functions aren't being used currently, so we can go ahead and test the entire system.

The main issue with running our system on the GPU is communication: copying memory between the GPU and the host. For matrix functions that can currently use the GPU, the general workflow is to copy the matrix to the GPU, call the underlying MAGMA function, and copy the output matrix back to the host. We want to make sure that GPU memory is read when it needs to be read so that the rest of the system can use the result, but we also want to minimize the amount of communication. For example, if we multiply a matrix by three matrices in a row, we only need to copy the result back from the GPU at the end, not between every step.
So the priority now with GPU code is first to verify our results and second to minimize the amount of communication during matrix operations.