Hi,
Recently I have been working on CUDA acceleration of the Schur complement in Ceres, which computes the Hsc matrix from the Jacobian matrix, and I have run into poor performance caused by random global-memory accesses into the Hsc matrix.
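(For context, by Hsc I mean the reduced pose system after eliminating the landmark blocks from the normal equations H = J' * J, i.e. Hsc = Hpp - Hpl * Hll^-1 * Hlp; this notation is mine and may not match this project's exactly.)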
I read through the sparse-sparse matrix multiply part of this project (something like H_lp' * H_pl in this code). It seems that you pre-compute [i, j, k] triplets, which encode the addresses of the small matrix-multiply operations, sort them, and then perform the small block multiplications in a CUDA kernel, roughly as in the sketch below.
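To make sure I understand the scheme, here is a minimal sketch of what I think the triplet-driven kernel does. All names, the block sizes (6x6 pose blocks, 3x3 landmarks), and the block-contiguous storage layout are my assumptions, not this project's actual code:

```cuda
#include <cuda_runtime.h>

constexpr int kPoseDim = 6;      // assumed camera/pose block size
constexpr int kLandmarkDim = 3;  // assumed landmark block size

// One pre-computed multiply op, corresponding to one sorted [i, j, k]
// triplet: resolved element offsets of the two input blocks and of the
// output block. Storing offsets instead of indices is my simplification.
struct MulOp {
  int a_off;  // offset of the kPoseDim x kLandmarkDim block of Hpl
  int b_off;  // offset of the kLandmarkDim x kPoseDim block of Hlp
  int c_off;  // offset of the target kPoseDim x kPoseDim block of Hsc
};

// One thread per op: C(i,j) += A(i,k) * B(k,j) as one small dense block
// product. Hsc is assumed to be block-sparse, each 6x6 block stored
// contiguously row-major. Several ops (different k) can hit the same Hsc
// block, so accumulation uses atomicAdd (double atomicAdd needs sm_60+).
// The scattered c_off writes are exactly the random global-memory
// accesses my question is about.
__global__ void AccumulateHscBlocks(const MulOp* ops, int num_ops,
                                    const double* Hpl, const double* Hlp,
                                    double* Hsc) {
  int t = blockIdx.x * blockDim.x + threadIdx.x;
  if (t >= num_ops) return;

  const MulOp op = ops[t];
  const double* A = Hpl + op.a_off;
  const double* B = Hlp + op.b_off;
  double* C = Hsc + op.c_off;

  for (int r = 0; r < kPoseDim; ++r) {
    for (int c = 0; c < kPoseDim; ++c) {
      double acc = 0.0;
      for (int k = 0; k < kLandmarkDim; ++k) {
        acc += A[r * kLandmarkDim + k] * B[k * kPoseDim + c];
      }
      atomicAdd(&C[r * kPoseDim + c], acc);
    }
  }
}
```

My understanding is that sorting the triplets by output block means adjacent threads tend to write into the same Hsc block, but across blocks the c_off targets are still scattered over the whole Hsc buffer, which is where I see the bad access pattern on my side.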
I was wondering: does this method also suffer from the poor memory-access performance on the Hsc matrix? And if so, how do you tackle it?