Hi,
Recently I have been working on CUDA acceleration of the Schur complement in Ceres, which computes the Hsc matrix from the Jacobian matrix, and I have run into poor performance caused by random global-memory accesses into the Hsc matrix.
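(For context, by Hsc I mean the reduced pose system after eliminating the landmark blocks from the normal equations H = J' * J, i.e. Hsc = Hpp - Hpl * Hll^-1 * Hlp; this notation is mine and may not match this project's exactly.)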
I read through the sparse-sparse matrix multiply part of this project (something like H_lp' * H_pl in this code). It seems that you pre-compute [i, j, k] triplets, which encode the addresses of the small matrix-multiply operations, sort them, and then perform the small block multiplications in a CUDA kernel, roughly as in the sketch below.
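To make sure I understand the scheme, here is a minimal sketch of what I think the triplet-driven kernel does. All names, the block sizes (6x6 pose blocks, 3x3 landmarks), and the block-contiguous storage layout are my assumptions, not this project's actual code:

```cuda
#include <cuda_runtime.h>

constexpr int kPoseDim = 6;      // assumed camera/pose block size
constexpr int kLandmarkDim = 3;  // assumed landmark block size

// One pre-computed multiply op, corresponding to one sorted [i, j, k]
// triplet: resolved element offsets of the two input blocks and of the
// output block. Storing offsets instead of indices is my simplification.
struct MulOp {
  int a_off;  // offset of the kPoseDim x kLandmarkDim block of Hpl
  int b_off;  // offset of the kLandmarkDim x kPoseDim block of Hlp
  int c_off;  // offset of the target kPoseDim x kPoseDim block of Hsc
};

// One thread per op: C(i,j) += A(i,k) * B(k,j) as one small dense block
// product. Hsc is assumed to be block-sparse, each 6x6 block stored
// contiguously row-major. Several ops (different k) can hit the same Hsc
// block, so accumulation uses atomicAdd (double atomicAdd needs sm_60+).
// The scattered c_off writes are exactly the random global-memory
// accesses my question is about.
__global__ void AccumulateHscBlocks(const MulOp* ops, int num_ops,
                                    const double* Hpl, const double* Hlp,
                                    double* Hsc) {
  int t = blockIdx.x * blockDim.x + threadIdx.x;
  if (t >= num_ops) return;

  const MulOp op = ops[t];
  const double* A = Hpl + op.a_off;
  const double* B = Hlp + op.b_off;
  double* C = Hsc + op.c_off;

  for (int r = 0; r < kPoseDim; ++r) {
    for (int c = 0; c < kPoseDim; ++c) {
      double acc = 0.0;
      for (int k = 0; k < kLandmarkDim; ++k) {
        acc += A[r * kLandmarkDim + k] * B[k * kPoseDim + c];
      }
      atomicAdd(&C[r * kPoseDim + c], acc);
    }
  }
}
```

My understanding is that sorting the triplets by output block means adjacent threads tend to write into the same Hsc block, but across blocks the c_off targets are still scattered over the whole Hsc buffer, which is where I see the bad access pattern on my side.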
I was wondering: does this method also suffer from the poor memory-access performance on the Hsc matrix? And if so, how do you tackle it?