NVIDIA / warp

A Python framework for high performance GPU simulation and graphics
https://nvidia.github.io/warp/

Matrix decompositions inside CUDA kernel? #297

Open coreqode opened 3 weeks ago

coreqode commented 3 weeks ago

Is it possible to do matrix decompositions of small matrices (say, 9x9 or 12x12) inside a CUDA kernel? My final goal is to run Newton's method inside the kernel, which requires the inverse of a matrix.

mmacklin commented 3 weeks ago

Hi @coreqode, currently we only support small (3x3) decompositions, but we are planning more general ones, most likely starting with symmetric decompositions, i.e. Cholesky first, and possibly LU as well.

Would Cholesky work for your problem, or are there other types of decompositions you need at these larger sizes?
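
For reference, here is roughly what the existing 3x3 support looks like from inside a kernel, using the wp.svd3 built-in (the kernel and array names below are just illustrative, and wp.qr3 / wp.eig3 follow a similar output-argument pattern if I remember correctly):

import warp as wp

wp.init()

@wp.kernel
def svd3_example(mats: wp.array(dtype=wp.mat33), sigmas: wp.array(dtype=wp.vec3)):
    tid = wp.tid()
    A = mats[tid]

    # wp.svd3 writes its results into the output arguments U, sigma, V
    U = wp.mat33()
    sigma = wp.vec3()
    V = wp.mat33()
    wp.svd3(A, U, sigma, V)  # A = U * diag(sigma) * V^T

    sigmas[tid] = sigma

mats = wp.zeros(16, dtype=wp.mat33)
sigmas = wp.zeros(16, dtype=wp.vec3)
wp.launch(svd3_example, dim=16, inputs=[mats, sigmas])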

gdaviet commented 3 weeks ago

Note that in the meantime it is also possible to implement said decompositions directly in Warp. For instance, here are some QR and Cholesky decomposition routines that I have been using:

@wp.func
def householder_qr_decomposition(A: Any):
    """
    QR decomposition of a square matrix using Householder reflections

    Returns Q and R such that Q R = A, Q orthonormal (such that QQ^T = Id), R upper triangular
    """

    x = type(A[0])()
    Q = wp.identity(n=type(x).length, dtype=A.dtype)

    zero = x.dtype(0.0)
    two = x.dtype(2.0)

    for i in range(type(x).length):
        # x = column i of A with the entries above row i zeroed out
        for k in range(type(x).length):
            x[k] = wp.select(k < i, A[k, i], zero)

        # Householder vector v = x + sign(x_i) * |x| * e_i
        alpha = wp.length(x) * wp.sign(x[i])
        x[i] += alpha
        # guard against a zero column: the reflection becomes a no-op
        two_over_x_sq = wp.select(alpha == zero, two / wp.length_sq(x), zero)

        # apply the reflection H = I - (2 / |v|^2) v v^T: A <- H A, Q <- Q H
        A -= wp.outer(two_over_x_sq * x, x * A)
        Q -= wp.outer(Q * x, two_over_x_sq * x)

    return Q, A


@wp.func
def compute_ldlt(A: Any):
    """LDLT factorization of a symmetric matrix A
    D stored in diagonal
    L stored in lower-triangular part"""

    D = type(wp.get_diag(A))(A.dtype(0.0))

    for j in range(D.length):
        # tmp[k] = L[j, k] * D[k] for the already-factorized columns k < j
        # (D is still zero for k >= j, so those entries do not contribute)
        tmp = wp.cw_mul(A[j], D)

        # D[j] = A[j, j] - sum_k L[j, k]^2 * D[k]
        D[j] = A[j, j] - wp.dot(tmp, A[j])

        # update and scale the j-th column of L below the diagonal
        for i in range(j + 1, D.length):
            A[i, j] = A[i, j] - wp.dot(A[i], tmp)

        for i in range(j + 1, D.length):
            A[i, j] = A[i, j] / D[j]

        A[j, j] = D[j]

    return A
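
These routines produce the factors but not a solve. As a rough sketch (not part of the snippet above), a forward/backward substitution over the in-place layout returned by compute_ldlt could look like the hypothetical helper below, which is what a Newton step H dx = -g ultimately needs:

@wp.func
def solve_ldlt(A: Any, b: Any):
    """Solve the original system A_orig x = b, given the in-place factors
    returned by compute_ldlt (L in the strict lower triangle, D on the diagonal)."""

    # copy the right-hand side so we can update it in place
    x = type(b)()
    for i in range(type(b).length):
        x[i] = b[i]

    # forward substitution with the unit lower-triangular L
    for i in range(type(b).length):
        for j in range(i):
            x[i] = x[i] - A[i, j] * x[j]

    # scale by the diagonal D
    for i in range(type(b).length):
        x[i] = x[i] / A[i, i]

    # backward substitution with L^T (unit diagonal), iterating rows in reverse
    for k in range(type(b).length):
        i = type(b).length - 1 - k
        for j in range(i + 1, type(b).length):
            x[i] = x[i] - A[j, i] * x[j]

    return x
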
coreqode commented 3 weeks ago

@mmacklin Thanks for the reply. For my use case the Hessian is mostly SPD, so Cholesky (the LDLT variant) would work. SVD would also be super helpful, since it can be used for polar decomposition and other things.
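
For the 3x3 case, a polar decomposition can already be assembled from the built-in SVD. A minimal sketch (the helper name is made up, and a robust version would also fix up the signs when U V^T has negative determinant):

@wp.func
def polar_from_svd3(F: wp.mat33):
    # F = R S with R = U V^T (rotation) and S = V diag(sigma) V^T (symmetric stretch)
    U = wp.mat33()
    sigma = wp.vec3()
    V = wp.mat33()
    wp.svd3(F, U, sigma, V)

    R = U * wp.transpose(V)
    S = V * wp.diag(sigma) * wp.transpose(V)
    return R, S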

Thanks @gdaviet for sharing the code (especially the QR). I have been using Cholesky in Taichi in the same way, since it doesn't support nested kernels. However, I was wondering whether nested kernels could be used here for more speedup, since for some of the energy Hessians we get 18x18 matrices.
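
For what it's worth, instead of nested kernels one option is to declare a larger fixed-size matrix type and give each thread one Hessian. A rough sketch, assuming wp.mat / wp.vec custom types of size 18, the compute_ldlt routine above, and a solve helper along the lines of the solve_ldlt sketch earlier (all names here are illustrative, and register pressure may limit how well this performs at 18x18):

mat18 = wp.mat(shape=(18, 18), dtype=wp.float32)
vec18 = wp.vec(length=18, dtype=wp.float32)

@wp.kernel
def newton_step_kernel(hessians: wp.array(dtype=mat18),
                       grads: wp.array(dtype=vec18),
                       dx: wp.array(dtype=vec18)):
    # one thread factorizes and solves one 18x18 SPD system: H dx = -g
    tid = wp.tid()
    H = compute_ldlt(hessians[tid])
    dx[tid] = -solve_ldlt(H, grads[tid])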