ROCm / hipSPARSE

ROCm SPARSE marshalling library
https://rocm.docs.amd.com/projects/hipSPARSE/en/latest/
MIT License
69 stars 42 forks source link

[Issue]: hipsparseDcsrsv2 and HIPSPARSE_MATRIX_TYPE_SYMMETRIC #541

Open kalmarek opened 1 month ago

kalmarek commented 1 month ago

Problem Description

I'm trying to solve a sparse symmetric system; Originally my matrix is CSC, square, only lower triangle is stored. This is how I create sparse hip matrix:

  // A is of size n × m, with
  // A->n, A->m : the sizes
  // A->p : column pointers
  // A->i : row indices
  // A->x : data as doble*
  int nnz = A->p[A->n]; // The last element of A->p gives the number of non-zeros

  // Allocate memory on device
  hipMalloc(&(work->d_vals), nnz * sizeof(double));          // Matrix values (number of non-zeros)
  hipMalloc(&(work->d_row_ptrs), (A->n + 1) * sizeof(int)); // Column pointers (size n + 1)
  hipMalloc(&(work->d_col_inds), nnz * sizeof(int));        // Row indices (number of non-zeros)

  // Copy matrix to device
  // here we switch to CSR, only upper triangle is stored
  hipMemcpy(work->d_vals, A->x, nnz * sizeof(double), hipMemcpyHostToDevice);
  hipMemcpy(work->d_row_ptrs, A->p, (A->n + 1) * sizeof(int), hipMemcpyHostToDevice);
  hipMemcpy(work->d_col_inds, A->i, nnz * sizeof(int), hipMemcpyHostToDevice);

then I set matrix descriptor like this

  // Initialize matrix descriptor
  status = hipsparseCreateMatDescr(&(work->descr));
  if (status != HIPSPARSE_STATUS_SUCCESS)
  {
    printf("Error in init -- descriptor: %d.\n", (int)status);
  }
  hipsparseSetMatIndexBase(work->descr, HIPSPARSE_INDEX_BASE_ZERO); // Zero-based indexing
  if (status != HIPSPARSE_STATUS_SUCCESS)
  {
    printf("Error in init -- index 0: %d.\n", (int)status);
  }
  status = hipsparseSetMatType(work->descr, HIPSPARSE_MATRIX_TYPE_SYMMETRIC); // Symmetric matrix
  if (status != HIPSPARSE_STATUS_SUCCESS)
  {
    printf("Error in init -- type symmetric: %d.\n", (int)status);
  }
  status = hipsparseSetMatFillMode(work->descr, HIPSPARSE_FILL_MODE_UPPER); // stored in upper-diagonal part
  if (status != HIPSPARSE_STATUS_SUCCESS)
  {
    printf("Error in init -- fill upper: %d.\n", (int)status);
  }
  status = hipsparseSetMatDiagType(work->descr, HIPSPARSE_DIAG_TYPE_NON_UNIT); // with non-unit diagonal
  if (status != HIPSPARSE_STATUS_SUCCESS)
  {
    printf("Error in init -- diagonal non-unit: %d.\n", (int)status);
  }

and finally I want to start analyzing for csrsv2:

  status = hipsparseCreate(&(work->handle)); // HIPSPARSE_STATUS_SUCCESS
  // Create info object for LU decomposition
  status = hipsparseCreateCsrsv2Info(&(work->info_LU)); // HIPSPARSE_STATUS_SUCCESS
  // Analyze step to get buffer size
  status = hipsparseDcsrsv2_bufferSize(work->handle, HIPSPARSE_OPERATION_NON_TRANSPOSE,
                                       A->m, nnz, work->descr, work->d_vals, work->d_row_ptrs,
                                       work->d_col_inds, work->info_LU, &(work->bufferSize));

here I get HIPSPARSE_STATUS_NOT_SUPPORTED.

The only difference I see from the example here https://rocmdocs.amd.com/projects/hipSPARSE/en/latest/reference/level2.html#hipsparsexcsrsv2-solve is that I set my matrix to be symmetric;

btw. in the example the matrix is clearly non-symmetric, yet the FILL_MODE is set. do I misunderstand what does this parameter mean?

Operating System

ubuntu 24.04.01

CPU

AMD Ryzen 7 7840U

GPU

AMD Radeon VII

ROCm Version

ROCm 6.2.0

ROCm Component

hipSPARSE

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

$ rocminfo --support ROCk module version 6.8.5 is loaded =====================
HSA System Attributes
=====================
Runtime Version: 1.14 Runtime Ext Version: 1.6 System Timestamp Freq.: 1000.000000MHz Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count) Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED DMAbuf Support: YES

==========
HSA Agents
==========


Agent 1


Name: AMD Ryzen 7 7840U w/ Radeon 780M Graphics Uuid: CPU-XX
Marketing Name: AMD Ryzen 7 7840U w/ Radeon 780M Graphics Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 5289
BDFID: 0
Internal Node ID: 0
Compute Unit: 16
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Memory Properties:
Features: None Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 28504680(0x1b2f268) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED Size: 28504680(0x1b2f268) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 28504680(0x1b2f268) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:


Agent 2


Name: gfx1103
Uuid: GPU-XX
Marketing Name: AMD Radeon Graphics
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 32(0x20) KB
L2: 2048(0x800) KB
Chip ID: 5567(0x15bf)
ASIC Revision: 9(0x9)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2700
BDFID: 49920
Internal Node ID: 1
Compute Unit: 12
SIMDs per CU: 2
Shader Engines: 1
Shader Arrs. per Eng.: 2
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Memory Properties: APU Features: KERNEL_DISPATCH Fast F16 Operation: TRUE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension: x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension: x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 39
SDMA engine uCode:: 18
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 14252340(0xd97934) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED Size: 14252340(0xd97934) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx1103
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension: x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension: x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
Done

Additional Information

No response

kswirydo commented 1 month ago

Please add me to this issue.

jsandham commented 1 month ago

@kalmarek Thanks for raising this issue. I believe this is caused because hipSPARSE calls rocSPARSE (on AMD systems) and in rocSPARSE we currently do not support symmetric matrices. In fact in rocSPARSE, we currently only support the general matrix type (no symmetric, triangular, hermitian). This restriction does not really make sense and is something we need to fix.

I am looking into extending the support in rocsparse to more matrix types.

jsandham commented 1 month ago

I think I have a solution for your problem. Ill use a concrete example. If I understand what you are trying to do, you have a CSC lower triangular matrix, lets say:

// CSC // 2 0 0 0 // 0 1 0 0 // 3 0 3 0 // 4 1 0 2 hcsc_col_ptr = {0, 3, 5, 6, 7}; hcsc_row_ind = {0, 2, 3, 1, 3, 2, 3}; hcsc_val = {2.0, 3.0, 4.0, 1.0, 1.0, 3.0, 2.0};

You are then interpreting this as a CSR matrix by passing the column pointer array as if it was the row pointer array in CSR, passing the row indices array as if it was the column indices array in CSR, and finally passing the values array. This interpretation results in:

// CSR // 2 0 3 4 // 0 1 0 1 // 0 0 3 0 // 0 0 0 2 hcsr_row_ptr = {0, 3, 5, 6, 7}; // same as hcsc_col_ptr hcsr_col_ind = {0, 2, 3, 1, 3, 2, 3}; // same as hcsc_row_ind hcsr_val = {2.0, 3.0, 4.0, 1.0, 1.0, 3.0, 2.0f}; // same as hcsc_val

This is an upper triangular matrix in CSR format. Now if I understand things correctly, you want to solve the original lower triangular system. I think what you want to do is:

1) Interpret your CSC matrix as a CSR matrix like you are doing 2) As a CSR matrix, this is an upper triangular matrix, so you want to continue setting the fill mode as HIPSPARSE_FILL_MODE_UPPER. 3) Because the HIPSPARSE_MATRIX_TYPE_TRIANGULAR matrix type is not supported, you will need to use HIPSPARSE_MATRIX_TYPE_GENERAL. 4) Pass HIPSPARSE_OPERATION_TRANSPOSE instead of HIPSPARSE_OPERATION_NON_TRANSPOSE so that the system you are solving (stored as a CSR upper triangular matrix) is transposed and therefore results in your original lower triangular system.

```
rocsparse_set_mat_type(descr, rocsparse_matrix_type_general);
rocsparse_set_mat_fill_mode(descr, rocsparse_fill_mode_upper);
rocsparse_set_mat_diag_type(descr, rocsparse_diag_type_non_unit);

rocsparse_status status;

size_t buffer_size;
status = rocsparse_scsrsv_buffer_size(handle,
                            rocsparse_operation_transpose,
                            m,
                            nnz,
                            descr,
                            dcsr_val,
                            dcsr_row_ptr,
                            dcsr_col_ind,
                            info,
                            &buffer_size);

void* dbuffer = nullptr;
hipMalloc((void**)&dbuffer, buffer_size);

status = rocsparse_scsrsv_analysis(handle,
                        rocsparse_operation_transpose,
                        m,
                        nnz,
                        descr,
                        dcsr_val,
                        dcsr_row_ptr,
                        dcsr_col_ind,
                        info,
                        analysis,
                        solve,
                        dbuffer);
status = rocsparse_scsrsv_solve(handle,
                        rocsparse_operation_transpose,
                        m,
                        nnz,
                        dalpha,
                        descr,
                        dcsr_val,
                        dcsr_row_ptr,
                        dcsr_col_ind,
                        info,
                        dx,
                        dy,
                        solve,
                        dbuffer);
```

Please let me know if I am misunderstanding your issue.

kalmarek commented 1 month ago

@jsandham thanks for replies! What I'm trying to do is to solve Ax=b with changing bs and constant sparsity pattern in A. So I want to analyze & factorize A first, and then

While re-solves happen very often (every iteration), the diagonal rescaling of A happens rather rarely. Still A can be rather large (in the range of 1e6-1e7).

A comes to me as sparse CSC matrix, symmetric, but only the lower triangle is stored.

Here's a mwe with example data A from the first non-trivial test of ours baked in. It corresponds to the following matrix:

10×10 SparseMatrixCSC{Float64, Int64} with 18 stored entries:
 1.0e-6      ⋅           ⋅     0.736161    1.18921     ⋅      ⋅      ⋅           ⋅      ⋅ 
  ⋅         1.0e-6       ⋅    -0.642353     ⋅          ⋅      ⋅    -0.875448     ⋅      ⋅ 
  ⋅          ⋅        -10.0     ⋅           ⋅          ⋅      ⋅      ⋅           ⋅      ⋅ 
 0.736161  -0.642353     ⋅   -10.0          ⋅          ⋅      ⋅      ⋅           ⋅      ⋅ 
 1.18921     ⋅           ⋅      ⋅        -10.0         ⋅      ⋅      ⋅           ⋅      ⋅ 
  ⋅          ⋅           ⋅      ⋅           ⋅       -10.0     ⋅      ⋅           ⋅      ⋅ 
  ⋅          ⋅           ⋅      ⋅           ⋅          ⋅   -10.0     ⋅           ⋅      ⋅ 
  ⋅        -0.875448     ⋅      ⋅           ⋅          ⋅      ⋅   -10.0          ⋅      ⋅ 
  ⋅          ⋅           ⋅      ⋅           ⋅          ⋅      ⋅      ⋅        -10.0     ⋅ 
  ⋅          ⋅           ⋅      ⋅           ⋅          ⋅      ⋅      ⋅           ⋅   -10.0

commented is also the expanded version (where both triangles are stored).

// hipcc -D__HIP_PLATFORM_AMD__ -I/opt/rocm/include -L/opt/rocm/lib -lhipsparse mwe.c -o mwe
#include <stdio.h>

#include <hip/hip_runtime.h>
#include <hipsparse/hipsparse.h>

typedef struct
{
    /** Matrix values, size: number of non-zeros. */
    double *x;
    /** Matrix row indices, size: number of non-zeros. */
    int *i;
    /** Matrix column pointers, size: `n+1`. */
    int *p;
    /** Number of rows. */
    int m;
    /** Number of columns. */
    int n;
} CSCMatrix;

CSCMatrix *init_data()
{
    CSCMatrix *A = calloc(1, sizeof(CSCMatrix));

    // symmetric expanded
    // int p[] = {0, 3, 6, 7, 10, 12, 13, 14, 16, 17, 18};
    // int i[] = {0, 3, 4, 1, 3, 7, 2, 0, 1, 3, 0, 4, 5, 6, 1, 7, 8, 9};
    // double x[] = {1.0e-6, 0.736161, 1.189207, 1.0e-6, -0.642353, -0.875448, -10.0, 0.736161, -0.642353, -10.0, 1.189207, -10.0, -10.0, -10.0, -0.875448, -10.0, -10.0, -10.0};

    // lower triangle only

    int p[] = {0, 3, 6, 7, 8, 9, 10, 11, 12, 13, 14};
    int i[] = {0, 3, 4, 1, 3, 7, 2, 3, 4, 5, 6, 7, 8, 9};
    double x[] = {1.0e-6, 0.736161, 1.189207, 1.0e-6, -0.642353, -0.875448, -10.0, -10.0, -10.0, -10.0, -10.0, -10.0, -10.0, -10.0};

    A->m = A->n = 10;
    A->p = p;
    A->i = i;
    A->x = x;
    return A;
}

int main()
{
    hipsparseStatus_t status;
    hipsparseHandle_t handle;
    status = hipsparseCreate(&(handle));
    if (status != HIPSPARSE_STATUS_SUCCESS)
    {
        printf("Error in init -- handle: %d.\n", (int)status);
    }

    hipsparseMatDescr_t descr;
    status = hipsparseCreateMatDescr(&(descr));
    if (status != HIPSPARSE_STATUS_SUCCESS)
    {
        printf("Error in init -- descriptor: %d.\n", (int)status);
    }
    status = hipsparseSetMatIndexBase(descr, HIPSPARSE_INDEX_BASE_ZERO);   // Zero-based indexing
    status = hipsparseSetMatType(descr, HIPSPARSE_MATRIX_TYPE_SYMMETRIC);  // Symmetric matrix
    status = hipsparseSetMatDiagType(descr, HIPSPARSE_DIAG_TYPE_NON_UNIT); // with non-unit diagonal
    status = hipsparseSetMatFillMode(descr, HIPSPARSE_FILL_MODE_UPPER);    // data matrix is symmetric and expanded anyway

    // CSR matrix data on the device
    double *d_vals;  // Non-zero values of the matrix (on device)
    int *d_row_ptrs; // Row pointers (on device)
    int *d_col_inds; // Column indices (on device)

    CSCMatrix *A = init_data(); // A is symmetric in our example!!!
    int nnz = A->p[A->n];       // The last element of A->p gives the number of non-zeros

    hipMalloc((void **)&(d_vals), nnz * sizeof(double));         // Matrix values (number of non-zeros)
    hipMalloc((void **)&(d_row_ptrs), (A->n + 1) * sizeof(int)); // Column pointers (size n + 1)
    hipMalloc((void **)&(d_col_inds), nnz * sizeof(int));        // Row indices (number of non-zeros)

    hipMemcpy(d_vals, A->x, nnz * sizeof(double), hipMemcpyHostToDevice);
    hipMemcpy(d_row_ptrs, A->p, (A->n + 1) * sizeof(int), hipMemcpyHostToDevice);
    hipMemcpy(d_col_inds, A->i, nnz * sizeof(int), hipMemcpyHostToDevice);

    csrsv2Info_t info_LU;
    status = hipsparseCreateCsrsv2Info(&info_LU);
    if (status != HIPSPARSE_STATUS_SUCCESS)
    {
        printf("Error in init -- create info: %d.\n", (int)status);
    }

    void *buffer;
    int buffer_size;
    hipsparseOperation_t NOOP = HIPSPARSE_OPERATION_NON_TRANSPOSE;
    hipsparseSolvePolicy_t NOPOLICY = HIPSPARSE_SOLVE_POLICY_NO_LEVEL;
    status = hipsparseDcsrsv2_bufferSize(handle, NOOP, A->m, nnz, descr,
                                         d_vals, d_row_ptrs, d_col_inds,
                                         info_LU, &buffer_size);
    if (status != HIPSPARSE_STATUS_SUCCESS)
    {
        printf("Error in init -- bufferSize: %d.\n", (int)status);
    }
    printf("Allocating %d bytes for csrsv2.\n", buffer_size);
    hipMalloc((void **)&buffer, buffer_size);

    // Perform symbolic factorization for LU
    status = hipsparseDcsrsv2_analysis(handle, NOOP, A->m, nnz, descr,
                                       d_vals, d_row_ptrs, d_col_inds,
                                       info_LU, NOPOLICY, buffer);
    if (status != HIPSPARSE_STATUS_SUCCESS)
    {
        printf("Error in init -- analysis: %d.\n", (int)status);
    }
    // Perform numeric factorization (LU decomposition)
    double alpha = 1.0;
    status = hipsparseDcsrsv2_solve(handle, NOOP, A->m, nnz, &alpha, descr,
                                    d_vals, d_row_ptrs, d_col_inds,
                                    info_LU, NULL, NULL, NOPOLICY, buffer);
    if (status != HIPSPARSE_STATUS_SUCCESS)
    {
        printf("Error in init -- numeric fact: %d.\n", (int)status);
    }
    return status;
}
kalmarek commented 1 month ago

@kalmarek Thanks for raising this issue. I believe this is caused because hipSPARSE calls rocSPARSE (on AMD systems) and in rocSPARSE we currently do not support symmetric matrices. In fact in rocSPARSE, we currently only support the general matrix type (no symmetric, triangular, hermitian). This restriction does not really make sense and is something we need to fix.

I am looking into extending the support in rocsparse to more matrix types.

@jsandham extending rocsparse to support symmetric matrices would be lovely! keep me posted!

kalmarek commented 1 month ago

@jsandham just the last comment about mwe above:

even with A expanded symmetric and

    status = hipsparseSetMatIndexBase(descr, HIPSPARSE_INDEX_BASE_ZERO); // Zero-based indexing
    // status = hipsparseSetMatType(descr, HIPSPARSE_MATRIX_TYPE_SYMMETRIC);  // Symmetric matrix
    status = hipsparseSetMatDiagType(descr, HIPSPARSE_DIAG_TYPE_NON_UNIT); // with non-unit diagonal
    // status = hipsparseSetMatFillMode(descr, HIPSPARSE_FILL_MODE_UPPER);    // data matrix is symmetric and expanded anyway

I receive HIPSPARSE_STATUS_INVALID_VALUE and 0 bytes are allocated for the buffer.

jsandham commented 1 month ago

I think your problem lies in how you are creating your CSCMatrix and setting the data:

CSCMatrix *init_data()
{
    CSCMatrix *A = calloc(1, sizeof(CSCMatrix));

    // symmetric expanded
    // int p[] = {0, 3, 6, 7, 10, 12, 13, 14, 16, 17, 18};
    // int i[] = {0, 3, 4, 1, 3, 7, 2, 0, 1, 3, 0, 4, 5, 6, 1, 7, 8, 9};
    // double x[] = {1.0e-6, 0.736161, 1.189207, 1.0e-6, -0.642353, -0.875448, -10.0, 0.736161, -0.642353, -10.0, 1.189207, -10.0, -10.0, -10.0, -0.875448, -10.0, -10.0, -10.0};

    // lower triangle only

    int p[] = {0, 3, 6, 7, 8, 9, 10, 11, 12, 13, 14};
    int i[] = {0, 3, 4, 1, 3, 7, 2, 3, 4, 5, 6, 7, 8, 9};
    double x[] = {1.0e-6, 0.736161, 1.189207, 1.0e-6, -0.642353, -0.875448, -10.0, -10.0, -10.0, -10.0, -10.0, -10.0, -10.0, -10.0};

    A->m = A->n = 10;
    A->p = p;
    A->i = i;
    A->x = x;
    return A;
}

Specifically I don't think you can set the arrays using local variables that go out of scope once init_data is finished. Hint: print the value of nnz. I would recommend the following:

// CSR matrix data on the device
double *d_vals;  // Non-zero values of the matrix (on device)
int *d_row_ptrs; // Row pointers (on device)
int *d_col_inds; // Column indices (on device)

int p[] = {0, 3, 6, 7, 8, 9, 10, 11, 12, 13, 14};
int i[] = {0, 3, 4, 1, 3, 7, 2, 3, 4, 5, 6, 7, 8, 9};
double x[] = {1.0e-6, 0.736161, 1.189207, 1.0e-6, -0.642353, -0.875448, -10.0, -10.0, -10.0, -10.0, -10.0, -10.0, -10.0, -10.0};

CSCMatrix* A = malloc(sizeof(CSCMatrix)); // I would not use malloc/calloc here at all...
A->m = A->n = 10;
int nnz = p[A->m];   // The last element of A->p gives the number of non-zeros
A->p = malloc(sizeof(int) * (A->m + 1));
A->i = malloc(sizeof(int) * nnz);
A->x = malloc(sizeof(double) * nnz);

memcpy(A->p, p, sizeof(int) * (A->m + 1));
memcpy(A->i, i, sizeof(int) * nnz);
memcpy(A->x, x, sizeof(double) * nnz);

hipMalloc((void **)&(d_vals), nnz * sizeof(double));         // Matrix values (number of non-zeros)
hipMalloc((void **)&(d_row_ptrs), (A->n + 1) * sizeof(int)); // Column pointers (size n + 1)
hipMalloc((void **)&(d_col_inds), nnz * sizeof(int));        // Row indices (number of non-zeros)

// At the end
free(A->p);
free(A->i);
free(A->x);
free(A);

Doing this allows the buffer size and analysis to pass.

The call to hipsparseDcsrsv2_solve will fail as no right hand size or solution vector has been provided.

jsandham commented 1 month ago

Just another note. I am a little worried by:

// Perform numeric factorization (LU decomposition)
double alpha = 1.0;
status = hipsparseDcsrsv2_solve(handle, NOOP, A->m, nnz, &alpha, descr,
                                    d_vals, d_row_ptrs, d_col_inds,
                                    info_LU, NULL, NULL, NOPOLICY, buffer);

Specifically the code comment suggesting that this routine will perform an LU decomposition. Just to be clear, that is not what this routine does. It is the final step in solving the system Mx = b where M is either upper or lower triangular. It sounds like what you may be trying to achieve is to first compute the LU factorization, i.e. A=LU, and then using this factorization to solve:

Ax=b Compute A=LU resulting in system LUx=b Let y = Ux Solve Ly=b first Then solve Ux = y to get x.

If this is the case, you may want to look at https://rocm.docs.amd.com/projects/rocSPARSE/en/latest/reference/precond.html#rocsparse-csrilu0 example. Note that this is an incomplete LU factorization with no fill-in so will not solve Ax=b exactly. Incomplete LU factorization is designed for use in preconditioners for iterative solvers.

As this example shows, you will need to make multiple calls to hipsparseDcsrsv2_solve but you will also need to first call rocsparse_dcsrilu0

kalmarek commented 1 month ago

@jsandham thanks for the tip about CSCMatrix initialization ;) This is literally my first C code since high school, so double thanks for the patience!

Ok, I understand some more now. I see that hipSPARSE exposes csrilu0 as well. All of those factorisations are performed without pivoting though. I guess I need to reorder the matrix it myself then (e.g. using KLU or some other external sources). Or maybe is there something in hipSPARSE for this?

jsandham commented 1 month ago

@kalmarek In the steps I mentioned before, i.e:

1) Given Ax=b, compute A=LU resulting in system LUx=b 2) Let y = Ux and then solve Ly=b first 3) Finally solve Ux = y to get x.

Steps 2) and 3) can be completed using hipsparse/rocsparse/cusparse (specifically the triangular solver you are currently using) but step 1) cannot.

You will either need to perform the LU decomposition yourself and provide the lower and upper triangular L/U matrices to hipsparse/rocsparse/cusparse or will need to use another library that do this for you.

Alternatively you may be able to use hipSOLVER https://rocm.docs.amd.com/projects/hipSOLVER/en/latest/reference/sparse-api/sparse.html#hipsolversp-type-csrlsvchol to complete all three steps in one go. Currently it looks like hipSOLVER only supports solving the sparse system Ax=b using Cholesky factorization which will require your A matrix to be symmetric positive definite. I think they are working on adding csrlsvlu which would solve the system Ax=b using LU factorization (exactly what you are trying to achieve!) but I don't this has been added yet. I messaged the rocSOLVER team to inquire about when they might have this added and ill report back once I know.

kalmarek commented 1 month ago

@jsandham perfect, thanks for the information about hipSOLVER; indeed that was my first library I tried to read about but I did not find what I needed. I'm looking forward to hearing from you!

jsandham commented 1 month ago

@kalmarek I just spoke with our rocsolver team. They are not currently working on adding csrlsvlu but they are working on adding csrlsvqr (which should also work for you in solving Ax=b) and is targeted for the 6.4 release.

YvanMokwinski commented 1 month ago

Hello, A few additional comments.

Using the QR factorization as a decomposition for your symmetric matrix will work, but it is far from optimal for the work required.

Is your symmetric matrix you want to get a decomposition from a definite positive matrix? If this is the case, one approach to solve your sparse linear system could be to use iterative methods using an incomplete Cholesky factorization as a preconditioner. We have a library for solving sparse linear systems on GPU with iterative methods:

https://github.com/ROCm/rocALUTION

If you are looking for a direct linear solver that can operate all its phases (symbolic reordering, numerical factorization, numerical pivoting) on GPU, then you must be aware that this is still an active field of research and that most packages offer a hybrid type of calculation by porting a part of it only to accelerators.

kalmarek commented 1 month ago

@YvanMokwinski thanks for the info, all these comments are much appreciated!

Unfortunately my matrix is just symmetric and not psd; This is a matrix coming from self-dual homogeneous embedding (see e.g. https://link.springer.com/article/10.1007/s10957-016-0892-3). So it is of the form

B = [ R₁ + P   Aᵗ ] 
    [ A       -R₂ ]

where

Every iteration of our algorithm we need to solve Bx = b. The sparsity pattern does not change, so we need to perform the symbolic factorization only once. The numerical re-factorization only needs to be performed occasionally when the scaling R₁ and R₂ is modified.

We have already an iterative approach (CG solver using CUDA) and on my problems the CPU based one (e.g. using the PARDISO solver) proves to be 2-3 times faster.

If you are looking for a direct linear solver that can operate all its phases (symbolic reordering, numerical factorization, numerical pivoting) on GPU, then you must be aware that this is still an active field of research and that most packages offer a hybrid type of calculation by porting a part of it only to accelerators.

Indeed I've already briefly talked to @kswirydo and I'm trying now to cook a mwe using the https://github.com/ORNL/ReSolve library

YvanMokwinski commented 1 month ago

@kalmarek

Every iteration of our algorithm we need to solve Bx = b. The sparsity pattern does not change, so we need to perform the symbolic factorization only once. The numerical re-factorization only needs to be performed occasionally when the scaling R₁ and R₂ is modified.

Numerical parts A and P do not change, only R1 and R2? I feel iterative refactorization could be an elegant approach here.

https://repository.kaust.edu.sa/items/5454af3b-f0ee-4565-b5bd-f6a06f302b3f

Indeed I've already briefly talked to @kswirydo and I'm trying now to cook a mwe using the https://github.com/ORNL/ReSolve library

Are you aware that rocSOLVER offers a refactorization routine rocsolver_dcsrrf? That might help your cooking. The idea is to provide a refactorization algorithm, assuming a priori reordering and the creation of fill-ins have already been performed. No pivoting feature is offered. This is based on Gaussian elimination (with the use of rocsparse_dcsrilu0). Unfortunately, there is no option to use the iterative approach described in the paper mentioned above. Still, it is available in rocSPARSE with rocsparse_dcsritilu0 (Iterative ILU0).

YvanMokwinski commented 1 month ago

@kalmarek

Could you share some matrices corresponding to a few iterations?

kalmarek commented 1 month ago

@YvanMokwinski I usually work much higher up the stack (i.e. I don't really work directly with B, it is created for me), but I'll try to cook something up and get back to you!