NVIDIA / AMGX

Distributed multigrid linear solver library on GPU

[Issue] Solution depends on the number of levels of the AMG grid #305

Open fxuecnu opened 6 months ago

fxuecnu commented 6 months ago

The solution depends on the number of levels of the AMG grid.

I am encountering a convergence issue when using AMGX to solve a system of equations. With a smaller array, for which AMGX builds only 1 AMG level, the solver converges successfully:

Process 0 selecting device 0
AMGX version 2.2.0.132-opensource
Built on Nov 23 2023, 14:14:00
Compiled with CUDA Runtime 11.8, using CUDA driver 12.2
Cannot read file as JSON object, trying as AMGX config
Converting config string to current config version
Parsing configuration string: tolerance=1e-16 ;
Cannot read file as JSON object, trying as AMGX config
Converting config string to current config version
Parsing configuration string: exception_handling=1 ;
idx = 0, nnz = 0
Using Normal MPI (Hostbuffer) communicator...
AMG Grid:
         Number of Levels: 1
            LVL         ROWS               NNZ    SPRSTY       Mem (GB)
         --------------------------------------------------------------
           0(D)          227               901    0.0175       1.35e-05
         --------------------------------------------------------------
         Grid Complexity: 1
         Operator Complexity: 1
         Total Memory Usage: 1.34669e-05 GB
         --------------------------------------------------------------
           iter      Mem Usage (GB)       residual           rate
         --------------------------------------------------------------
            Ini             1.84131   1.002957e-02
              0             1.84131   3.473817e-17         0.0000
         --------------------------------------------------------------
         Total Iterations: 1
         Avg Convergence Rate:                   0.0000
         Final Residual:                   3.473817e-17
         Total Reduction in Residual:      3.463574e-15
         Maximum Memory Usage:                    1.841 GB
         --------------------------------------------------------------
Total Time: 0.00351504
    setup: 0.00237533 s
    solve: 0.00113971 s
    solve(per iteration): 0.00113971 s

However, as soon as I slightly increase the size of the array, so that AMGX builds 2 AMG levels, the solver fails to converge:

 AMGX version 2.2.0.132-opensource
Built on Nov 23 2023, 14:14:00
Compiled with CUDA Runtime 11.8, using CUDA driver 12.2
Cannot read file as JSON object, trying as AMGX config
Converting config string to current config version
Parsing configuration string: tolerance=1e-16 ;
Cannot read file as JSON object, trying as AMGX config
Converting config string to current config version
Parsing configuration string: exception_handling=1 ;
idx = 0, nnz = 0
Using Normal MPI (Hostbuffer) communicator...
AMG Grid:
         Number of Levels: 2
            LVL         ROWS               NNZ    SPRSTY       Mem (GB)
         --------------------------------------------------------------
           0(D)          229               909    0.0173       1.72e-05
           1(D)          128               412    0.0251        1.3e-05
         --------------------------------------------------------------
         Grid Complexity: 1.55895
         Operator Complexity: 1.45325
         Total Memory Usage: 3.02382e-05 GB
         --------------------------------------------------------------
           iter      Mem Usage (GB)       residual           rate
         --------------------------------------------------------------
            Ini             1.84131   1.002984e-02
              0             1.84131   1.002984e-02         1.0000
              1              1.8413   1.002984e-02         1.0000
              2              1.8413   1.002984e-02         1.0000
              3              1.8413   1.002984e-02         1.0000
              4              1.8413   1.002984e-02         1.0000
              5              1.8413   1.002984e-02         1.0000
              6              1.8413   1.002984e-02         1.0000
              7              1.8413   1.002984e-02         1.0000
              8              1.8413   1.002984e-02         1.0000
              9              1.8413   1.002984e-02         1.0000
             10              1.8413   1.002984e-02         1.0000
             11              1.8413   1.002984e-02         1.0000
             12              1.8413   1.002984e-02         1.0000
             13              1.8413   1.002984e-02         1.0000
             14              1.8413   1.002984e-02         1.0000
             15              1.8413   1.002984e-02         1.0000
             16              1.8413   1.002984e-02         1.0000
             17              1.8413   1.002984e-02         1.0000
             18              1.8413   1.002984e-02         1.0000
             19              1.8413   1.002984e-02         1.0000
             20              1.8413   1.002984e-02         1.0000
             21              1.8413   1.002984e-02         1.0000
             22              1.8413   1.002984e-02         1.0000
             23              1.8413   1.002984e-02         1.0000
             24              1.8413   1.002984e-02         1.0000
             25              1.8413   1.002984e-02         1.0000
             26              1.8413   1.002984e-02         1.0000
             27              1.8413   1.002984e-02         1.0000
             28              1.8413   1.002984e-02         1.0000
             29              1.8413   1.002984e-02         1.0000
             30              1.8413   1.002984e-02         1.0000
             31              1.8413   1.002984e-02         1.0000
             32              1.8413   1.002984e-02         1.0000
             33              1.8413   1.002984e-02         1.0000
             34              1.8413   1.002984e-02         1.0000
             35              1.8413   1.002984e-02         1.0000
             36              1.8413   1.002984e-02         1.0000
             37              1.8413   1.002984e-02         1.0000
             38              1.8413   1.002984e-02         1.0000
             39              1.8413   1.002984e-02         1.0000
             40              1.8413   1.002984e-02         1.0000
             41              1.8413   1.002984e-02         1.0000
             42              1.8413   1.002984e-02         1.0000
             43              1.8413   1.002984e-02         1.0000
             44              1.8413   1.002984e-02         1.0000
             45              1.8413   1.002984e-02         1.0000
             46              1.8413   1.002984e-02         1.0000
             47              1.8413   1.002984e-02         1.0000
             48              1.8413   1.002984e-02         1.0000
             49              1.8413   1.002984e-02         1.0000
             50              1.8413   1.002984e-02         1.0000
             51              1.8413   1.002984e-02         1.0000
             52              1.8413   1.002984e-02         1.0000
             53              1.8413   1.002984e-02         1.0000
             54              1.8413   1.002984e-02         1.0000
             55              1.8413   1.002984e-02         1.0000
             56              1.8413   1.002984e-02         1.0000
             57              1.8413   1.002984e-02         1.0000
             58              1.8413   1.002984e-02         1.0000
             59              1.8413   1.002984e-02         1.0000
             60              1.8413   1.002984e-02         1.0000
             61              1.8413   1.002984e-02         1.0000
             62              1.8413   1.002984e-02         1.0000
             63              1.8413   1.002984e-02         1.0000
             64              1.8413   1.002984e-02         1.0000
             65              1.8413   1.002984e-02         1.0000
             66              1.8413   1.002984e-02         1.0000
             67              1.8413   1.002984e-02         1.0000
             68              1.8413   1.002984e-02         1.0000
             69              1.8413   1.002984e-02         1.0000
             70              1.8413   1.002984e-02         1.0000
             71              1.8413   1.002984e-02         1.0000
             72              1.8413   1.002984e-02         1.0000
             73              1.8413   1.002984e-02         1.0000
             74              1.8413   1.002984e-02         1.0000
             75              1.8413   1.002984e-02         1.0000
             76              1.8413   1.002984e-02         1.0000
             77              1.8413   1.002984e-02         1.0000
             78              1.8413   1.002984e-02         1.0000
             79              1.8413   1.002984e-02         1.0000
             80              1.8413   1.002984e-02         1.0000
             81              1.8413   1.002984e-02         1.0000
             82              1.8413   1.002984e-02         1.0000
             83              1.8413   1.002984e-02         1.0000
             84              1.8413   1.002984e-02         1.0000
             85              1.8413   1.002984e-02         1.0000
             86              1.8413   1.002984e-02         1.0000
             87              1.8413   1.002984e-02         1.0000
             88              1.8413   1.002984e-02         1.0000
             89              1.8413   1.002984e-02         1.0000
             90              1.8413   1.002984e-02         1.0000
             91              1.8413   1.002984e-02         1.0000
             92              1.8413   1.002984e-02         1.0000
             93              1.8413   1.002984e-02         1.0000
             94              1.8413   1.002984e-02         1.0000
             95              1.8413   1.002984e-02         1.0000
             96              1.8413   1.002984e-02         1.0000
             97              1.8413   1.002984e-02         1.0000
             98              1.8413   1.002984e-02         1.0000
             99              1.8413   1.002984e-02         1.0000
         --------------------------------------------------------------
         Total Iterations: 100
         Avg Convergence Rate:                   1.0000
         Final Residual:                   1.002984e-02
         Total Reduction in Residual:      1.000000e+00
         Maximum Memory Usage:                    1.841 GB
         --------------------------------------------------------------
Total Time: 0.139064
    setup: 0.00440851 s
    solve: 0.134655 s
    solve(per iteration): 0.00134655 s

Environment information:

AMGX solver configuration

{ "config_version": 2, "solver": { "preconditioner": { "print_grid_stats": 1, "print_vis_data": 0, "solver": "AMG", "print_solve_stats": 0, "interpolator": "D2", "presweeps": 1, "max_iters": 1, "monitor_residual": 0, "store_res_history": 0, "scope": "amg", "cycle": "V", "postsweeps": 1 }, "solver": "FGMRES", "print_solve_stats": 1, "obtain_timings": 1, "max_iters": 100, "monitor_residual": 1, "gmres_n_restart": 20, "convergence": "RELATIVE_INI_CORE", "scope": "main", "tolerance": 0.0001, "norm": "L2" } }

Matrix Data

Attached: col_idxs.txt, rhs.txt, row_ptrs.txt, values.txt. Alternatively, you can use the following code to generate the matrix in CSR form and solve it with AMGX.

Reproduction steps

Below is the code snippet; you can copy the entire code into a .c file and run it directly. The only value you need to adjust is nx. When nx = 113, AMGX automatically uses only one level and the result converges; when nx = 114, the array is only slightly larger, AMGX happens to split the hierarchy into 2 levels, and the result no longer converges. Furthermore, as nx increases, it does not converge regardless of the number of levels.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>
#include "cuda_runtime.h"

#include <stdint.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

 /* CUDA error macro */
#define CUDA_SAFE_CALL(call) do {                                 \
  cudaError_t err = call;                                         \
  if(cudaSuccess != err) {                                        \
    fprintf(stderr, "Cuda error in file '%s' in line %i : %s.\n", \
            __FILE__, __LINE__, cudaGetErrorString( err) );       \
    exit(EXIT_FAILURE);                                           \
  } } while (0)

//#define AMGX_DYNAMIC_LOADING
//#undef AMGX_DYNAMIC_LOADING
#define MAX_MSG_LEN 4096

/* standard or dynamically load library */
#ifdef AMGX_DYNAMIC_LOADING
#include "amgx_capi.h"
#else
#include "amgx_c.h"
#endif

/* print error message and exit */
void errAndExit(const char* err)
{
    printf("%s\n", err);
    fflush(stdout);
    MPI_Abort(MPI_COMM_WORLD, 1);
    exit(1);
}

/* print callback (could be customized) */
void print_callback(const char* msg, int length)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) { printf("%s", msg); }
}

/* print usage and exit */
void printUsageAndExit()
{
    char msg[MAX_MSG_LEN] = "Usage: mpirun [-n nranks] ./example [-mode [dDDI | dDFI | dFFI]] [-p nx ny] [-c config_file] [-amg \"variable1=value1 ... variable3=value3\"] [-gpu] [-it k]\n";
    strcat(msg, "     -mode:                select the solver mode\n");
    strcat(msg, "     -p nx ny:             select x- and y-dimensions of the 2D (5-points) local discretization of the Poisson operator (the global problem size will be nranks*nx*ny)\n");
    strcat(msg, "     -c:                   set the amg solver options from the config file\n");
    strcat(msg, "     -amg:                 set the amg solver options from the command line\n");
    print_callback(msg, MAX_MSG_LEN);
    MPI_Finalize();
    exit(0);
}

/* parse parameters */
int findParamIndex(char** argv, int argc, const char* parm)
{
    int count = 0;
    int index = -1;

    for (int i = 0; i < argc; i++)
    {
        if (strncmp(argv[i], parm, 100) == 0)
        {
            index = i;
            count++;
        }
    }

    if (count == 0 || count == 1)
    {
        return index;
    }
    else
    {
        char msg[MAX_MSG_LEN];
        sprintf(msg, "ERROR: parameter %s has been specified more than once, exiting\n", parm);
        print_callback(msg, MAX_MSG_LEN);
        exit(1);
    }

    return -1;
}

double dhudx(double hw, double he, double uw, double ue, double dx)
{
    return (he * ue - hw * uw) / dx;
}

double ududx(double uw, double up, double ue, double dx)
{
    // up>=0 then return up*(up-uw)/dx, otherwise return up*(ue-up)/dx
    // use fmax and fmin to avoid branching
    return fmax(0., up) * (up - uw) / dx + fmin(0., up) * (ue - up) / dx;

}

double gdetadx(double gra, double etaw, double etae, double dx)
{
    return gra * (etae - etaw) / dx;
}

double friction(double gra, double mann, double h, double u)
{
    return gra * mann * mann / pow(h, 4.0 / 3.0) * fabs(u) * u;
}

void continuityXdirKernel(
    int n,
    int nx,
    int ny,
    double dx,
    int* ndd,
    double* eta,
    double* u,
    double* v,
    double* h,
    double* detadt,
    void* values,
    void* rhs,
    int* row_ptrs) {

    /* -------------------- continuity equation ---------------------------
    -----------------------detadt = dhu/dx + dhv/dy------------------------
    detadt is constant, so it is not included in the matrix. In other words,
    the pivot element is 0 for the continuity equation.
    -------------------------------------------------------------------- */

    for (int i = 0; i < n; i++) {

        int row = i / nx;                               // row index in the 2D work array       
        int col = i % nx;                               // col index in the 2D work array    
        int idx = 2 * col + 1 + (2 * nx + 1) * row;     // pivot index in the row of sparse matrix
        int nnz = row_ptrs[idx];                        // index of the first non-zero element in the row

        // open boundary condition for water level
        if (ndd[i] == 2) {

            ((double*)values)[nnz] = 1.0;
            ((double*)rhs)[idx] = - eta[i];

        }
        // inner points
        else {

            int iu = col + row * (nx + 1);  // index of u located at the left edge of the cell

            double dhudx0 = dhudx(h[iu], h[iu + 1], u[iu], u[iu + 1], dx);

            // left u coefficient
            ((double*)values)[nnz] = -h[iu] / dx;

            // midddle eta coefficient
            ((double*)values)[nnz + 1] = 2. * fabs(dhudx0 - detadt[i]);

            // right u coefficient
            ((double*)values)[nnz + 2] = h[iu + 1] / dx;

            //rhs values
            ((double*)rhs)[idx] = - dhudx0 + detadt[i];

        }

    }

}

void momentumXdirKernel(
    int n,
    int nx,
    int ny,
    double dx,
    double gra,
    double mann,
    double Du,
    int* ndd,
    double* eta,
    double* u,
    double* v,
    double* h,
    void* values,
    void* rhs,
    int* row_ptrs) {

    /* -------------------- momentum equation u ---------------------------
    * 0 = g * mann^2 / h^(4/3) * |uv| * u     // friction term
          + udu/dx + vdu/dy                   // advection term
          g * deta/dx                         // gravity term
          - d/dx (Du * du/dx)                 // diffusion term
    -------------------------------------------------------------------- */

    double epsu = 1e-7;
    double epse = 1e-9;

    //int i = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = 0; i < n + ny; i++) {

        int row = i / (nx + 1);                         // row index in the 2D work array       
        int col = i % (nx + 1);                         // col index in the 1D work array    
        int idx = 2 * col + (2 * nx + 1) * row;         // pivot index in the row of sparse matrix
        int nnz = row_ptrs[idx];                        // index of the first non-zero element in the row

        int ie = col + row * nx;    // index of eta located at the center of the cell

        // inner points
        if (col > 0 && col < nx) {

            double adv = ududx(u[i - 1], u[i], u[i + 1], dx);
            double slp  = gdetadx(gra, eta[ie - 1], eta[ie], dx);
            double frc  = friction(gra, mann, h[i], u[i]);

            double advw = ududx(u[i - 1] + epsu, u[i], u[i + 1], dx);
            double advp = ududx(u[i - 1], u[i] + epsu, u[i + 1], dx);
            double adve = ududx(u[i - 1], u[i], u[i + 1] + epsu, dx);

            double slpw = gdetadx(gra, eta[ie - 1] + epse, eta[ie], dx);
            double slpe = gdetadx(gra, eta[ie - 1], eta[ie] + epse, dx);

            double frcp = friction(gra, mann, h[i], u[i] + epsu);

            // left U
            ((double*)values)[nnz] = 1. / epsu * (advw - adv);

            // left eta
            ((double*)values)[nnz + 1] = 1. / epse * (slpw - slp);

            // center U
            ((double*)values)[nnz + 2] = 1. / epsu * (advp - adv)
                                       + 1. / epsu * (frcp - frc)
                                       + 2. * fabs(adv + slp + frc);

            // right eta
            ((double*)values)[nnz + 3] = 1. / epse * (slpe - slp);

            // right U
            ((double*)values)[nnz + 4] = 1. / epsu * (adve - adv);

            //rhs values
            ((double*)rhs)[idx] = - adv - slp - frc;

        }
        // left boundary
        else if (col == 0) {

            // open boundary condition
            if (ndd[ie] == 2) {

                // center U
                ((double*)values)[nnz] = 1.;

                // right U
                ((double*)values)[nnz + 1] = - 2.;

                // right right U
                ((double*)values)[nnz + 2] = 1.;

                //rhs values
                ((double*)rhs)[idx] = 2. * u[i + 1] - u[i] - u[i + 2];

            }

            // closed boundary condition
            else {

                // center U
                ((double*)values)[nnz] = 1.;

                //rhs values
                ((double*)rhs)[idx] = - u[i];

            }

        }
        // right boundary
        else if (col == nx) {

            // open boundary condition
            if (ndd[ie - 1] == 2) {

                // left left U
                ((double*)values)[nnz] = 1.;

                // left U
                ((double*)values)[nnz + 1] = -2.;

                // center U
                ((double*)values)[nnz + 2] = 1.;

                //rhs values
                ((double*)rhs)[idx] = 2. * u[i - 1] - u[i] - u[i - 2];

            }

            // closed boundary condition
            else {

                // center U
                ((double*)values)[nnz] = 1.;

                //rhs values
                ((double*)rhs)[idx] = - u[i];

            }

        }
    }

}

// 

int main(int argc, char** argv)
{
    //parameter parsing
    int pidx = 0;
    int pidy = 0;
    //MPI (with CUDA GPUs)
    int rank = 0;
    int lrank = 0;
    int nranks = 0;
    int n;
    int nx, ny;
    int gpu_count = 0;
    MPI_Comm amgx_mpi_comm = MPI_COMM_WORLD;
    //versions
    int major, minor;
    char* ver, * date, * time;
    //input matrix and rhs/solution
    int* partition_sizes = NULL;
    int* partition_vector = NULL;
    int partition_vector_size = 0;
    //library handles
    AMGX_Mode mode;
    AMGX_config_handle cfg;
    AMGX_resources_handle rsrc;
    AMGX_matrix_handle A;
    AMGX_vector_handle b, x;
    AMGX_solver_handle solver;
    //status handling
    AMGX_SOLVE_STATUS status;
    /* MPI init (with CUDA GPUs) */
    //MPI
    MPI_Init(&argc, &argv);
    MPI_Comm_size(amgx_mpi_comm, &nranks);
    MPI_Comm_rank(amgx_mpi_comm, &rank);
    //CUDA GPUs
    CUDA_SAFE_CALL(cudaGetDeviceCount(&gpu_count));
    lrank = rank % gpu_count;
    CUDA_SAFE_CALL(cudaSetDevice(lrank));
    printf("Process %d selecting device %d\n", rank, lrank);

    /* load the library (if it was dynamically loaded) */
#ifdef AMGX_DYNAMIC_LOADING
    void* lib_handle = NULL;
#ifdef _WIN32
    lib_handle = amgx_libopen("amgxsh.dll");
#else
    lib_handle = amgx_libopen("libamgxsh.so");
#endif

    if (lib_handle == NULL)
    {
        errAndExit("ERROR: can not load the library");
    }

    //load all the routines
    if (amgx_liblink_all(lib_handle) == 0)
    {
        amgx_libclose(lib_handle);
        errAndExit("ERROR: corrupted library loaded\n");
    }

#endif
    /* init */
    AMGX_SAFE_CALL(AMGX_initialize());
    AMGX_SAFE_CALL(AMGX_initialize_plugins());
    /* system */
    AMGX_SAFE_CALL(AMGX_register_print_callback(&print_callback));
    AMGX_SAFE_CALL(AMGX_install_signal_handler());

    /* get api and build info */
    if ((pidx = findParamIndex(argv, argc, "--version")) != -1)
    {
        AMGX_get_api_version(&major, &minor);
        printf("amgx api version: %d.%d\n", major, minor);
        AMGX_get_build_info_strings(&ver, &date, &time);
        printf("amgx build version: %s\nBuild date and time: %s %s\n", ver, date, time);
        AMGX_SAFE_CALL(AMGX_finalize_plugins());
        AMGX_SAFE_CALL(AMGX_finalize());
        /* close the library (if it was dynamically loaded) */
#ifdef AMGX_DYNAMIC_LOADING
        amgx_libclose(lib_handle);
#endif
        MPI_Finalize();
        exit(0);
    }

    mode = AMGX_mode_dDDI;
    int sizeof_m_val = ((AMGX_GET_MODE_VAL(AMGX_MatPrecision, mode) == AMGX_matDouble)) ? sizeof(double) : sizeof(float);
    int sizeof_v_val = ((AMGX_GET_MODE_VAL(AMGX_VecPrecision, mode) == AMGX_vecDouble)) ? sizeof(double) : sizeof(float);

    AMGX_SAFE_CALL(AMGX_config_create_from_file(&cfg, "FGMRES.json"));
    AMGX_SAFE_CALL(AMGX_config_add_parameters(&cfg, "tolerance=1e-16"));

    /* example of how to handle errors */
    //char msg[MAX_MSG_LEN];
    //AMGX_RC err_code = AMGX_resources_create(NULL, cfg, &amgx_mpi_comm, 1, &lrank);
    //AMGX_SAFE_CALL(AMGX_get_error_string(err_code, msg, MAX_MSG_LEN));
    //printf("ERROR: %s\n",msg);
    /* switch on internal error handling (no need to use AMGX_SAFE_CALL after this point) */
    AMGX_SAFE_CALL(AMGX_config_add_parameters(&cfg, "exception_handling=1"));
    /* create resources, matrix, vector and solver */
    AMGX_resources_create(&rsrc, cfg, &amgx_mpi_comm, 1, &lrank);
    AMGX_matrix_create(&A, rsrc, mode);
    AMGX_vector_create(&x, rsrc, mode);
    AMGX_vector_create(&b, rsrc, mode);
    AMGX_solver_create(&solver, rsrc, mode, cfg);
    //generate 3D Poisson matrix, [and rhs & solution]
    //WARNING: use 1 ring for aggregation and 2 rings for classical path
    int nrings; //=1; //=2;
    AMGX_config_get_default_number_of_rings(cfg, &nrings);
    //printf("nrings=%d\n",nrings);
    int nglobal = 0;

    nx = 114;
    ny = 1;
    n = nx * ny;
    nglobal = n * nranks;

    /* generate the matrix
       Note: unlike the original AMGX example (a 2D 5-point Poisson discretization),
       the code below assembles the sparse system for the 1D continuity and momentum
       kernels defined above, with eta and u unknowns interleaved so that each rank
       owns 2 * n + ny rows. */

    int* row_ptrs = (int*)malloc((2 * n + ny + 1) * sizeof(int));
    int64_t* col_idxs = (int64_t*)malloc(6 * (2 * n + ny) * sizeof(int64_t));

    void* values = malloc(6 * (2 * n + ny) * sizeof(double)); // maximum nnz
    int nnz = 0;
    int64_t count = 0;
    int64_t start_idx = rank * n;

    ///////////////////////////////////////////////////////
    ///////////////////////////////////////////////////////

    double dx = .5;
    double gra = 9.81;
    double mann = 0.012;
    double Du = 0.5;

    double Trange = 1.0;
    double Tperiod = 43200.0;

    int* ndd;
    double* u;
    double* v;
    double* eta;
    double* h;
    double* detadt;
    void* rhs;
    void* h_x;

    // Allocate memory on host
    ndd = (int*)malloc(n * sizeof(int));
    u = (double*)malloc((n + ny) * sizeof(double));
    v = (double*)malloc((n + nx) * sizeof(double));
    eta = (double*)malloc(n * sizeof(double));
    h = (double*)malloc((n + ny)* sizeof(double));
    detadt = (double*)malloc(n * sizeof(double));
    rhs = malloc((2 * n + ny) * sizeof(double));
    h_x = malloc((2 * n + ny) * sizeof(double));
    memset(h_x, 0, (2 * n + ny) * sizeof(double));

    //for (int i = 0; i < 2 * n + ny; i++) { 
    //     if (i % 2 == 0) {
    //         ((double*)h_x)[i] = 0.0;
    //     }
    //     else {
    //         ((double*)h_x)[i] = 0.0;
    //     }   
    // }

    // Initialize host memory
    for (int i = 0; i < n; i++) {

        int row = i / nx;                               // row index in the 2D work array   
        int col = i % nx;                               // col index in the 1D work array

        if (col == 0) {
            ndd[i] = 2;
        }
        else {
            ndd[i] = 0;
        }

        eta[i] = 0.0;
        detadt[i] = M_PI * Trange / Tperiod;

    }

    for (int i = 0; i < n + ny; i++) {
        h[i] = 2.0;
    }

    for (int i = 0; i < n + nx; i++) {
        v[i] = 0.0;
    }

    // set random value to u array
    for (int i = 0; i < n + ny; i++) {
        u[i] = 0.;
        u[i] = 0.01;
        //u[i] = (double)rand() / RAND_MAX * 2. - 1.;
    }

    // initialize sparse matrix (determine row_ptrs and col_idxs)

    nnz = 0;

    for (int row = 0; row < ny; row++) {

        for (int i = 0; i < 2 * nx + 1; i++) {

            // continuity equation
            if (i % 2 == 1) {

                int col = (i - 1) / 2;
                int idx = i + (2 * nx + 1) * row;
                int i0 = col + row * nx;
                row_ptrs[idx] = nnz;

                // open boundary
                if (ndd[i0] == 2) {

                    col_idxs[nnz++] = idx;

                }
                // inner points
                else {

                    col_idxs[nnz++] = idx - 1;
                    col_idxs[nnz++] = idx;
                    col_idxs[nnz++] = idx + 1;

                }

            }

            // momentum equation
            if (i % 2 == 0) {

                int col = i / 2;
                int idx = i + (2 * nx + 1) * row;
                int i0 = col + row * nx;
                row_ptrs[idx] = nnz;

                if (col > 0 && col < nx) {

                    col_idxs[nnz++] = idx - 2;
                    col_idxs[nnz++] = idx - 1;
                    col_idxs[nnz++] = idx;
                    col_idxs[nnz++] = idx + 1;
                    col_idxs[nnz++] = idx + 2;

                }
                else if (col == 0) {

                    if (ndd[i0] == 2) {

                        printf("idx = %d, nnz = %d\n", idx, nnz);

                        col_idxs[nnz++] = idx;
                        col_idxs[nnz++] = idx + 2;
                        col_idxs[nnz++] = idx + 4;

                    }
                    else {

                        col_idxs[nnz++] = idx;

                    }
                }
                else if (col == nx) {

                    if (ndd[i0 - 1] == 2) {

                        col_idxs[nnz++] = idx - 4;
                        col_idxs[nnz++] = idx - 2;
                        col_idxs[nnz++] = idx;

                    }
                    else {

                        col_idxs[nnz++] = idx;

                    }

                }

            }

        }

    }

    row_ptrs[2 * n + ny] = nnz;

    FILE* fp;

    // loop over 100
    for (int ik = 0; ik < 1; ik++) {

        continuityXdirKernel(n, nx, ny, dx, ndd, eta, u, v, h, detadt, (double*)values, rhs, row_ptrs);

        momentumXdirKernel(n, nx, ny, dx, gra, mann, Du, ndd, eta, u, v, h, (double*)values, rhs, row_ptrs);

        ///////////////////////////////////////////////////////
        ///////////////////////////////////////////////////////

        fp = fopen("values.bin", "wb");
        fwrite(values, sizeof(double), 6 * (2 * n + ny), fp);
        fclose(fp);

        fp = fopen("rhs.bin", "wb");
        fwrite(rhs, sizeof(double), (2 * n + ny), fp);
        fclose(fp);

        fp = fopen("row_ptrs.bin", "wb");
        fwrite(row_ptrs, sizeof(int), (2 * n + ny + 1), fp);
        fclose(fp);

        fp = fopen("col_idxs.bin", "wb");
        fwrite(col_idxs, sizeof(int64_t), 6 * (2 * n + ny), fp);
        fclose(fp);

        FILE* fp2;
        // write the matrix to a text file
        fp2 = fopen("values.txt", "w");
        for (int i = 0; i < 6 * (2 * n + ny); i++) {
            fprintf(fp2, "%f\n", ((double*)values)[i]);
        }
        fclose(fp2);

        fp2 = fopen("rhs.txt", "w");
        for (int i = 0; i < 2 * n + ny; i++) {
            fprintf(fp2, "%f\n", ((double*)rhs)[i]);
        }
        fclose(fp2);

        fp2 = fopen("row_ptrs.txt", "w");
        for (int i = 0; i < 2 * n + ny + 1; i++) {
            fprintf(fp2, "%d\n", row_ptrs[i]);
        }
        fclose(fp2);

        fp2 = fopen("col_idxs.txt", "w");
        for (int i = 0; i < 6 * (2 * n + ny); i++) {
            fprintf(fp2, "%d\n", col_idxs[i]);
        }
        fclose(fp2);

        AMGX_matrix_upload_all_global(A,
            2 * n + ny, 2 * n + ny, nnz, 1, 1,
            row_ptrs, col_idxs, values, NULL,
            nrings, nrings, NULL);

        /* generate the rhs and solution */

        //for (int i = 0; i < 2 * n + ny; i++) { 
        //    if (i % 2 == 0) {
        //        ((double*)h_x)[i] = 0.0;
        //    }
        //    else {
        //        ((double*)h_x)[i] = 0.0;
        //    }   
        //}

        /* set the connectivity information (for the vector) */
        AMGX_vector_bind(x, A);
        AMGX_vector_bind(b, A);
        /* upload the vector (and the connectivity information) */
        AMGX_vector_upload(x, 2 * n + ny, 1, h_x);
        AMGX_vector_upload(b, 2 * n + ny, 1, rhs);
        /* solver setup */
        //MPI barrier for stability (should be removed in practice to maximize performance)
        MPI_Barrier(amgx_mpi_comm);
        AMGX_solver_setup(solver, A);
        /* solver solve */
        //MPI barrier for stability (should be removed in practice to maximize performance)
        MPI_Barrier(amgx_mpi_comm);
        AMGX_solver_solve(solver, b, x);
        /* example of how to change parameters between non-linear iterations */
        //AMGX_config_add_parameters(&cfg, "config_version=2, default:tolerance=1e-12");
        //AMGX_solver_solve(solver, b, x);
        /* example of how to replace coefficients between non-linear iterations */
        //AMGX_matrix_replace_coefficients(A, n, nnz, values, diag);
        //AMGX_solver_setup(solver, A);
        //AMGX_solver_solve(solver, b, x);
        AMGX_solver_get_status(solver, &status);
        /* example of how to get (the local part of) the solution */
        //int sizeof_v_val;
        //sizeof_v_val = ((NVAMG_GET_MODE_VAL(NVAMG_VecPrecision, mode) == NVAMG_vecDouble))? sizeof(double): sizeof(float);
        // 
        void* SLN = malloc((2 * n + ny) * sizeof(double));
        AMGX_vector_download(x, SLN);

        // update SLN to eta
        for (int i = 0; i < n; i++) {
            int row = i / nx;                               // row index in the 2D work array       
            int col = i % nx;                               // col index in the 2D work array   
            int idx = 2 * col + 1 + (2 * nx + 1) * row;     // pivot index in the row of sparse matrix
            eta[i] = eta[i] + ((double*)SLN)[idx];
        }

        // update SLN to u
        for (int i = 0; i < n + ny; i++) {
            int row = i / (nx + 1);                               // row index in the 2D work array       
            int col = i % (nx + 1);                               // col index in the 2D work array   
            int idx = 2 * col + (2 * nx + 1) * row;         // pivot index in the row of sparse matrix
            u[i] = u[i] + ((double*)SLN)[idx];
        }

        free(SLN);

    }

    //write u and eta to binary file
    fp = fopen("u.bin", "wb");
    fwrite(u, sizeof(double), (n + ny), fp);
    fclose(fp);
    // write eta to binry file
    fp = fopen("eta.bin", "wb");
    fwrite(eta, sizeof(double), n, fp);
    fclose(fp);

    free(values);
    free(row_ptrs);
    free(col_idxs);

    //free(result_host);
    /* destroy resources, matrix, vector and solver */
    AMGX_solver_destroy(solver);
    AMGX_vector_destroy(x);
    AMGX_vector_destroy(b);
    AMGX_matrix_destroy(A);
    AMGX_resources_destroy(rsrc);
    /* destroy config (need to use AMGX_SAFE_CALL after this point) */
    AMGX_SAFE_CALL(AMGX_config_destroy(cfg));
    /* shutdown and exit */
    AMGX_SAFE_CALL(AMGX_finalize_plugins());
    AMGX_SAFE_CALL(AMGX_finalize());
    /* close the library (if it was dynamically loaded) */
#ifdef AMGX_DYNAMIC_LOADING
    amgx_libclose(lib_handle);
#endif
    MPI_Finalize();
    CUDA_SAFE_CALL(cudaDeviceReset());
    return status;
}


marsaev commented 4 months ago

Hello @fxuecnu ,

To explain the difference between those two runs: in the first case, with 1 AMG level, AMGX uses a direct solver for the coarsest level (which is the only level), so the system is solved directly. In the second case, however, the multigrid operators don't seem to perform well; it might be related to the fact that the off-diagonal elements are significantly larger than the diagonal ones. Do you see similar behaviour with larger matrices? If you only need to solve such small matrices, I would recommend a direct solve approach (i.e. factorize + solve).
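For anyone who wants to check this on their own data, below is a minimal sketch of such a test on the host-side CSR arrays. It reuses the row_ptrs / col_idxs / values naming from the reproduction code above; the function itself is illustrative and not part of the AMGX API.

#include <math.h>
#include <stdint.h>

/* Count rows of a scalar CSR matrix that are not diagonally dominant,
   i.e. rows where |a_ii| < sum over j != i of |a_ij|. */
static int count_non_diag_dominant_rows(int nrows, const int *row_ptrs,
                                        const int64_t *col_idxs,
                                        const double *values)
{
    int bad = 0;

    for (int row = 0; row < nrows; row++)
    {
        double diag = 0.0, offdiag = 0.0;

        for (int k = row_ptrs[row]; k < row_ptrs[row + 1]; k++)
        {
            if (col_idxs[k] == row) { diag += fabs(values[k]); }
            else                    { offdiag += fabs(values[k]); }
        }

        if (diag < offdiag) { bad++; }
    }

    return bad;
}

Calling count_non_diag_dominant_rows(2 * n + ny, row_ptrs, col_idxs, (double*)values) right before AMGX_matrix_upload_all_global would report how many equations violate diagonal dominance, which is the situation described above.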

fxuecnu commented 4 months ago

Hello @marsaev

Thank you for your response. Indeed, my matrices exhibit the scenario you described, with very small diagonal elements, and I see convergence issues even with larger matrices. I need to solve extremely large matrices, up to 4 million by 4 million. I have tried a direct solve, but it is too slow. Currently I upload the matrices to AMGX with a block size of 1x1; I could instead upload the elements as 3x3 blocks to ensure diagonal dominance at the block level (see the sketch below). Does AMGX have a suitable algorithm to handle this approach? Additionally, are there any other methods within AMGX that could leverage its high-performance capabilities for my situation?
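For reference, a rough sketch of what the blocked upload could look like with the same API call the reproduction code already uses; it assumes the usual AMGX block-CSR layout (row pointers and column indices refer to blocks, and values stores each 3x3 block contiguously) and has not been tested on this system. The names nb and nnzb are placeholders.

/* Sketch: upload the matrix as 3x3 blocks instead of 1x1 scalars.
   nb       - number of block rows on this rank (and globally, on a single rank)
   nnzb     - number of non-zero 3x3 blocks
   row_ptrs - block-row pointers, length nb + 1
   col_idxs - global block-column indices, length nnzb (int64_t)
   values   - nnzb * 3 * 3 doubles, one block stored contiguously after another */
AMGX_matrix_upload_all_global(A,
    nb,                 /* global number of block rows */
    nb,                 /* local number of block rows  */
    nnzb,               /* number of non-zero blocks   */
    3, 3,               /* block_dimx, block_dimy      */
    row_ptrs, col_idxs, values,
    NULL,               /* no separate diagonal data   */
    nrings, nrings, NULL);

With a block size larger than 1, block-aware smoothers such as BLOCK_JACOBI or MULTICOLOR_DILU operate on the 3x3 diagonal blocks, which is where block-level diagonal dominance would pay off.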


marsaev commented 4 months ago

@fxuecnu How would you solve this system on the CPU? You might want to try a GMRES-type solver, but multigrid may not be the best preconditioner in this case.

If a direct sparse solve works for you but you find it too slow, I can also suggest trying cuDSS: https://developer.nvidia.com/cudss
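For completeness, a non-multigrid variant of the configuration above, keeping FGMRES as the outer solver but using a multicolor DILU smoother as the preconditioner, might look like the sketch below. The solver and parameter names are standard AMGX options, but the specific values are untested assumptions, not a recommendation from this thread:

{
    "config_version": 2,
    "solver": {
        "solver": "FGMRES",
        "scope": "main",
        "max_iters": 1000,
        "gmres_n_restart": 50,
        "tolerance": 1e-08,
        "norm": "L2",
        "convergence": "RELATIVE_INI_CORE",
        "monitor_residual": 1,
        "print_solve_stats": 1,
        "obtain_timings": 1,
        "preconditioner": {
            "solver": "MULTICOLOR_DILU",
            "scope": "precond",
            "max_iters": 1
        }
    }
}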