NVIDIA / AMGX

Distributed multigrid linear solver library on GPU

[MultiGPU] No convergence for Classical AMG with CUDA version > 11.2 in my code when linked with AmgX (not reproduced in AmgX standalone) #251

Closed: pledac closed this issue 1 week ago

pledac commented 1 year ago

We have been using AmgX (through AmgXWrapper) for 2 years now, but we are facing an annoying issue. Our code runs fine with the CG solver and an Aggregated or Classical AMG preconditioner, in parallel with one or more GPUs.

But with CUDA version > 11.2, whereas Aggregated AMG still works, Classical AMG fails to converge with n GPUs (n > 1):

Using Normal MPI (Hostbuffer) communicator...
AMG Grid:
         Number of Levels: 7
            LVL         ROWS               NNZ  PARTS    SPRSTY       Mem (GB)
        ----------------------------------------------------------------------
           0(D)       125000            860000      2   5.5e-05         0.0139
           1(D)        62112            889140      2   0.00023         0.0234
           2(D)        30617           1911169      2   0.00204         0.0502
           3(D)         7452           1495312      2    0.0269         0.0429
           4(D)         1051            217215      2     0.197        0.00837
           5(D)           81              6045      2     0.921       0.000277
           6(D)           11               121      2         1       6.12e-06
         ----------------------------------------------------------------------
         Grid Complexity: 1.81059
         Operator Complexity: 6.25465
         Total Memory Usage: 0.139052 GB
         ----------------------------------------------------------------------
           iter      Mem Usage (GB)       residual           rate
         ----------------------------------------------------------------------
            Ini               1.742   1.200000e-06
              0               1.742   1.940205e-06         1.6168
              1              1.7420   1.942415e-06         1.0011
              2              1.7420   1.942423e-06         1.0000
              3              1.7420   1.942423e-06         1.0000
              4              1.7420   1.942423e-06         1.0000
              5              1.7420   1.942423e-06         1.0000
              6              1.7420   1.942423e-06         1.0000
              7              1.7420   1.942423e-06         1.0000
              8              1.7420   1.942423e-06         1.0000
              9              1.7420   1.942423e-06         1.0000
             10              1.7420   1.942423e-06         1.0000
             11              1.7420   1.942423e-06         1.0000
...

Here is the config file used:

# AmgX config file
config_version=2
solver(s)=PCG
s:convergence=ABSOLUTE
s:tolerance=1.000000e-20
s:preconditioner(p)=AMG
s:use_scalar_norm=1
p:error_scaling=0
p:print_grid_stats=1
p:max_iters=1
p:cycle=V
p:min_coarse_rows=2
p:max_levels=100
p:smoother(smoother)=BLOCK_JACOBI
p:presweeps=1
p:postsweeps=1
p:coarsest_sweeps=1
p:coarse_solver=DENSE_LU_SOLVER
p:dense_lu_num_rows=2
p:algorithm=CLASSICAL
p:selector=HMIS
p:interpolator=D2
p:strength=AHAT
smoother:relaxation_factor=0.8
s:print_config=1
s:store_res_history=1
s:monitor_residual=1
s:print_solve_stats=1
s:obtain_timings=1
s:max_iters=10000
determinism_flag=1

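For context, a config file like this is what gets passed to AmgX when the solver is initialized. With AmgXWrapper, that typically looks like the minimal sketch below (assuming AmgXWrapper's AmgXSolver::initialize(comm, mode, configFile) interface; the file name is illustrative):

    // Minimal sketch (not the reporter's actual code): handing the config file
    // above to AmgX through AmgXWrapper. "dDDI" selects the distributed,
    // double-precision, int-index mode.
    #include <petsc.h>
    #include <AmgXSolver.hpp>

    int main(int argc, char **argv)
    {
        PetscInitialize(&argc, &argv, nullptr, nullptr);
        {
            AmgXSolver solver;
            solver.initialize(PETSC_COMM_WORLD, "dDDI", "pcg_classical_amg.info");
            // ... setA(...) / solve(...) calls as in the snippet further below ...
            solver.finalize();
        }
        PetscFinalize();
        return 0;
    }
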
Unfortunately, I can't reproduce the issue with the AmgX or AmgXWrapper samples.

Has anyone else noticed this issue?

Thanks

pledac commented 1 year ago

The issue seems to come from the cuSPARSE library, because with CUDA > 11.2 but the CUDA 11.2 version of libcusparse, it works. So placing libcusparse.so.11.3.1.68 (as libcusparse.so.11) alongside libamgxsh.so in the same directory is a quick fix for me for now, as I really need Classical AMG.
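
To double-check which cuSPARSE build the dynamic linker actually picks up at runtime, a minimal sketch along these lines can help (only assuming the standard cusparseCreate / cusparseGetVersion / cusparseDestroy calls):

    // Print the cuSPARSE version that is actually loaded at runtime, to confirm
    // whether the libcusparse.so.11 override is in effect.
    // Build with, e.g.: nvcc check_cusparse_version.cpp -lcusparse
    #include <cstdio>
    #include <cusparse.h>

    int main()
    {
        cusparseHandle_t handle;
        if (cusparseCreate(&handle) != CUSPARSE_STATUS_SUCCESS)
        {
            std::printf("cusparseCreate failed\n");
            return 1;
        }
        int version = 0;
        cusparseGetVersion(handle, &version);  // integer encoding of major/minor/patch
        std::printf("Loaded cuSPARSE version: %d\n", version);
        cusparseDestroy(handle);
        return 0;
    }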

mattmartineau commented 1 year ago

Did you happen to try PMIS instead of HMIS?

I would be interested in hearing about your use case. Would you be happy to start a private email thread?

pledac commented 1 year ago

Thanks Matt for the reply; I'm ready to discuss privately and share my use case. In the meantime, I will try PMIS with Classical AMG. I forgot to say that I tried 2.2.0, 2.3.0, and the very latest main AmgX version, without success in fixing this issue.

pledac commented 1 year ago

Did you happen to try PMIS instead of HMIS?

PMIS behaves the same as HMIS for this issue.

marsaev commented 1 year ago

Unfortunately, I can't reproduce the issue with AmgX

You mean that solving the same matrix, with the same solver configuration, using one of the examples yields a different result? First, it would be great to confirm that the matrix is partitioned as expected in the way you upload it to AMGX (i.e. by comparing against the AMGX example). At the same time, it would be great to try to simplify the solver config to the point where the result matches the output of the cuSPARSE 11.2 version - this will help narrow down where something might have gone wrong (e.g. reduce the number of levels to 2, try changing smoothers and other solver parameters).
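
As an illustration of that simplification, a reduced config could look roughly like the sketch below (only an example of cutting the hierarchy to two levels and switching the selector, not a validated setup):

    # Reduced AmgX config sketch for narrowing down the issue (illustrative values)
    config_version=2
    solver(s)=PCG
    s:preconditioner(p)=AMG
    s:monitor_residual=1
    s:print_solve_stats=1
    s:max_iters=10000
    p:algorithm=CLASSICAL
    p:selector=PMIS
    p:max_levels=2
    p:smoother(smoother)=BLOCK_JACOBI
    p:coarse_solver=DENSE_LU_SOLVER
    p:print_grid_stats=1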

pledac commented 1 year ago

I mean that:
a) the issue (Classical AMG with multi-GPU on CUDA > 11.2) happens in my code with every kind of matrix;
b) I can't reproduce the issue when providing a matrix, with the same config, to solveFromFiles (the AmgXWrapper tool), which calls AmgX.

-> So I guess something is slightly different in the way my code is built with AmgX/CUDA > 11.2 compared to the build of AmgX or AmgXWrapper alone, since neither AmgX nor AmgXWrapper shows the same issue in my tests. So I changed the title to clarify.

I will try to reduce the solver config, thanks.

marsaev commented 1 year ago

What alarms me is that, if I understand correctly, changing the cuSPARSE library changes the behaviour.

I can't reproduce the issue when providing a matrix, with the same config, to solveFromFiles (the AmgXWrapper tool), which calls AmgX.

Other than the config, it's important to match how matrices are distributed across ranks - this likely triggers different paths. Do you use AmgXWrapper in your code too? What API do you use to upload the matrix to the GPU?

pledac commented 1 year ago

What alarms me is that, if I understand correctly, changing the cuSPARSE library changes the behaviour.

Yes, I confirm: using the CUDA 11.2 cuSPARSE (via LD_LIBRARY_PATH) does change the behaviour in my case. It is my only solution for the moment.

Other than the config, it's important to match how matrices are distributed across ranks - this likely triggers different paths. Do you use AmgXWrapper in your code too? What API do you use to upload the matrix to the GPU?

Yes, I am using AmgXWrapper in my code, with the new API from Matt to upload the matrix in CSR format:

...
  petscToCSR(MatricePetsc_, SolutionPetsc_, SecondMembrePetsc_);
  SolveurAmgX_.setA(nRowsGlobal, nRowsLocal, nNz, rowOffsets, colIndices, values, nullptr);
...
  SolveurAmgX_.solve(lhs, rhs, nRowsLocal);

pledac commented 1 year ago

To be sure, I just ran the poisson AmgXWrapper test again with the AmgX_CSR API on 2 GPUs with CUDA 11.4 and my config file (with Classical AMG), and it works fine.

So there is something wrong in my code that only produces an issue with cuSPARSE > 11.2... It has been bugging me for more than a year :-(

I will think about your sentence: "it's important to match how matrices are distributed across ranks - this likely triggers different paths"

marsaev commented 1 year ago

Just to note - it's possible that the cuSPARSE internal implementation also changed. If it introduced a regression or a bug, it would be great to try to catch it. Only two things from cuSPARSE are really used in AMGX (and only conditionally): SpMV and SpMM. To rule out SpMV, if it's impossible to export the matrix, one thing you can try is comparing a standalone SpMV on your matrix specifically, using the AMGX_matrix_vector_multiply API (example here: https://github.com/NVIDIA/AMGX/blob/main/examples/amgx_spmv_test.c, but replace https://github.com/NVIDIA/AMGX/blob/main/examples/amgx_spmv_test.c#L264 with CLASSICAL to build classical solver-like redistribution). There is no interface to that function in AmgXWrapper, but it can be added there. Then you can compare CUDA <= 11.2 vs CUDA > 11.2. If the result differs, it's something we can follow up on with the cuSPARSE team. If everything seems normal, we can think about what could be wrong with SpMM, but debugging that will need more effort.
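
For reference, a rough single-rank sketch of that kind of standalone SpMV check with the AmgX C API is below (a tiny illustrative matrix; a real multi-GPU comparison should follow the distributed setup in examples/amgx_spmv_test.c):

    // Rough single-rank SpMV sketch with the AmgX C API; run it against builds
    // linked with cuSPARSE 11.2 and > 11.2 and compare the downloaded result.
    #include <cstdio>
    #include <amgx_c.h>

    int main()
    {
        AMGX_initialize();
        AMGX_initialize_plugins();

        AMGX_config_handle cfg;
        AMGX_config_create(&cfg, "config_version=2");
        AMGX_resources_handle rsrc;
        AMGX_resources_create_simple(&rsrc, cfg);

        AMGX_matrix_handle A;
        AMGX_vector_handle x, y;
        AMGX_matrix_create(&A, rsrc, AMGX_mode_dDDI);
        AMGX_vector_create(&x, rsrc, AMGX_mode_dDDI);
        AMGX_vector_create(&y, rsrc, AMGX_mode_dDDI);

        // Tiny 2x2 CSR matrix [[2,1],[0,3]] as a stand-in for the real matrix.
        int    row_ptrs[] = {0, 2, 3};
        int    col_idx[]  = {0, 1, 1};
        double vals[]     = {2.0, 1.0, 3.0};
        double xh[]       = {1.0, 1.0};
        double yh[]       = {0.0, 0.0};

        AMGX_matrix_upload_all(A, 2, 3, 1, 1, row_ptrs, col_idx, vals, nullptr);
        AMGX_vector_upload(x, 2, 1, xh);
        AMGX_vector_upload(y, 2, 1, yh);

        AMGX_matrix_vector_multiply(A, x, y);  // y = A * x
        AMGX_vector_download(y, yh);
        std::printf("y = [%g, %g]\n", yh[0], yh[1]);  // expect [3, 3]

        AMGX_vector_destroy(x);
        AMGX_vector_destroy(y);
        AMGX_matrix_destroy(A);
        AMGX_resources_destroy(rsrc);
        AMGX_config_destroy(cfg);
        AMGX_finalize_plugins();
        AMGX_finalize();
        return 0;
    }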

pledac commented 1 year ago

Thanks, I will have a look and experiment with amgx_spmv_test.c as soon as I get some time, and will report back.

pledac commented 1 week ago

Just to say, the issue has gone away in my code, and I can't say why :-)