NVIDIA / AMGX

Distributed multigrid linear solver library on GPU

Does AMGX support systems that have more than 2B unknowns? #145

Open stgeke opened 3 years ago

marsaev commented 3 years ago

The code instantiates only int as the index type for a rank partition or a single-GPU matrix. However, the global matrix (spanning multiple ranks) is indexed with int64 global indices: col_indices_global in the AMGX_matrix_upload_all_global() API is assumed to be of type int64_t.
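
To illustrate the distinction described above, here is a minimal Python sketch (not AmgX code). It assumes a simple contiguous block-row partition; the helpers `fits_int32` and `partition_offsets` are hypothetical names introduced only for this illustration. It shows that per-rank row counts can fit in a signed 32-bit int while the global indices cannot.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit int can hold


def fits_int32(value):
    """Return True if value is representable as a signed 32-bit int."""
    return -2**31 <= value <= INT32_MAX


def partition_offsets(n_global, n_ranks):
    """Row offsets of a contiguous block-row partition (assumed layout).

    Rank r owns global rows offsets[r] .. offsets[r + 1] - 1.
    """
    base, extra = divmod(n_global, n_ranks)
    offsets = [0]
    for rank in range(n_ranks):
        offsets.append(offsets[-1] + base + (1 if rank < extra else 0))
    return offsets


# Example: 4 096 000 000 unknowns spread over 64 ranks.
offsets = partition_offsets(4_096_000_000, 64)
local_rows = [offsets[r + 1] - offsets[r] for r in range(64)]

# Every local row count fits in a 32-bit int ...
local_ok = all(fits_int32(n) for n in local_rows)
# ... but the largest global row/column index does not.
global_ok = fits_int32(offsets[-1] - 1)
```

In this sketch each rank holds 64 000 000 rows, well inside the int range, while the largest global index needs a 64-bit type.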

stgeke commented 3 years ago

Looking at

https://github.com/NVIDIA/AMGX/blob/main/base/include/amgx_c.h#L570

It seems like nglobal is int?


marsaev commented 3 years ago

Good catch. I'm actually surprised that it's an int; I'm pretty sure we specifically had a case with >2B rows some time ago. We plan a few API changes this summer, and will test and incorporate a 64-bit number of rows.

MalachiTimothyPhillips commented 3 years ago

Any updates concerning this?

MalachiTimothyPhillips commented 2 years ago

Any updates concerning this?

Bump

mattmartineau commented 2 years ago

I'm working on some changes to the way that types are selected in AmgX as part of a larger piece of work to reduce some of the barriers to development of AmgX, while generally improving the user experience. The lack of support for 64-bit integer row counts is something I have also encountered in projects I have accelerated recently with AmgX, and see as a relatively high priority.

As such, I can confirm this is something I will address this year unless someone else gets there first.

pledac commented 1 year ago

Hello,

Just to note: to check whether AmgX can solve systems with more than 2B rows, I ran the amgx_mpi_poisson7 test. It runs with 2 048 000 000 DOF on 64 GPUs (A100 with 80 GB memory):

srun -n 64 amgx_mpi_poisson7 -mode dDDI -p 200 400 400 4 4 4 -c config
AMG Grid:
         Number of Levels: 7
            LVL         ROWS               NNZ  PARTS    SPRSTY       Mem (GB)
        ----------------------------------------------------------------------
           0(D)   1024000000        7160320000     32  6.83e-09            109
           1(D)    321638311       18106362477     32  1.75e-07            436
           2(D)     35075078        6134541300     32  4.99e-06            158
           3(D)      2230884         820160005     32  0.000165           27.2
           4(D)       129489          74410282     32   0.00444           4.77
           5(D)         6731           4243789     32    0.0937          0.874
           6(D)          441            122160     32     0.628         0.0717
         ----------------------------------------------------------------------
         Grid Complexity: 1.35066
         Operator Complexity: 4.51099
         Total Memory Usage: 736.832 GB
         ----------------------------------------------------------------------
           iter      Mem Usage (GB)       residual           rate
         ----------------------------------------------------------------------
            Ini             51.1258   3.200000e+04
              0             51.1258   6.606039e+05        20.6439
              1             51.1258   1.738751e+05         0.2632
              2             51.1258   9.409743e+04         0.5412
              3             51.1258   2.634899e+04         0.2800
              4             51.1258   9.909617e+03         0.3761
              5             51.1258   3.547899e+03         0.3580
              6             51.1258   1.150241e+03         0.3242
              7             51.1258   3.814637e+02         0.3316
              8             51.1258   1.296999e+02         0.3400
              9             51.1258   4.284031e+01         0.3303
             10             51.1258   1.477877e+01         0.3450
             11             51.1258   4.634529e+00         0.3136
             12             51.1258   1.435099e+00         0.3097
             13             51.1258   4.461860e-01         0.3109
             14             51.1258   1.468895e-01         0.3292
             15             51.1258   4.984580e-02         0.3393
             16             51.1258   1.533196e-02         0.3076
             17             51.1258   5.271090e-03         0.3438
             18             51.1258   1.782666e-03         0.3382
         ----------------------------------------------------------------------
         Total Iterations: 19
         Avg Convergence Rate:                   0.4152
         Final Residual:                   1.782666e-03
         Total Reduction in Residual:      5.570831e-08
         Maximum Memory Usage:                   51.126 GB
         ----------------------------------------------------------------------
Total Time: 20.3998
    setup: 19.4673 s
    solve: 0.932517 s
    solve(per iteration): 0.0490799 s

... but it crashed when doubling the mesh size (4 096 000 000 DOF):

srun -n 64 amgx_mpi_poisson7 -mode dDDI -p 400 400 400 4 4 4 -c config
AMGX ERROR: file /ccc/scratch/cont002/den/ledacp/trust/amgx_openmp_int64/ThirdPart/src/LIBAMGX/AmgX/src/amgx_c.cu line   2755
AMGX ERROR: Thrust failure.

With config:

config_version=2
solver(pcgf)=PCG
determinism_flag=1
pcgf:preconditioner(prec)=AMG
pcgf:use_scalar_norm=1
pcgf:max_iters=10000
pcgf:convergence=RELATIVE_INI_CORE
pcgf:tolerance=1e-7
pcgf:norm=L2
pcgf:print_solve_stats=1
pcgf:monitor_residual=1
pcgf:obtain_timings=1
prec:error_scaling=0
prec:print_grid_stats=1
prec:max_iters=1
prec:cycle=V
prec:min_coarse_rows=2
prec:max_levels=100
prec:smoother(smoother)=BLOCK_JACOBI
prec:presweeps=1
prec:postsweeps=1
prec:coarsest_sweeps=1
prec:coarse_solver(c_solver)=DENSE_LU_SOLVER
prec:dense_lu_num_rows=2
prec:algorithm=CLASSICAL
#prec:selector=HMIS
# Much faster for setup:
prec:selector=PMIS
prec:interpolator=D2
prec:strength=AHAT
smoother:relaxation_factor=0.8

I am using the latest AmgX version, so unfortunately I guess there has been no progress on this.

Thanks
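
As a quick sanity check on the sizes quoted above, the sketch below multiplies the per-rank grid by the process grid and compares the result with the signed 32-bit limit. It assumes, as the quoted DOF counts suggest, that the six `-p` values are the local grid dimensions followed by the process grid dimensions; `total_unknowns` is a hypothetical helper, not part of AmgX.

```python
INT32_MAX = 2**31 - 1  # 2 147 483 647


def total_unknowns(nx, ny, nz, px, py, pz):
    """Global unknown count: local nx*ny*nz grid on each of px*py*pz ranks."""
    return nx * ny * nz * px * py * pz


first = total_unknowns(200, 400, 400, 4, 4, 4)   # run that succeeded
second = total_unknowns(400, 400, 400, 4, 4, 4)  # run that crashed

# Compare each global size with the signed 32-bit range.
first_exceeds_int32 = first > INT32_MAX
second_exceeds_int32 = second > INT32_MAX
```

Under this assumption the first case is above 2 billion but still below 2^31 - 1, so only the second case exercises row counts beyond the signed 32-bit range.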

pledac commented 1 year ago

Oops, it works with:

srun -n 80 amgx_mpi_poisson7 -mode dDDI -p 300 400 400 5 4 4 -c config
         Number of Levels: 6
            LVL         ROWS               NNZ  PARTS    SPRSTY       Mem (GB)
        ----------------------------------------------------------------------
           0(D)   2560000000       17908480000     80  2.73e-09            274
           1(D)    793456455       44794412303     80  7.12e-08       1.09e+03
           2(D)     87410454       15526788091     80  2.03e-06            413
           3(D)      5507949        2071390723     80  6.83e-05           74.4
           4(D)       315703         193178409     80   0.00194           14.5
           5(D)        15796          12171321     80    0.0488           3.82
         ----------------------------------------------------------------------
         Grid Complexity: 1.34637
         Operator Complexity: 4.49544
         Total Memory Usage: 1871.14 GB
         ----------------------------------------------------------------------
           iter      Mem Usage (GB)       residual           rate
         ----------------------------------------------------------------------
            Ini             51.1425   5.059644e+04
              0             51.1425   1.742094e+06        34.4312
              1             51.1425   1.468481e+06         0.8429
              2             51.1425   5.676792e+05         0.3866
              3             51.1425   2.467121e+05         0.4346
              4             51.1425   6.569428e+04         0.2663
              5             51.1425   1.236653e+04         0.1882
              6             51.1425   3.125412e+03         0.2527
              7             51.1425   2.090049e+03         0.6687
              8             51.1425   7.675394e+02         0.3672
              9             51.1425   2.230874e+02         0.2907
             10             51.1425   5.848744e+01         0.2622
             11             51.1425   1.706309e+01         0.2917
             12             51.1425   6.254182e+00         0.3665
             13             51.1425   1.529237e+00         0.2445
             14             51.1425   3.225381e-01         0.2109
             15             51.1425   1.107974e-01         0.3435
             16             51.1425   4.094373e-02         0.3695
             17             51.1425   1.333030e-02         0.3256
             18             51.1425   6.989986e-03         0.5244
             19             51.1425   2.073903e-03         0.2967
         ----------------------------------------------------------------------
         Total Iterations: 20
         Avg Convergence Rate:               0.4272
         Final Residual:           2.073903e-03
         Total Reduction in Residual:      4.098912e-08
         Maximum Memory Usage:               51.143 GB
         ----------------------------------------------------------------------
Total Time: 25.8384
    setup: 24.7648 s
    solve: 1.0736 s
    solve(per iteration): 0.0536801 s

So the level-0 matrix has 2 560 000 000 rows (2.56B unknowns), which is greater than 2^31.

So, good news. Now I naively ask: what is the upper limit of AmgX? Thanks to anyone who can answer.

marsaev commented 1 year ago

Regarding the "upper limit of AmgX": there are two types of limitations, software (API parameter ranges) and hardware (memory capacity to fit the matrix).

The first is limited by the parameter types; if something is not enough for your case, let us know. Was your initial issue partially or fully fixed by the change mentioned in this issue?

For the second, it's hard to provide good estimates of peak memory usage for multigrid; it is very case- and configuration-dependent. Trial and error will give you a better picture here.
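
On the hardware side, the grid statistics printed earlier in the thread give one rough handle on how much the hierarchy adds on top of the fine-level matrix. A minimal sketch, assuming the usual definitions (sum over all levels divided by the level-0 value) and using the ROWS and NNZ columns copied from the first run's table; it only reproduces the reported ratios and is not a memory predictor.

```python
# ROWS and NNZ per level, copied from the first run's "AMG Grid" table.
rows = [1024000000, 321638311, 35075078, 2230884, 129489, 6731, 441]
nnz = [7160320000, 18106362477, 6134541300, 820160005,
       74410282, 4243789, 122160]


def complexity(per_level):
    """Sum over all levels divided by the finest-level (level 0) value."""
    return sum(per_level) / per_level[0]


grid_complexity = complexity(rows)      # ratio of total rows to level-0 rows
operator_complexity = complexity(nnz)   # ratio of total nnz to level-0 nnz

print(f"Grid Complexity: {grid_complexity:.5f}")
print(f"Operator Complexity: {operator_complexity:.5f}")
```

An operator complexity around 4.5 means the whole hierarchy stores roughly 4.5 times as many nonzeros as the original matrix, which is why the coarse levels, not only level 0, matter when judging whether a case fits in device memory.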

pledac commented 1 year ago

OK, thanks marsaev for your answer. I understand that one of the limitations is the memory capacity of each device (which probably caused the crash for the 400x400x400=64e6 cells mesh). I ran this test because I feared, based on @mattmartineau's comment, that AmgX could have issues with more than 2B rows. That seems not to be the case. I will confirm with my code, which uses AmgXWrapper + AmgX; for now I needed to fix some 64-bit integer issues in AmgXWrapper.
