Distributed multigrid linear solver library on GPU

new version gives error on parsing *.JSON inputs (cuda/10.2.89) #99

Closed Jaberh closed 4 years ago

Jaberh commented 4 years ago

I am trying to integrate AMGX into an industrial code. If I build it in debug mode there are no parsing issues, but in release mode it gives the following error. The input file is AGGREGATION_JACOBI.json.

Converting config string to current config version
Error parsing parameter string: Incorrect config entry (number of equal signs is not 1) : "config_version": 2

To make stuff more interesting, if I use

AMGX_config_create(&m_config, "config_version=2,algorithm=AGGREGATION,selector=SIZE_2,print_grid_stats=1,max_iters=1000,monitor_residual=1,obtain_timings=1,print_solve_stats=1,print_grid_stats=1");

it refuses to print grid and solver stats but accepts the rest of the inputs.

marsaev commented 4 years ago

Hi @Jaberh, for every config (file or string) AMGX first tries to parse it as JSON, and if that fails it tries to process it as a 'plain text' config (see https://github.com/NVIDIA/AMGX/blob/master/doc/AMGX_Reference.pdf). I strongly recommend sticking to JSON configs as they are simpler to handle.

Converting config string to current config version Error parsing parameter string: Incorrect config entry (number of equal signs is not 1) : "config_version": 2

It seems that AMGX cannot parse your config as JSON for some reason, and then it cannot parse it as a plain text config either (because it is not one), which results in the failure. It is odd that Release and Debug differ in this way. Which host compiler do you use? Do you wish to use config files or config strings in your final code?

To make stuff more interesting, if I use AMGX_config_create(&m_config, "config_version=2,algorithm=AGGREGATION,selector=SIZE_2,print_grid_stats=1,max_iters=1000,monitor_residual=1,obtain_timings=1,print_solve_stats=1,print_grid_stats=1"); it refuses to print grid and solver stats but accepts the rest of the inputs.

This happens because AMGX parsed this config as plain text, which has a slightly different syntax: grid and solver stats parameters need to be explicitly specified for a solver. Again, I strongly recommend using JSON configs, but if you wish I can explain how to use plain text configs.
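
To illustrate why a JSON line trips the plain text fallback: the error message says each entry must contain exactly one equals sign. Below is a toy sketch of that kind of check. It is not AMGX's actual parser, and the names `has_single_equals` and `first_bad_entry` are hypothetical.

```cpp
#include <algorithm>
#include <sstream>
#include <string>

// Toy check (not AMGX code): a plain text entry has the form "key=value",
// so it must contain exactly one '='.
bool has_single_equals(const std::string& entry) {
    return std::count(entry.begin(), entry.end(), '=') == 1;
}

// Splits a comma-separated config string and returns the first entry that
// fails the check, or an empty string if every entry passes.
std::string first_bad_entry(const std::string& config) {
    std::istringstream in(config);
    std::string entry;
    while (std::getline(in, entry, ',')) {
        if (!has_single_equals(entry)) {
            return entry;
        }
    }
    return "";
}
```

Fed JSON text, `first_bad_entry` stops at the very first fragment, which contains a colon but no equals sign. That has the same shape as the entry quoted in the error message above.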

Jaberh commented 4 years ago

Hi Marsaev, my first choice was also the JSON file. The issue is that when I integrate AMGX into the bigger code using a JSON file, it does not iterate. I used the config file from your example: standalone it works perfectly, but once added to the bigger code the number of iterations stays at 0, with the same JSON file that works standalone. I think by now I have almost memorized your entire AMGX_Reference.pdf. I am using gcc/8.2.0 with mpich/3.2.1 for MPI and cuda/10.2.89.

marsaev commented 4 years ago

You can also provide JSON as a string. Here is an extension of the amgx_capi example that reads a JSON file and passes its contents to AMGX_config_create(); for me it works identically to using AMGX_config_create_from_file():

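        // needs <stdio.h>, <stdlib.h> and <string.h>
        FILE* fjson = fopen(argv[pidz + 1], "r");
        if (fjson == NULL) {
            fprintf(stderr, "Cannot open config file %s\n", argv[pidz + 1]);
            exit(1);
        }
        const size_t config_max_len = 4096;
        char config[config_max_len];
        config[0] = '\0';   // keep the buffer a valid string even if the file is empty

        // Append the file line by line; fgets() always leaves room for the '\0'.
        size_t used = 0;
        while (used + 1 < config_max_len &&
               fgets(config + used, (int)(config_max_len - used), fjson)) {
            used += strlen(config + used);
        }
        fclose(fjson);

        AMGX_SAFE_CALL(AMGX_config_create(&cfg, config));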
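
If the calling code is C++, a variant without a fixed-size buffer might look like the sketch below. `read_text_file` is a hypothetical helper, not part of AMGX, and the AMGX call appears only in a comment because it needs the library.

```cpp
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// Reads the whole file at `path` into a string; throws if it cannot be opened.
std::string read_text_file(const std::string& path) {
    std::ifstream in(path);
    if (!in) {
        throw std::runtime_error("cannot open " + path);
    }
    std::ostringstream contents;
    contents << in.rdbuf();
    return contents.str();
}

// Intended use (requires AMGX, so shown only as a comment):
//   std::string json = read_text_file("AGGREGATION_JACOBI.json");
//   AMGX_SAFE_CALL(AMGX_config_create(&cfg, json.c_str()));
```

Throwing on a missing file makes a wrong path visible immediately, instead of handing an empty string to the config parser.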

You can also hardcode the config. For example, here I copy-pasted the AGGREGATION_JACOBI.json config and sent it to AMGX_config_create:

        const char config[] = 
            "{                                                  "        
            "    \"config_version\": 2,                         "                                    
            "    \"determinism_flag\": 1,                       "                                    
            "    \"solver\": {                                  "                        
            "        \"print_grid_stats\": 1,                   "                                        
            "        \"algorithm\": \"AGGREGATION\",            "                                                
            "        \"obtain_timings\": 1,                     "                                        
            "        \"solver\": \"AMG\",                       "                                    
            "        \"smoother\": \"BLOCK_JACOBI\",            "                                                
            "        \"print_solve_stats\": 1,                  "                                        
            "        \"presweeps\": 2,                          "                                
            "        \"selector\": \"SIZE_2\",                  "                                        
            "        \"convergence\": \"RELATIVE_MAX_CORE\",    "                                                        
            "        \"coarsest_sweeps\": 2,                    "                                        
            "        \"max_iters\": 100,                        "                                    
            "        \"monitor_residual\": 1,                   "                                        
            "        \"min_coarse_rows\": 2,                    "                                        
            "        \"relaxation_factor\": 0.75,               "                                            
            "        \"scope\": \"main\",                       "                                    
            "        \"max_levels\": 1000,                      "                                    
            "        \"postsweeps\": 2,                         "                                    
            "        \"tolerance\": 0.1,                        "                                    
            "        \"norm\": \"L1\",                          "                                
            "        \"cycle\": \"V\"                           "                                
            "    }                                              "            
            "}                                                  ";

        AMGX_SAFE_CALL(AMGX_config_create(&cfg, config));
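
In C++11 or later, the same hardcoded config can be written as a raw string literal, which avoids escaping every quote. This is only a sketch: the constant name `kAggregationJacobiConfig` is hypothetical, and the settings are copied from the escaped literal above.

```cpp
#include <string>

// Raw string literal: everything between R"json( and )json" is taken
// verbatim, so the JSON quotes need no backslashes. The delimiter "json"
// is arbitrary.
const std::string kAggregationJacobiConfig = R"json({
    "config_version": 2,
    "determinism_flag": 1,
    "solver": {
        "print_grid_stats": 1,
        "algorithm": "AGGREGATION",
        "obtain_timings": 1,
        "solver": "AMG",
        "smoother": "BLOCK_JACOBI",
        "print_solve_stats": 1,
        "presweeps": 2,
        "selector": "SIZE_2",
        "convergence": "RELATIVE_MAX_CORE",
        "coarsest_sweeps": 2,
        "max_iters": 100,
        "monitor_residual": 1,
        "min_coarse_rows": 2,
        "relaxation_factor": 0.75,
        "scope": "main",
        "max_levels": 1000,
        "postsweeps": 2,
        "tolerance": 0.1,
        "norm": "L1",
        "cycle": "V"
    }
})json";

// Intended use (requires AMGX, so shown only as a comment):
//   AMGX_SAFE_CALL(AMGX_config_create(&cfg, kAggregationJacobiConfig.c_str()));
```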

and the output is also identical to what's above. For everything here I used the same gcc and a Release build.

Would any of those options work for you in your app?

Jaberh commented 4 years ago

Interesting, I copied your hard-coded version and the error is still the same as before:

Error parsing parameter string: Incorrect config entry (number of equal signs is not 1) : { "config_version": 2
AMGX ERROR: file *.cu line 215
AMGX ERROR: Incorrect amgx configuration provided.

The only thing I do differently is that I don't manually do the "lib_handle = amgx_libopen("libamgxsh.so");" call, although I use the dynamic lib. I don't think this should matter, and everything else is pretty much according to the manual. The debug version reads the config successfully but does zero iterations. Here is the output:

AMG Grid:
     Number of Levels: 6
        LVL         ROWS               NNZ    SPRSTY       Mem (GB)
       0(D)        10000             49600  0.000496       0.000848
       1(D)         4717             27857   0.00125        0.00083
       2(D)         2145             13539   0.00294       0.000397
       3(D)          990              6506   0.00664       0.000189
       4(D)          456              3002    0.0144       8.73e-05
       5(D)          213              1387    0.0306       3.74e-05
     Grid Complexity: 1.8521
     Operator Complexity: 2.05425
     Total Memory Usage: 0.00238962 GB
       iter      Mem Usage (GB)       residual           rate
        Ini            0.909485   2.199999e+03
     Total Iterations: 0
     Avg Convergence Rate:                 1.000000
     Final Residual:                   2.199999e+03
     Total Reduction in Residual:      1.000000e+00
     Maximum Memory Usage:                    0.909 GB

Total Time: 0.00571098
    setup: 0.00561062 s
    solve: 0.000100352 s
    solve(per iteration): 0 s

marsaev commented 4 years ago

Just to check, when you build Release AMGX, can you check that RAPIDJSON_DEFINED is in the build C flags? (make VERBOSE=1 for any AMGX library file)

Jaberh commented 4 years ago

I will check it now, but I remember that CMakeLists.txt forces it to 1; I will double check. By the way, thank you for all your help.

We did a fresh build, and now for a simple test case the first option works while the second one still gives the same error. The only change I made to your config file is to add "store_res_history": 1. With your proposed alternatives and the new build I do not get the read error. However, using any JSON file from your examples, such as the following,

    {
        "config_version": 2,
        "determinism_flag": 1,
        "solver": {
            "print_grid_stats": 1,
            "algorithm": "AGGREGATION",
            "obtain_timings": 1,
            "solver": "AMG",
            "smoother": "BLOCK_JACOBI",
            "print_solve_stats": 1,
            "presweeps": 2,
            "selector": "SIZE_2",
            "convergence": "RELATIVE_MAX_CORE",
            "coarsest_sweeps": 2,
            "max_iters": 100,
            "monitor_residual": 1,
            "min_coarse_rows": 2,
            "relaxation_factor": 0.75,
            "scope": "main",
            "max_levels": 1000,
            "postsweeps": 2,
            "tolerance": 0.1,
            "norm": "L1",
            "cycle": "V",
            "store_res_history": 1
        }
    }

I get zero iterations:

AMG Grid:
     Number of Levels: 7
        LVL         ROWS               NNZ    SPRSTY       Mem (GB)
       0(D)        12831             87115  0.000529        0.00135
       1(D)         5900             52052    0.0015        0.00142
       2(D)         2770             29606   0.00386       0.000785
       3(D)         1314             15616   0.00904       0.000407
       4(D)          621              7745    0.0201       0.000201
       5(D)          297              3695    0.0419       9.58e-05
       6(D)          140              1658    0.0846       4.13e-05
     Grid Complexity: 1.86057
     Operator Complexity: 2.26697
     Total Memory Usage: 0.00430242 GB
       iter      Mem Usage (GB)       residual           rate
        Ini            0.909485   1.071363e+05
     Total Iterations: 0
     Avg Convergence Rate:                 1.000000
     Final Residual:                   1.071363e+05
     **Total Reduction in Residual:      1.000000e+00**
     Maximum Memory Usage:                    0.909 GB

Total Time: 0.0183321
    setup: 0.0182201 s
    solve: 0.000112064 s
    solve(per iteration): 0 s

However, if I use

AMGX_config_create(&m_config,"config_version=2, solver(s1)=FGMRES, s1:preconditioner=BLOCK_JACOBI ,s1:max_iters=100,s1:convergence=RELATIVE_INI_CORE ,s1:norm=L2, s1:tolerance=1e-3 ,s1:monitor_residual=1,s1:gmres_n_restart=20");

it iterates and converges as follows:

CLASSICAL is not supported in AMGX_read_system_maps_one_ring.

(This message is also interesting, since with FGMRES there is no AMG and so nothing to choose between classical and aggregation.)

res( 1 )11.2807
res( 2 )6.69336
res( 3 )4.74958
res( 4 )3.69872
res( 5 )2.9753
res( 6 )2.25479
res( 7 )1.69783
res( 8 )1.29488
res( 9 )0.989278
res( 10 )0.742286
res( 11 )0.579798
res( 12 )0.47657
res( 13 )0.311112
res( 14 )0.195809

This definitely reduces the residual. Again, standalone both work perfectly; the problem occurs when I integrate it into the real code. I hope this helps with tracking down the issue. I can also talk on Zoom or whatever you prefer if you think it might help. To me it seems that, for whatever reason, the iteration kernel is not being launched.

Jaberh commented 4 years ago

did you get a chance to look at the above issue?

marsaev commented 4 years ago

Hey Jaberh,

I suspect none of your configs are parsed as expected. It would definitely help to print the exact solver components during solver construction to confirm the solver structure, but there is no such functionality at the moment. Can we try to identify the issue step by step? Can you try running the built-in example amgx_capi on the 2cubes_sphere.mtx matrix from https://suitesparse-collection-website.herokuapp.com/MM/Um/2cubes_sphere.tar.gz ? This would be the expected output:

$ examples/amgx_capi -m /tmp/2cubes_sphere/2cubes_sphere.mtx -c ../core/configs/AGGREGATION_JACOBI.json 
AMGX version
Built on Jun  5 2020, 16:50:40
Compiled with CUDA Runtime 10.2, using CUDA driver 11.0
Warning: No mode specified, using dDDI by default.
Reading data...
RHS vector was not found. Using RHS b=[1,…,1]^T
Solution vector was not found. Setting initial solution to x=[0,…,0]^T
Finished reading
AMG Grid:
         Number of Levels: 10
            LVL         ROWS               NNZ    SPRSTY       Mem (GB)
           0(D)       101492           1647264   0.00016         0.0214
           1(D)        48283            834307  0.000358         0.0208
           2(D)        23049            429889  0.000809         0.0106
           3(D)        11038            221914   0.00182        0.00545
           4(D)         5313            114457   0.00405        0.00279
           5(D)         2557             58789   0.00899        0.00143
           6(D)         1242             29810    0.0193       0.000722
           7(D)          602             14574    0.0402       0.000353
           8(D)          294              6970    0.0806       0.000169
           9(D)          143              3261     0.159       7.72e-05
         Grid Complexity: 1.91161
         Operator Complexity: 2.0405
         Total Memory Usage: 0.0638129 GB
           iter      Mem Usage (GB)       residual           rate
            Ini            0.900269   1.014920e+05
              0            0.900269   1.633251e+03         0.0161
         Total Iterations: 1
         Avg Convergence Rate:               0.0161
         Final Residual:           1.633251e+03
         Total Reduction in Residual:      1.609241e-02
         Maximum Memory Usage:                0.900 GB
Total Time: 0.0139016
    setup: 0.0120689 s
    solve: 0.00183274 s
    solve(per iteration): 0.00183274 s

marsaev commented 4 years ago

For the same config and input data, but for two ranks it should be:

$ mpirun -n 2  examples/amgx_mpi_capi -m /tmp/2cubes_sphere/2cubes_sphere.mtx -c ../core/configs/AGGREGATION_JACOBI.json 
Process 0 selecting device 0
Process 1 selecting device 1
AMGX version
Built on Jun 19 2020, 20:14:42
Compiled with CUDA Runtime 10.2, using CUDA driver 11.0
Warning: No mode specified, using dDDI by default.
Cannot read file as JSON object, trying as AMGX config
Converting config string to current config version
Parsing configuration string: exception_handling=1 ; 
Warning: No mode specified, using dDDI by default.
Using Normal MPI (Hostbuffer) communicator...
Reading matrix dimensions in file: /tmp/2cubes_sphere/2cubes_sphere.mtx
Reading data...
RHS vector was not found. Using RHS b=[1,…,1]^T
Solution vector was not found. Setting initial solution to x=[0,…,0]^T
Finished reading
Using Normal MPI (Hostbuffer) communicator...
Using Normal MPI (Hostbuffer) communicator...
Using Normal MPI (Hostbuffer) communicator...
AMG Grid:
         Number of Levels: 9
            LVL         ROWS               NNZ    SPRSTY       Mem (GB)
           0(D)       101492           1647264   0.00016          0.022
           1(D)        48168            832024  0.000359         0.0213
           2(D)        22955            428089  0.000812         0.0109
           3(D)        10967            220573   0.00183         0.0056
           4(D)         5277            113673   0.00408        0.00288
           5(D)         2541             57753   0.00894        0.00147
           6(D)         1221             28739    0.0193       0.000734
           7(D)          588             13920    0.0403        0.00036
           8(D)          285              6605    0.0813       0.000166
         Grid Complexity: 1.9065
         Operator Complexity: 2.03285
         Total Memory Usage: 0.0653328 GB
           iter      Mem Usage (GB)       residual           rate
            Ini            0.902222   1.014920e+05
              0            0.902222   1.637347e+03         0.0161
         Total Iterations: 1
         Avg Convergence Rate:               0.0161
         Final Residual:           1.637347e+03
         Total Reduction in Residual:      1.613277e-02
         Maximum Memory Usage:                0.902 GB
Total Time: 0.0273681
    setup: 0.0240261 s
    solve: 0.00334202 s
    solve(per iteration): 0.00334202 s

Jaberh commented 4 years ago

Hi Marsaev, thank you for the detailed example. I will give this a try. However, the standalone code works fine with numerous examples and configs; I verified those about a month ago. The issue is that when I integrate the interface and AMGX into a real industrial-scale code, for whatever reason, in parallel mode, it does not iterate. The same config works in the unit test with no problem. I am trying to schedule a meeting with NVIDIA through our company, as I think it is easier to demonstrate what is going on in a meeting. I will let you know once I test this later today.

marsaev commented 4 years ago

Is it possible somehow to load this example matrix into your industrial code to check if same solve can be repeated in your app? I.e. if you had something like this on each rank:

AMGX_config_create_from_file( ... ,"AGGREGATION_JACOBI.json")

replace it with:

AMGX_config_create_from_file( ... ,"AGGREGATION_JACOBI.json")

and check that each rank solves this system identically and that each rank's AMGX output is similar to the output of the standalone example. If your distributed setup is homogeneous, there shouldn't be any major differences (IIRC the one possible difference is the parallel reduction result, but it shouldn't affect the solve drastically).

If the output is somehow different, can you try logging every AMGX API call to check the order of calls and their parameters? Sorry, there is no built-in logging of API calls, but if you wrap AMGX calls with error checking you can do something like:

#define AMGX_SAFE_CALL(rc) \
    std::cout << #rc << std::endl;    \
    if (AMGX_RC_OK != (rc)) .....

AMGX_SAFE_CALL( AMGX_config_create_from_file(...) );

Jaberh commented 4 years ago

Hi Marsaev, I will work on this today. I have printed out all the return values as per your suggestion from every function, they are all 0's. I will try the above matrix ASAP.

marsaev commented 4 years ago

I have printed out all the return values as per your suggestion from every function, they are all 0's

It's great that there are no errors, but it would be good to see the actual order of the calls with their parameters, hence the #rc stringification in the macro.

Jaberh commented 4 years ago

Here is the output with parameters:

Warning:  using only 1 of 2 available GPUs
AMGX_config_create_from_file(&m_config, param_file)
AMGX_resources_create(&m_resources, m_config, static_cast<void*>(&m_amgx_comm), m_max_device_per_host, (const int*)(&m_device_id))
AMGX_matrix_create(&m_matrix, m_resources, m_mode)
AMGX_vector_create(&m_rhs, m_resources, m_mode)
AMGX_vector_create(&m_solution, m_resources, m_mode)
AMGX_solver_create(&m_solver, m_resources, m_mode, m_config)
AMGX_matrix_comm_from_maps_one_ring(m_matrix, m_allocated_halo_depth, m_num_nbrs, m_nbrs, m_send_size, m_send_map, m_recv_size, m_recv_map)
AMGX_matrix_upload_all(m_matrix, m_n, m_nnz, m_block_dimx, m_block_dimy, (const int*)row_ptrs, (const int*)col_indices, (const double*)data, (const double*)diag_data)
AMGX_vector_bind(m_rhs, m_matrix)
AMGX_vector_bind(m_solution, m_matrix)
AMGX_vector_upload(m_rhs, m_n_plus_ghost, m_block_size, rhs)
AMGX_vector_upload(m_solution, m_n_plus_ghost, m_block_size, rhs)
AMGX_solver_setup(m_solver, m_matrix)
AMG Grid:
     Number of Levels: 6
        LVL         ROWS               NNZ    SPRSTY       Mem (GB)
       0(D)        10000             49600  0.000496       0.000848
       1(D)         4730             25612   0.00114       0.000781
       2(D)         2182             13218   0.00278       0.000392
       3(D)         1000              6524   0.00652        0.00019
       4(D)          464              3058    0.0142       8.88e-05
       5(D)          211              1365    0.0307       3.68e-05
     Grid Complexity: 1.8587
     Operator Complexity: 2.00357
     Total Memory Usage: 0.00233688 GB

AMGX_solver_solve_with_0_initial_guess(m_solver, m_rhs, m_solution)
       iter      Mem Usage (GB)       residual           rate
        **Ini            0.909485   9.969297e-02**
     Total Iterations: 0
     Avg Convergence Rate:             1.000000
     Final Residual:           9.969297e-02
     Total Reduction in Residual:      1.000000e+00
     Maximum Memory Usage:                0.909 GB

Total Time: 0.0314893
    setup: 0.0293952 s
    solve: 0.00209405 s
    solve(per iteration): 0 s

Again, "0" iterations.

And here is the same case called from a unit test, with its output:

AMGX version
Built on Jun 6 2020, 20:32:19
Compiled with CUDA Runtime 10.2, using CUDA driver 10.2
 m_rank  0
 m_nRank  1
 m_ndevice  2
 m_nHost  1
Warning:  using only 1 of 2 available GPUs
AMGX_config_create_from_file(&m_config, param_file)
AMGX_resources_create(&m_resources, m_config, static_cast<void*>(&m_amgx_comm), m_max_device_per_host, (const int*)(&m_device_id))
AMGX_matrix_create(&m_matrix, m_resources, m_mode)
AMGX_vector_create(&m_rhs, m_resources, m_mode)
AMGX_vector_create(&m_solution, m_resources, m_mode)
AMGX_solver_create(&m_solver, m_resources, m_mode, m_config)
 sqrt 1
AMGX_matrix_comm_from_maps_one_ring(m_matrix, m_allocated_halo_depth, m_num_nbrs, m_nbrs, m_send_size, m_send_map, m_recv_size, m_recv_map)
AMGX_matrix_upload_all(m_matrix, m_n, m_nnz, m_block_dimx, m_block_dimy, (const int*)row_ptrs, (const int*)col_indices, (const double*)data, (const double*)diag_data)
AMGX_vector_bind(m_rhs, m_matrix)
AMGX_vector_bind(m_solution, m_matrix)
AMGX_vector_upload(m_rhs, m_n_plus_ghost, m_block_size, rhs)
AMGX_vector_upload(m_solution, m_n_plus_ghost, m_block_size, rhs)
AMGX_solver_setup(m_solver, m_matrix)
AMG Grid:
     Number of Levels: 6
        LVL         ROWS               NNZ    SPRSTY       Mem (GB)
       0(D)        10000             49600  0.000496       0.000848
       1(D)         4730             25612   0.00114       0.000781
       2(D)         2182             13218   0.00278       0.000392
       3(D)         1000              6524   0.00652        0.00019
       4(D)          464              3058    0.0142       8.88e-05
       5(D)          211              1365    0.0307       3.68e-05
     Grid Complexity: 1.8587
     Operator Complexity: 2.00357
     Total Memory Usage: 0.00233688 GB

AMGX_solver_solve_with_0_initial_guess(m_solver, m_rhs, m_solution)
       iter      Mem Usage (GB)       residual           rate
        **Ini            0.909485   9.969297e-02**
          0            0.909485   9.867779e-02         0.9898
          1              0.9095   3.703747e-02         0.3753
          2              0.9095   1.228826e-02         0.3318
          3              0.9095   4.846841e-03         0.3944
          4              0.9095   2.128492e-03         0.4392
          5              0.9095   9.672328e-04         0.4544
          6              0.9095   4.243034e-04         0.4387
          7              0.9095   1.947642e-04         0.4590
          8              0.9095   8.401128e-05         0.4313
          9              0.9095   3.381623e-05         0.4025
         10              0.9095   1.696738e-05         0.5018
         11              0.9095   7.558131e-06         0.4455
         12              0.9095   3.316206e-06         0.4388
         13              0.9095   1.404204e-06         0.4234
         14              0.9095   6.186404e-07         0.4406
         15              0.9095   2.787829e-07         0.4506
         16              0.9095   1.260950e-07         0.4523
         17              0.9095   4.818015e-08         0.3821
         18              0.9095   1.409761e-08         0.2926
         19              0.9095   8.047044e-09         0.5708
         20              0.9095   5.687245e-09         0.7067
         21              0.9095   3.133990e-09         0.5511
         22              0.9095   1.435876e-09         0.4582
         23              0.9095   5.912854e-10         0.4118
         24              0.9095   2.313654e-10         0.3913
         25              0.9095   9.721457e-11         0.4202
         26              0.9095   3.978193e-11         0.4092
         27              0.9095   1.710693e-11         0.4300
         28              0.9095   7.397181e-12         0.4324
         29              0.9095   3.755933e-12         0.5078
         30              0.9095   2.206292e-12         0.5874
         31              0.9095   1.030078e-12         0.4669
         32              0.9095   4.370271e-13         0.4243
         33              0.9095   1.771387e-13         0.4053
         34              0.9095   7.296272e-14         0.4119
     Total Iterations: 35
     Avg Convergence Rate:               0.4501
     Final Residual:           7.296272e-14
     Total Reduction in Residual:      7.318742e-13
     Maximum Memory Usage:                0.909 GB

Total Time: 0.159129
    setup: 0.0357844 s
    solve: 0.123344 s
    solve(per iteration): 0.00352412 s
 stat of solve 0
AMGX_vector_download(m_solution, dest)
err 8.38997e-05

As you can see, the grid data as well as the initial residual are identical.

marsaev commented 4 years ago

Just to check - are there AMGX_initialize() and AMGX_finalize() calls in the code?

Jaberh commented 4 years ago

Yes, a singleton class is responsible for initialization and cleanup. As this library will be used by other developers, I wanted to prevent multiple calls. I did not wrap initialize and finalize in the safe-call macro, which is why they are not shown there.

marsaev commented 4 years ago

Alright, those API calls look good.

I made some progress today. There are double try/catch blocks inside the code: the real issue is caught within the code, but the error reported to the C API is incorrectly identified.

Jaberh commented 4 years ago

Do you suspect that this might be an issue related to something lower level than the programming, such as the build, CUDA installation, drivers, ...?

marsaev commented 4 years ago

@Jaberh sorry, yesterday for some reason I couldn't add comments to this thread. I pushed a fix https://github.com/NVIDIA/AMGX/commit/7b4d431e67e9f86746166a4dae8de6434a78ac5a to the v2.1.x branch. Can you try the update?

Answering your question: no, this is purely an AMGX bug.

Jaberh commented 4 years ago

Hi marsaev, I am pulling it right now and will let you know. Thanks for all your support and the replies.

Jaberh commented 4 years ago

I built this one, which has the latest commit. Unfortunately nothing changed: the unit test still works fine and the integrated version does not iterate, with the same output as above.

commit 7b4d431e67e9f86746166a4dae8de6434a78ac5a
Author: Marat Arsaev <marsaev@nvidia.com>
Date:   Thu Jun 25 23:01:31 2020 +0300

    Disabling deferred tasks

marsaev commented 4 years ago

Got it.

Can you check what solver status is returned by AMGX_solver_get_status(solver, &status); after AMGX_solver_solve?

Also, can you try adding "solver_verbose": 1 for both the solver and the smoother in the config, so something like this for AGGREGATION_JACOBI:

{
    "config_version": 2,
    "determinism_flag": 1,
    "solver": {
        "print_grid_stats": 1,
        "algorithm": "AGGREGATION",
        "obtain_timings": 1,
        "solver": "AMG",
        "smoother": {
            "solver" : "BLOCK_JACOBI",
            "scope" : "jacobi",
            "solver_verbose" : 1
        },
        "print_solve_stats": 1,
        "presweeps": 2,
        "selector": "SIZE_2",
        "convergence": "RELATIVE_MAX_CORE",
        "coarsest_sweeps": 2,
        "max_iters": 100,
        "monitor_residual": 1,
        "min_coarse_rows": 2,
        "relaxation_factor": 0.75,
        "scope": "main",
        "max_levels": 1000,
        "postsweeps": 2,
        "tolerance": 0.1,
        "norm": "L1",
        "cycle": "V",
        "solver_verbose" : 1
    }
}

you should see something like this in the output:

Parameters for solver: AMG with scope name: main

AMG solver settings:
cycle_iters = 2
norm = L1
presweeps = 2
postsweeps = 2
max_levels = 1000
coarsen_threshold = 1
min_fine_rows = 1
min_coarse_rows = 2
coarse_solver_d: DENSE_LU_SOLVER with scope name default
coarse_solver_h: DENSE_LU_SOLVER with scope name default
max_iters = 100
scaling = NONE
norm = L1
convergence = RELATIVE_MAX_CORE
solver_verbose= 1
use_scalar_norm = 0
print_solve_stats = 1
print_grid_stats = 1
print_vis_data = 0
monitor_residual = 1
store_res_history = 0
obtain_timings = 1

Parameters for solver: BLOCK_JACOBI with scope name: jacobi

relaxation_factor= 0.9
max_iters = 100
scaling = NONE
norm = L2
convergence = ABSOLUTE
solver_verbose= 1
use_scalar_norm = 0
print_solve_stats = 0
print_grid_stats = 0
print_vis_data = 0
monitor_residual = 0
store_res_history = 0
obtain_timings = 0

The info for jacobi should be repeated for each AMG level.

Jaberh commented 4 years ago

Hi Marat. Sure, I have been checking the solver return value, which is 0. Here is the output using your new config file:

global initializer called

AMGX version
Built on Jun 25 2020, 15:44:52
Compiled with CUDA Runtime 10.2, using CUDA driver 10.2
 m_rank  0
 m_nRank  1
 m_ndevice  2
 m_nHost  1
Warning:  using only 1 of 2 available GPUs
AMGX_config_create_from_file(&m_config, param_file)
AMGX_resources_create(&m_resources, m_config, static_cast<void*>(&m_amgx_comm), m_max_device_per_host, (const int*)(&m_device_id))
AMGX_matrix_create(&m_matrix, m_resources, m_mode)
AMGX_vector_create(&m_rhs, m_resources, m_mode)
AMGX_vector_create(&m_solution, m_resources, m_mode)
AMGX_solver_create(&m_solver, m_resources, m_mode, m_config)
 sqrt 1
AMGX_matrix_comm_from_maps_one_ring(m_matrix, m_allocated_halo_depth, m_num_nbrs, m_nbrs, m_send_size, m_send_map, m_recv_size, m_recv_map)
AMGX_matrix_upload_all(m_matrix, m_n, m_nnz, m_block_dimx, m_block_dimy, (const int*)row_ptrs, (const int*)col_indices, (const double*)data, (const double*)diag_data)
AMGX_vector_bind(m_rhs, m_matrix)
AMGX_vector_bind(m_solution, m_matrix)
AMGX_vector_upload(m_rhs, m_n_plus_ghost, m_block_size, rhs)
AMGX_vector_upload(m_solution, m_n_plus_ghost, m_block_size, rhs)
AMGX_solver_setup(m_solver, m_matrix)
Parameters for solver: AMG with scope name: main

AMG solver settings:
cycle_iters = 2
norm = L1
presweeps = 2
postsweeps = 2
max_levels = 1000
coarsen_threshold = 1
min_fine_rows = 1
min_coarse_rows = 2
coarse_solver_d: DENSE_LU_SOLVER with scope name default
coarse_solver_h: DENSE_LU_SOLVER with scope name default
max_iters = 100
scaling = NONE
norm = L1
convergence = RELATIVE_MAX_CORE
solver_verbose= 1
use_scalar_norm = 0
print_solve_stats = 1
print_grid_stats = 1
print_vis_data = 0
monitor_residual = 1
store_res_history = 0
obtain_timings = 1

Parameters for solver: BLOCK_JACOBI with scope name: jacobi

relaxation_factor= 0.9
max_iters = 100
scaling = NONE
norm = L2
convergence = ABSOLUTE
solver_verbose= 1
use_scalar_norm = 0
print_solve_stats = 0
print_grid_stats = 0
print_vis_data = 0
monitor_residual = 0
store_res_history = 0
obtain_timings = 0

Parameters for solver: BLOCK_JACOBI with scope name: jacobi

relaxation_factor= 0.9
max_iters = 100
scaling = NONE
norm = L2
convergence = ABSOLUTE
solver_verbose= 1
use_scalar_norm = 0
print_solve_stats = 0
print_grid_stats = 0
print_vis_data = 0
monitor_residual = 0
store_res_history = 0
obtain_timings = 0

Parameters for solver: BLOCK_JACOBI with scope name: jacobi

relaxation_factor= 0.9
max_iters = 100
scaling = NONE
norm = L2
convergence = ABSOLUTE
solver_verbose= 1
use_scalar_norm = 0
print_solve_stats = 0
print_grid_stats = 0
print_vis_data = 0
monitor_residual = 0
store_res_history = 0
obtain_timings = 0

Parameters for solver: BLOCK_JACOBI with scope name: jacobi

relaxation_factor= 0.9
max_iters = 100
scaling = NONE
norm = L2
convergence = ABSOLUTE
solver_verbose= 1
use_scalar_norm = 0
print_solve_stats = 0
print_grid_stats = 0
print_vis_data = 0
monitor_residual = 0
store_res_history = 0
obtain_timings = 0

Parameters for solver: BLOCK_JACOBI with scope name: jacobi

relaxation_factor= 0.9
max_iters = 100
scaling = NONE
norm = L2
convergence = ABSOLUTE
solver_verbose= 1
use_scalar_norm = 0
print_solve_stats = 0
print_grid_stats = 0
print_vis_data = 0
monitor_residual = 0
store_res_history = 0
obtain_timings = 0

AMG Grid:
         Number of Levels: 6
            LVL         ROWS               NNZ    SPRSTY       Mem (GB)
           0(D)        10000             49600  0.000496       0.000848
           1(D)         4730             25612   0.00114       0.000781
           2(D)         2182             13218   0.00278       0.000392
           3(D)         1000              6524   0.00652        0.00019
           4(D)          464              3058    0.0142       8.88e-05
           5(D)          211              1365    0.0307       3.68e-05
         Grid Complexity: 1.8587
         Operator Complexity: 2.00357
         Total Memory Usage: 0.00233688 GB
AMGX_solver_solve_with_0_initial_guess(m_solver, m_rhs, m_solution)
           iter      Mem Usage (GB)       residual           rate
            Ini            0.911438   7.998657e+00
         Total Iterations: 0
         Avg Convergence Rate:             1.000000
         Final Residual:           7.998657e+00
         Total Reduction in Residual:      1.000000e+00
         Maximum Memory Usage:                0.911 GB
Total Time: 0.0246834
    setup: 0.0222985 s
    solve: 0.00238486 s
    solve(per iteration): 0 s
 **stat of solve 0**
AMGX_vector_download(m_solution, dest)
err 1.00176
 global destructor called
marsaev commented 4 years ago

Looks all right. You are saying that it does iterate with the same config with the same matrix in the unit test, but not in the app code, right?

Jaberh commented 4 years ago

exactly, here is the same config for unit test,

global initializer called.

AMGX version
Built on Jun 25 2020, 15:44:52
Compiled with CUDA Runtime 10.2, using CUDA driver 10.2
 m_rank  0
 m_nRank  1
 m_ndevice  2
 m_nHost  1
Warning:  using only 1 of 2 available GPUs
AMGX_config_create_from_file(&m_config, param_file)
AMGX_resources_create(&m_resources, m_config, static_cast<void*>(&m_amgx_comm), m_max_device_per_host, (const int*)(&m_device_id))
AMGX_matrix_create(&m_matrix, m_resources, m_mode)
AMGX_vector_create(&m_rhs, m_resources, m_mode)
AMGX_vector_create(&m_solution, m_resources, m_mode)
AMGX_solver_create(&m_solver, m_resources, m_mode, m_config)
 sqrt 1
AMGX_matrix_comm_from_maps_one_ring(m_matrix, m_allocated_halo_depth, m_num_nbrs, m_nbrs, m_send_size, m_send_map, m_recv_size, m_recv_map)
AMGX_matrix_upload_all(m_matrix, m_n, m_nnz, m_block_dimx, m_block_dimy, (const int*)row_ptrs, (const int*)col_indices, (const double*)data, (const double*)diag_data)
AMGX_vector_bind(m_rhs, m_matrix)
AMGX_vector_bind(m_solution, m_matrix)
AMGX_vector_upload(m_rhs, m_n_plus_ghost, m_block_size, rhs)
AMGX_vector_upload(m_solution, m_n_plus_ghost, m_block_size, rhs)
AMGX_solver_setup(m_solver, m_matrix)
Parameters for solver: AMG with scope name: main

AMG solver settings:
cycle_iters = 2
norm = L1
presweeps = 2
postsweeps = 2
max_levels = 1000
coarsen_threshold = 1
min_fine_rows = 1
min_coarse_rows = 2
coarse_solver_d: DENSE_LU_SOLVER with scope name default
coarse_solver_h: DENSE_LU_SOLVER with scope name default
max_iters = 100
scaling = NONE
norm = L1
convergence = RELATIVE_MAX_CORE
solver_verbose= 1
use_scalar_norm = 0
print_solve_stats = 1
print_grid_stats = 1
print_vis_data = 0
monitor_residual = 1
store_res_history = 0
obtain_timings = 1

Parameters for solver: BLOCK_JACOBI with scope name: jacobi

relaxation_factor= 0.9
max_iters = 100
scaling = NONE
norm = L2
convergence = ABSOLUTE
solver_verbose= 1
use_scalar_norm = 0
print_solve_stats = 0
print_grid_stats = 0
print_vis_data = 0
monitor_residual = 0
store_res_history = 0
obtain_timings = 0

Parameters for solver: BLOCK_JACOBI with scope name: jacobi

relaxation_factor= 0.9
max_iters = 100
scaling = NONE
norm = L2
convergence = ABSOLUTE
solver_verbose= 1
use_scalar_norm = 0
print_solve_stats = 0
print_grid_stats = 0
print_vis_data = 0
monitor_residual = 0
store_res_history = 0
obtain_timings = 0

Parameters for solver: BLOCK_JACOBI with scope name: jacobi

relaxation_factor= 0.9
max_iters = 100
scaling = NONE
norm = L2
convergence = ABSOLUTE
solver_verbose= 1
use_scalar_norm = 0
print_solve_stats = 0
print_grid_stats = 0
print_vis_data = 0
monitor_residual = 0
store_res_history = 0
obtain_timings = 0

Parameters for solver: BLOCK_JACOBI with scope name: jacobi

relaxation_factor= 0.9
max_iters = 100
scaling = NONE
norm = L2
convergence = ABSOLUTE
solver_verbose= 1
use_scalar_norm = 0
print_solve_stats = 0
print_grid_stats = 0
print_vis_data = 0
monitor_residual = 0
store_res_history = 0
obtain_timings = 0

Parameters for solver: BLOCK_JACOBI with scope name: jacobi

relaxation_factor= 0.9
max_iters = 100
scaling = NONE
norm = L2
convergence = ABSOLUTE
solver_verbose= 1
use_scalar_norm = 0
print_solve_stats = 0
print_grid_stats = 0
print_vis_data = 0
monitor_residual = 0
store_res_history = 0
obtain_timings = 0

AMG Grid:
         Number of Levels: 6
            LVL         ROWS               NNZ    SPRSTY       Mem (GB)
           0(D)        10000             49600  0.000496       0.000848
           1(D)         4730             25612   0.00114       0.000781
           2(D)         2182             13218   0.00278       0.000392
           3(D)         1000              6524   0.00652        0.00019
           4(D)          464              3058    0.0142       8.88e-05
           5(D)          211              1365    0.0307       3.68e-05
         Grid Complexity: 1.8587
         Operator Complexity: 2.00357
         Total Memory Usage: 0.00233688 GB
AMGX_solver_solve_with_0_initial_guess(m_solver, m_rhs, m_solution)
           iter      Mem Usage (GB)       residual           rate
            Ini            0.911438   7.998657e+00
              0            0.911438   8.579094e+00         1.0726
              1              0.9114   8.166217e+00         0.9519
              2              0.9114   7.458036e+00         0.9133
              3              0.9114   6.721179e+00         0.9012
              4              0.9114   6.026819e+00         0.8967
              5              0.9114   5.392025e+00         0.8947
              6              0.9114   4.818757e+00         0.8937
              7              0.9114   4.304031e+00         0.8932
              8              0.9114   3.842848e+00         0.8928
              9              0.9114   3.430228e+00         0.8926
             10              0.9114   3.061298e+00         0.8924
             11              0.9114   2.731515e+00         0.8923
             12              0.9114   2.436909e+00         0.8921
             13              0.9114   2.173792e+00         0.8920
             14              0.9114   1.938851e+00         0.8919
             15              0.9114   1.729142e+00         0.8918
             16              0.9114   1.541975e+00         0.8918
             17              0.9114   1.374959e+00         0.8917
             18              0.9114   1.225957e+00         0.8916
             19              0.9114   1.093047e+00         0.8916
             20              0.9114   9.744872e-01         0.8915
             21              0.9114   8.687460e-01         0.8915
             22              0.9114   7.744420e-01         0.8914
         Total Iterations: 23
         Avg Convergence Rate:               0.9035
         Final Residual:           7.744420e-01
         Total Reduction in Residual:      9.682150e-02
         Maximum Memory Usage:                0.911 GB
Total Time: 0.0520445
    setup: 0.0229698 s
    solve: 0.0290748 s
    solve(per iteration): 0.00126412 s
 stat of solve 0
AMGX_vector_download(m_solution, dest)
err 0.0657659
 global destructor called
marsaev commented 4 years ago

I don't have any idea why this might be happening without getting hands-on with the code, considering that standalone execution works properly.

Is it for both release and debug? If it happens in debug - can you try tracing with gdb through solve process and see where early exit happens? Function of interest would be https://github.com/NVIDIA/AMGX/blob/d0019e5d32e99e7d679b3b773cf16b6f8e7da6f9/base/src/solvers/solver.cu#L589 and, in particular, this solve iteration loop: https://github.com/NVIDIA/AMGX/blob/d0019e5d32e99e7d679b3b773cf16b6f8e7da6f9/base/src/solvers/solver.cu#L798

Jaberh commented 4 years ago

OK, this is very odd to me as well. I will build a debug version and monitor that function. It might be beneficial to share this with a build specialist at NVIDIA as well. Thank you.

marsaev commented 4 years ago

For debugging purposes you can set the coarse solver and the smoother to NOSOLVER, so that only the AMG solver goes through this piece of code. AMG would still iterate in that case, but the residual should not decrease. That should be enough to debug the issue where it is not iterating at all. For example:

{
    "config_version": 2,
    "determinism_flag": 1,
    "solver": {
        "print_grid_stats": 1,
        "algorithm": "AGGREGATION",
        "obtain_timings": 1,
        "solver": "AMG",
        "smoother": {
            "solver": "NOSOLVER",
            "scope": "jacobi"
        },
        "coarse_solver": {
            "solver": "NOSOLVER",
            "scope": "dense"
        },
        "print_solve_stats": 1,
        "presweeps": 2,
        "selector": "SIZE_2",
        "convergence": "RELATIVE_MAX_CORE",
        "coarsest_sweeps": 2,
        "max_iters": 5,
        "monitor_residual": 1,
        "min_coarse_rows": 2,
        "relaxation_factor": 0.75,
        "scope": "main",
        "max_levels": 1000,
        "postsweeps": 2,
        "tolerance": 0.1,
        "norm": "L1",
        "cycle": "V"
    }
}

Jaberh commented 4 years ago

I quickly tried this, with no luck. I will try it in the debugger as well. I was going to put the breakpoint in the loop you mentioned above.

         Number of Levels: 12
            LVL         ROWS               NNZ    SPRSTY       Mem (GB)
           0(D)        10000             49600  0.000496       0.000848
           1(D)         4730             25612   0.00114       0.000781
           2(D)         2182             13218   0.00278       0.000392
           3(D)         1000              6524   0.00652        0.00019
           4(D)          464              3058    0.0142       8.88e-05
           5(D)          211              1365    0.0307       3.98e-05
           6(D)           97               609    0.0647       1.79e-05
           7(D)           45               269     0.133       8.03e-06
           8(D)           21               117     0.265       3.58e-06
           9(D)           10                50       0.5       1.59e-06
          10(D)            5                21      0.84       7.15e-07
          11(D)            2                 4         1       1.79e-07
         Grid Complexity: 1.8767
         Operator Complexity: 2.02514
         Total Memory Usage: 0.00237192 GB
AMGX_solver_solve_with_0_initial_guess(m_solver, m_rhs, m_solution)
           iter      Mem Usage (GB)       residual           rate
            Ini            0.802063   7.998657e+00
         Total Iterations: 0
         Avg Convergence Rate:             1.000000
         Final Residual:           7.998657e+00
         Total Reduction in Residual:      1.000000e+00
         Maximum Memory Usage:                0.802 GB
Total Time: 0.158886
    setup: 0.14185 s
    solve: 0.0170352 s
    solve(per iteration): 0 s
 stat of solve 0
Jaberh commented 4 years ago

Hi Marat. So I am trying to build this in debug mode, so I can track down the iterations in cuda-gdb but I get the following error

/lib/../lib64/crti.o: In function `_init':
(.init+0x7): relocation truncated to fit: R_X86_64_GOTPCREL against undefined symbol `__gmon_start__'
tools/centos/6/gcc/8.2.0/lib/gcc/x86_64-pc-linux-gnu/8.2.0/crtbeginS.o: In function `deregister_tm_clones':
crtstuff.c:(.text+0x3): relocation truncated to fit: R_X86_64_PC32 against `.tm_clone_table'
crtstuff.c:(.text+0xa): relocation truncated to fit: R_X86_64_PC32 against symbol `__TMC_END__' defined in .nvFatBinSegment section in libamgxsh.so
crtstuff.c:(.text+0x16): relocation truncated to fit: R_X86_64_GOTPCREL against undefined symbol `_ITM_deregisterTMCloneTable'
/tools/centos/6/gcc/8.2.0/lib/gcc/x86_64-pc-linux-gnu/8.2.0/crtbeginS.o: In function `register_tm_clones':
crtstuff.c:(.text+0x33): relocation truncated to fit: R_X86_64_PC32 against `.tm_clone_table'
crtstuff.c:(.text+0x3a): relocation truncated to fit: R_X86_64_PC32 against symbol `__TMC_END__' defined in .nvFatBinSegment section in libamgxsh.so
crtstuff.c:(.text+0x57): relocation truncated to fit: R_X86_64_GOTPCREL against undefined symbol `_ITM_registerTMCloneTable'
/tools/centos/6/gcc/8.2.0/lib/gcc/x86_64-pc-linux-gnu/8.2.0/crtbeginS.o: In function `__do_global_dtors_aux':
crtstuff.c:(.text+0x72): relocation truncated to fit: R_X86_64_PC32 against `.bss'
crtstuff.c:(.text+0x7d): relocation truncated to fit: R_X86_64_GOTPCREL against symbol `__cxa_finalize@@GLIBC_2.2.5' defined in .text section in /lib64/libc.so.6
crtstuff.c:(.text+0x8d): relocation truncated to fit: R_X86_64_PC32 against symbol `__dso_handle' defined in .data.rel.local section in /tools/centos/6/gcc/8.2.0/lib/gcc/x86_64-pc-linux-gnu/8.2.0/crtbeginS.o
crtstuff.c:(.text+0x99): additional relocation overflows omitted from the output
libamgxsh.so: PC-relative offset overflow in PLT entry for `_ZN9__gnu_cxx13new_allocatorISt13_Rb_tree_nodeISt4pairIKPN4amgx11CWrapHandleIP25AMGX_vector_handle_structNS3_6VectorINS3_14TemplateConfigIL16AMGX_MemorySpace1EL17AMGX_VecPrecision0EL17AMGX_MatPrecision1EL17AMGX_IndPrecision2EEEEEEESt10shared_ptrISF_EEEE8allocateEmPK
marsaev commented 4 years ago

It seems that the produced code is too large for the linker to process.

There are a number of ways to reduce the amount of generated code, but that might be a little too adventurous :) I can try to provide you with a debug build on centos:centos6.9 with devtoolset-8 - would that work? Which CUDA/MPI are you compiling against?

Jaberh commented 4 years ago

openmpi/3.1.3, cuda/10.2.89

Jaberh commented 4 years ago

here is the comparison of what test unit and real code outputs calling the same method:

AMG solver settings:
cycle_iters = 2
norm = L2
presweeps = 2
postsweeps = 2
max_levels = 1000
coarsen_threshold = 1
min_fine_rows = 1
min_coarse_rows = 2
coarse_solver_d: DENSE_LU_SOLVER with scope name default
coarse_solver_h: DENSE_LU_SOLVER with scope name default
max_iters = 2
scaling = NONE
norm = L2
convergence = RELATIVE_MAX_CORE
solver_verbose= 1
use_scalar_norm = 0
print_solve_stats = 1
print_grid_stats = 1
print_vis_data = 0
monitor_residual = 1
store_res_history = 1
obtain_timings = 1

AMG Grid:
         Number of Levels: 6
            LVL         ROWS               NNZ    SPRSTY       Mem (GB)
           0(D)        10000             49600  0.000496       0.000848
           1(D)         4730             25612   0.00114       0.000781
           2(D)         2182             13218   0.00278       0.000392
           3(D)         1000              6524   0.00652        0.00019
           4(D)          464              3058    0.0142       8.88e-05
           5(D)          211              1365    0.0307       3.68e-05
         Grid Complexity: 1.8587
         Operator Complexity: 2.00357
         Total Memory Usage: 0.00233688 GB
AMGX_solver_solve_with_0_initial_guess(m_solver, m_rhs, m_solution)
           iter      Mem Usage (GB)       residual           rate
            Ini            0.911438   9.969297e-02
 max iters ???????????? 2
called converged() line 295  monitor convergence 1
 converged():  1
 **done 1**
AMG solver settings:
cycle_iters = 2 
norm = L2
presweeps = 2 
postsweeps = 2 
max_levels = 1000
coarsen_threshold = 1 
min_fine_rows = 1 
min_coarse_rows = 2 
coarse_solver_d: DENSE_LU_SOLVER with scope name default
coarse_solver_h: DENSE_LU_SOLVER with scope name default
max_iters = 2 
scaling = NONE
norm = L2
convergence = RELATIVE_MAX_CORE
solver_verbose= 1
use_scalar_norm = 0 
print_solve_stats = 1 
print_grid_stats = 1 
print_vis_data = 0 
monitor_residual = 1 
store_res_history = 1 
obtain_timings = 1 

AMG Grid:
         Number of Levels: 6
            LVL         ROWS               NNZ    SPRSTY       Mem (GB)
           0(D)        10000             49600  0.000496       0.000848
           1(D)         4730             25612   0.00114       0.000781
           2(D)         2182             13218   0.00278       0.000392
           3(D)         1000              6524   0.00652        0.00019
           4(D)          464              3058    0.0142       8.88e-05
           5(D)          211              1365    0.0307       3.68e-05
         Grid Complexity: 1.8587
         Operator Complexity: 2.00357
         Total Memory Usage: 0.00233688 GB
AMGX_solver_solve_with_0_initial_guess(m_solver, m_rhs, m_solution)
           iter      Mem Usage (GB)       residual           rate
            Ini            0.911438   9.969297e-02
 max iters ???????????? 2
called converged()  monitor convergence 1
 converged():  0
 **done 0**

I also noticed that this function

template<class TConfig>
bool AbsoluteConvergence<TConfig>::convergence_update_and_check(const PODVec_h &nrm, const PODVec_h &nrm_ini)
{
    // Debug print added while investigating the early exit.
    printf("Check tolerance: %16.16lf norm_size %d\n", this->m_tolerance, (int)nrm.size());
    bool res_converged = true;
    bool res_converged_rel = true;

    for (int i = 0; i < nrm.size(); i++)
    {
        bool conv = nrm[i] < this->m_tolerance;
        res_converged = res_converged && conv;
        bool conv_rel = nrm[i] < Epsilon_conv<ValueTypeB>::value() * nrm_ini[i];
        res_converged_rel = res_converged_rel && conv_rel;
        // Debug print; the arguments follow the order of the labels.
        printf("nrm %lf nrm_ini %lf Epsilon_conv %lf  \n", nrm[i], nrm_ini[i], Epsilon_conv<ValueTypeB>::value());
    }

    //   printf("res_converged_rel %d  \n", res_converged_rel);

    if (res_converged_rel)
    {
        std::stringstream ss;
        ss << "Relative residual has reached machine precision" << std::endl;
        amgx_output(ss.str().c_str(), static_cast<int>(ss.str().length()));
        return true;
    }

    return res_converged;
}

reports the following in both cases:
Check tolerance: 0.0000000000000000 norm_size 0
Additionally, this method
`m_convergence->convergence_update_and_check(m_nrm, m_nrm_ini)`
returns true for the real code and false for the unit test with the same inputs.

Assuming that the initial residual is the same in both cases, converged() should return the same boolean in both. In the real code, however, it reports convergence although the initial residual is huge. If I remove the !done check from the iteration loop, and hence force the given number of iterations, it converges to the desired tolerance. Let me know what you think, thanks.

Jaberh commented 4 years ago

solver_verbose does not print out the tolerance, so I tracked down why that function returns differently. Apparently the tolerance is not set correctly when it is read: it shows up as "tol 1e+298", whereas the unit test reads it correctly as "tol 1e-10". Everything else is read correctly. I think it is a good idea to add the tolerance to the solver_verbose items, as that would catch errors like this pretty easily. Having said that, I am not sure why only this parameter is being read wrong. If I hard-code the tolerance with this->m_tolerance=1.e-10; it converges.
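
To illustrate why a corrupted tolerance stops the solver before the first iteration, here is a minimal sketch. It is not the AMGX implementation: `absolute_converged` and `print_tolerance` are hypothetical helpers that only mimic an "every norm component is below the tolerance" test and the kind of logging suggested above.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical stand-in for an absolute convergence test:
// every component of the residual norm must be below the tolerance.
bool absolute_converged(const std::vector<double> &nrm, double tolerance)
{
    bool converged = true;
    for (double value : nrm)
    {
        converged = converged && (value < tolerance);
    }
    return converged;
}

// Hypothetical helper: log the tolerance up front so that a garbage
// value is visible before the solve starts.
void print_tolerance(double tolerance)
{
    std::printf("tolerance = %e\n", tolerance);
}
```

With the initial residual from the logs (7.998657e+00), a tolerance of 1e+298 makes the check succeed immediately, while 1e-10 does not, which matches the "Total Iterations: 0" symptom.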

marsaev commented 4 years ago

Great progress! Still, it would be good to understand where the incorrect read comes from: the JSON parser, or a modification somewhere later. I have sent you a debug binary in case you are willing to track this issue further using gdb.

Jaberh commented 4 years ago

Sure, I am happy to help debug the issue, as I am past overdue to get this to work. Did you email it?

Jaberh commented 4 years ago

All the doubles are wrong, basically tolerance and relaxation. It should be related to Type AMG_Config::getParameter(const string &name, const string &current_scope), which returns doubles with extremely big exponents (e307 ...). The JSON parser parses correctly, as the issue is still there when I hard-code the config parameters. Here is the output from import_json_object; the read is OK:

    name relaxation_factor value 7.5e+307
    Parsing parameter with name "max_levels" of type Number
    Parsing as int
    Parsing parameter with name "postsweeps" of type Number
    Parsing as int
    Parsing parameter with name "tolerance" of type Number
    Parsing as double
    name tolerance value 1e+298

    double GetDouble() const {
        if ((flags_ & kDoubleFlag) != 0)  return data_.n.d;            // exact type, no conversion.
        if ((flags_ & kIntFlag) != 0)     return data_.n.i.i;          // int -> double
        if ((flags_ & kUintFlag) != 0)    return data_.n.u.u;          // unsigned -> double
        if ((flags_ & kInt64Flag) != 0)   return (double)data_.n.i64;  // int64_t -> double (may lose precision)
        RAPIDJSON_ASSERT((flags_ & kUint64Flag) != 0);
        return (double)data_.n.u64;                                    // uint64_t -> double (may lose precision)
    }

Just to confirm that parsing is OK, I printed out the params being processed in
AMGX_ERROR AMG_Config::parse_json_file(const char *filename); they are identical for the unit test and the real code:
AMGX_config_create_from_file(&m_config, param_file)
 params {
    "config_version": 2,
    "determinism_flag": 1,
    "solver": {
        "print_grid_stats": 1,
        "algorithm": "AGGREGATION",
        "obtain_timings": 1,
        "solver": "AMG",
        "smoother": "BLOCK_JACOBI",
        "print_solve_stats": 1,
        "presweeps": 2,
        "selector": "SIZE_2",
        "convergence": "RELATIVE_MAX_CORE",
        "coarsest_sweeps": 2,
        "max_iters": 2,
        "monitor_residual": 1,
        "min_coarse_rows": 2,
        "relaxation_factor": 0.75,
        "scope": "main",
        "max_levels": 1000,
        "postsweeps": 2,
        "norm": "L2",
        "use_scalar_norm": 1,
        "cycle": "V",
        "store_res_history": 1,
        "solver_verbose": 1
    }
}

Jaberh commented 4 years ago

Hi Marat, I have been debugging this further, and I noticed that the c_value being passed here is wrong:
void AMG_Config::setNamedParameter(const string &name, const double &c_value, const std::string &current_scope, const std::string &new_scope, ParamDesc::iterator &param_desc_iter)
The params in parse_json_file are correct, so the problem is related either to json_parser.Parse<0>(params.c_str()) or to import_json_object(json_parser, true);

marsaev commented 4 years ago

Sorry, was away for a holiday.

I got your email address from the GitHub commit logs, but I think it is wrong, since I got a "not delivered" notification. You can get the binary here: https://drive.google.com/file/d/1qB1Q5SpqtsG54JVJrOst6S3lmFw56Lcd/view?usp=sharing

marsaev commented 4 years ago

So, the value in rapidjson::Value in import_json_object() is correct, but the actual value c_value that is passed to the setNamedParameter<double>() is wrong?

Jaberh commented 4 years ago

Here is the issue with RapidJSON.

Here is the buggy function in RapidJSON; I guess we have to debug third-party libs as well:

    inline double Pow10(int n) {
        static const double e[] = { // 1e-308...1e308: 617 * 8 bytes = 4936 bytes
            1e-308,1e-307,1e-306,1e-305,1e-304,1e-303,1e-302,1e-301,1e-300,
            1e-299,1e-298,1e-297,1e-296,1e-295,1e-294,1e-293,1e-292,1e-291,1e-290,1e-289,1e-288,1e-287,1e-286,1e-285,1e-284,1e-283,1e-282,1e-281,1e-280,
            1e-279,1e-278,1e-277,1e-276,1e-275,1e-274,1e-273,1e-272,1e-271,1e-270,1e-269,1e-268,1e-267,1e-266,1e-265,1e-264,1e-263,1e-262,1e-261,1e-260,
            1e-259,1e-258,1e-257,1e-256,1e-255,1e-254,1e-253,1e-252,1e-251,1e-250,1e-249,1e-248,1e-247,1e-246,1e-245,1e-244,1e-243,1e-242,1e-241,1e-240,
            1e-239,1e-238,1e-237,1e-236,1e-235,1e-234,1e-233,1e-232,1e-231,1e-230,1e-229,1e-228,1e-227,1e-226,1e-225,1e-224,1e-223,1e-222,1e-221,1e-220,
            1e-219,1e-218,1e-217,1e-216,1e-215,1e-214,1e-213,1e-212,1e-211,1e-210,1e-209,1e-208,1e-207,1e-206,1e-205,1e-204,1e-203,1e-202,1e-201,1e-200,
            1e-199,1e-198,1e-197,1e-196,1e-195,1e-194,1e-193,1e-192,1e-191,1e-190,1e-189,1e-188,1e-187,1e-186,1e-185,1e-184,1e-183,1e-182,1e-181,1e-180,
            1e-179,1e-178,1e-177,1e-176,1e-175,1e-174,1e-173,1e-172,1e-171,1e-170,1e-169,1e-168,1e-167,1e-166,1e-165,1e-164,1e-163,1e-162,1e-161,1e-160,
            1e-159,1e-158,1e-157,1e-156,1e-155,1e-154,1e-153,1e-152,1e-151,1e-150,1e-149,1e-148,1e-147,1e-146,1e-145,1e-144,1e-143,1e-142,1e-141,1e-140,
            1e-139,1e-138,1e-137,1e-136,1e-135,1e-134,1e-133,1e-132,1e-131,1e-130,1e-129,1e-128,1e-127,1e-126,1e-125,1e-124,1e-123,1e-122,1e-121,1e-120,
            1e-119,1e-118,1e-117,1e-116,1e-115,1e-114,1e-113,1e-112,1e-111,1e-110,1e-109,1e-108,1e-107,1e-106,1e-105,1e-104,1e-103,1e-102,1e-101,1e-100,
            1e-99, 1e-98, 1e-97, 1e-96, 1e-95, 1e-94, 1e-93, 1e-92, 1e-91, 1e-90, 1e-89, 1e-88, 1e-87, 1e-86, 1e-85, 1e-84, 1e-83, 1e-82, 1e-81, 1e-80,
            1e-79, 1e-78, 1e-77, 1e-76, 1e-75, 1e-74, 1e-73, 1e-72, 1e-71, 1e-70, 1e-69, 1e-68, 1e-67, 1e-66, 1e-65, 1e-64, 1e-63, 1e-62, 1e-61, 1e-60,
            1e-59, 1e-58, 1e-57, 1e-56, 1e-55, 1e-54, 1e-53, 1e-52, 1e-51, 1e-50, 1e-49, 1e-48, 1e-47, 1e-46, 1e-45, 1e-44, 1e-43, 1e-42, 1e-41, 1e-40,
            1e-39, 1e-38, 1e-37, 1e-36, 1e-35, 1e-34, 1e-33, 1e-32, 1e-31, 1e-30, 1e-29, 1e-28, 1e-27, 1e-26, 1e-25, 1e-24, 1e-23, 1e-22, 1e-21, 1e-20,
            1e-19, 1e-18, 1e-17, 1e-16, 1e-15, 1e-14, 1e-13, 1e-12, 1e-11, 1e-10, 1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1e+0,
            1e+1, 1e+2, 1e+3, 1e+4, 1e+5, 1e+6, 1e+7, 1e+8, 1e+9, 1e+10, 1e+11, 1e+12, 1e+13, 1e+14, 1e+15, 1e+16, 1e+17, 1e+18, 1e+19, 1e+20,
            1e+21, 1e+22, 1e+23, 1e+24, 1e+25, 1e+26, 1e+27, 1e+28, 1e+29, 1e+30, 1e+31, 1e+32, 1e+33, 1e+34, 1e+35, 1e+36, 1e+37, 1e+38, 1e+39, 1e+40,
            1e+41, 1e+42, 1e+43, 1e+44, 1e+45, 1e+46, 1e+47, 1e+48, 1e+49, 1e+50, 1e+51, 1e+52, 1e+53, 1e+54, 1e+55, 1e+56, 1e+57, 1e+58, 1e+59, 1e+60,
            1e+61, 1e+62, 1e+63, 1e+64, 1e+65, 1e+66, 1e+67, 1e+68, 1e+69, 1e+70, 1e+71, 1e+72, 1e+73, 1e+74, 1e+75, 1e+76, 1e+77, 1e+78, 1e+79, 1e+80,
            1e+81, 1e+82, 1e+83, 1e+84, 1e+85, 1e+86, 1e+87, 1e+88, 1e+89, 1e+90, 1e+91, 1e+92, 1e+93, 1e+94, 1e+95, 1e+96, 1e+97, 1e+98, 1e+99, 1e+100,
            1e+101,1e+102,1e+103,1e+104,1e+105,1e+106,1e+107,1e+108,1e+109,1e+110,1e+111,1e+112,1e+113,1e+114,1e+115,1e+116,1e+117,1e+118,1e+119,1e+120,
            1e+121,1e+122,1e+123,1e+124,1e+125,1e+126,1e+127,1e+128,1e+129,1e+130,1e+131,1e+132,1e+133,1e+134,1e+135,1e+136,1e+137,1e+138,1e+139,1e+140,
            1e+141,1e+142,1e+143,1e+144,1e+145,1e+146,1e+147,1e+148,1e+149,1e+150,1e+151,1e+152,1e+153,1e+154,1e+155,1e+156,1e+157,1e+158,1e+159,1e+160,
            1e+161,1e+162,1e+163,1e+164,1e+165,1e+166,1e+167,1e+168,1e+169,1e+170,1e+171,1e+172,1e+173,1e+174,1e+175,1e+176,1e+177,1e+178,1e+179,1e+180,
            1e+181,1e+182,1e+183,1e+184,1e+185,1e+186,1e+187,1e+188,1e+189,1e+190,1e+191,1e+192,1e+193,1e+194,1e+195,1e+196,1e+197,1e+198,1e+199,1e+200,
            1e+201,1e+202,1e+203,1e+204,1e+205,1e+206,1e+207,1e+208,1e+209,1e+210,1e+211,1e+212,1e+213,1e+214,1e+215,1e+216,1e+217,1e+218,1e+219,1e+220,
            1e+221,1e+222,1e+223,1e+224,1e+225,1e+226,1e+227,1e+228,1e+229,1e+230,1e+231,1e+232,1e+233,1e+234,1e+235,1e+236,1e+237,1e+238,1e+239,1e+240,
            1e+241,1e+242,1e+243,1e+244,1e+245,1e+246,1e+247,1e+248,1e+249,1e+250,1e+251,1e+252,1e+253,1e+254,1e+255,1e+256,1e+257,1e+258,1e+259,1e+260,
            1e+261,1e+262,1e+263,1e+264,1e+265,1e+266,1e+267,1e+268,1e+269,1e+270,1e+271,1e+272,1e+273,1e+274,1e+275,1e+276,1e+277,1e+278,1e+279,1e+280,
            1e+281,1e+282,1e+283,1e+284,1e+285,1e+286,1e+287,1e+288,1e+289,1e+290,1e+291,1e+292,1e+293,1e+294,1e+295,1e+296,1e+297,1e+298,1e+299,1e+300,
            1e+301,1e+302,1e+303,1e+304,1e+305,1e+306,1e+307,1e+308
        };
        RAPIDJSON_ASSERT(n <= 308);
        return n < -308 ? 0.0 : e[n + 308];
    }

Simply change this to pow(10, n) and it works.
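
For reference, the table-free variant described above could look roughly like the sketch below. This is an illustration, not the upstream RapidJSON code and not necessarily the change that ended up in AMGX: the plain `assert` stands in for `RAPIDJSON_ASSERT`, and the function keeps the original behaviour of returning 0.0 for exponents below -308.

```cpp
#include <cassert>
#include <cmath>

// Sketch of Pow10 without the static lookup table.
inline double Pow10(int n)
{
    assert(n <= 308);  // stand-in for RAPIDJSON_ASSERT(n <= 308)
    // Same underflow handling as the table version.
    return n < -308 ? 0.0 : std::pow(10.0, n);
}
```

One caveat: `std::pow` is not guaranteed to return the exactly rounded value for every exponent, so results may differ from the table literals in the last bit. For solver tolerances and relaxation factors that is unlikely to matter.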

marsaev commented 4 years ago

Was the N passed to that function wrong? I wonder why it worked in one case but not another.

Jaberh commented 4 years ago

No. The real code uses a lot of static global data, so I think at some point it runs out. I don't like the look-up table there; it is bad practice. N is passed correctly: look at the fix, it does not alter N.

marsaev commented 4 years ago

I'm glad that you were able to identify the issue. I'm still not sure about the real reason for what's happening, but I can agree that in our case the possible performance benefit of the lookup table is negligible and we can safely use pow. Since rapidjson is third-party code, let me clarify a few things about making changes and perform a few tests.

Jaberh commented 4 years ago

Thanks for the follow-up. I have one more issue to resolve: in certain cases some of my ranks have zero elements, which leads to a failure at matrix construction. What is the best way to go about this? I can generate a communicator that only includes ranks with a non-zero number of elements, which adds some collective call overhead, or I can pretend that a rank with zero elements has only one neighbor, namely itself. Since I don't know enough about what AMGX does under the hood, I would like to know your opinion on this. It happens frequently in our simulations due to lots of refinement/de-refinement. Also, does AMGX support ILU(k) as well (I see "ilu_sparsity_level")? I wonder if it has ILU by threshold as well.
Thank you for your help.

by the way the following might be better than good ole pow

    inline double Pow10(int n) {
        std::string tmp;
        tmp = "1e" + std::to_string(n);
        double ret = std::stod(tmp, nullptr);
        RAPIDJSON_ASSERT(n <= 308);
        RAPIDJSON_ASSERT(n > -308);
        return ret;
    }

Jaberh commented 4 years ago

One more question: is it possible to disable the printout of "Using Normal MPI (Hostbuffer) communicator..."? It is unnecessary for realistic big runs and just clutters the log file. Thanks again for your support.

marsaev commented 4 years ago

Yep, will move it to higher verbosity level.


Jaberh commented 4 years ago

Hi Marat, I had one more question from the previous comment. Most importantly, in certain cases some of my ranks have zero elements, which leads to a failure at matrix construction. What is the best way to go about this? I can generate a communicator that only includes ranks with a non-zero number of elements, which adds some collective call overhead, or I can pretend that a rank with zero elements has only one neighbor, namely itself. Since I don't know enough about what AMGX does under the hood, I would like to know your opinion on this. It happens frequently in our simulations due to lots of refinement/de-refinement. Can AMGX handle solving several disconnected graphs? In my example it deadlocks. I could get it to work by defining a new MPI communicator, though!

marsaev commented 4 years ago

We don't handle such cases specifically, so the result would be unpredictable. Are there a lot of such ranks? How often does the set of ranks with zero elements change? I would guess that the cumulative performance penalty, and where those penalties occur, would depend on the specifics of your problem and the way you call AMGX, but both approaches should work! If you need a working solution right now, it would probably be a good idea to try both suggestions from outside AMGX. But you are right: the more balanced the amount of data per GPU, the more throughput is achievable. That is one of the reasons why solving wells and reservoir together in a single matrix would be a bad idea. My wild guess is that no critical infrastructure code changes are needed to support that case, but some scoping is still needed to see what's going on. If support for this case is ever implemented, it would be transparent for the user: just provide 0 for the rank's number of elements.

Jaberh commented 4 years ago

I think the most robust way to handle this is via communicators. It happens a lot in internal combustion engine simulations, as the mesh size changes drastically between phases. Thanks for your feedback and support. We will soon try some realistic cases on a leadership-class cluster.
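
The bookkeeping behind the communicator approach can be sketched without MPI as follows. `compact_ranks` is a hypothetical helper, not part of AMGX: given the number of local rows on every rank, it assigns consecutive new rank indices to the non-empty ranks and marks empty ranks with -1.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: map each original rank to its index in a reduced
// communicator that contains only ranks owning at least one row.
// Ranks without rows are mapped to -1 and would skip the solve.
std::vector<int> compact_ranks(const std::vector<int> &rows_per_rank)
{
    std::vector<int> new_rank(rows_per_rank.size(), -1);
    int next = 0;
    for (std::size_t r = 0; r < rows_per_rank.size(); ++r)
    {
        if (rows_per_rank[r] > 0)
        {
            new_rank[r] = next++;
        }
    }
    return new_rank;
}
```

In an actual MPI code the same effect is usually obtained with `MPI_Comm_split`, passing `MPI_UNDEFINED` as the color on empty ranks so that they receive `MPI_COMM_NULL`. The resulting communicator would then be the one handed to `AMGX_resources_create`, and the neighbor lists would have to use the new rank numbering.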