barbagroup / AmgXWrapper

AmgXWrapper: An interface between PETSc and the NVIDIA AmgX library
MIT License

Unstable, but reproducible, behavior of classical AMG preconditioner #34

Open aminamooie opened 3 years ago

aminamooie commented 3 years ago

Hello! I have been struggling with AmgX library (through the very useful AmgXWrapper tool) for quite a time in order to solve for my system pressure. While it can be fast, it is seriously sensitive to different library versions (e.g., AmgX, CUDA-toolkit, nvidia driver, mpi), where I have gotten various errors within the library on the same code and problem size (and config file) depending on such versioning parameters, such as 1) "free(): double free detected in tcache 2", 2) "Thrust failure: parallel_for failed: cudaErrorMemoryAllocation: out of memory", 3) "Caught amgx exception: Cuda failure: 'out of memory'", and most importantly 4) " On entry to cusparseSpMV_bufferSize() parameter number 1 (handle) had an illegal value: bad initialization or already destroyed".**

In my experience, switching between CUDA 10.2.2 and 11+, as well as changing the problem size, has determined whether some of these errors appear or disappear! I now have to use CUDA 11+ since I am working with the new DGX Station A100 with the Ampere architecture, where on paper the problem sizes I am considering should be no issue at all (by a big margin), yet in practice they do turn out to be a problem! Right now I persistently get error 4 above beyond a certain problem size, and it happens somewhere 'during' my simulations: the A matrix and rhs vector change dynamically over the course of a run, and the solver works until it crashes with the error copied below:

ERROR:

AMGX version 2.2.0.132-opensource
Built on May 17 2021, 23:05:37
Compiled with CUDA Runtime 11.1, using CUDA driver 11.2
Cannot read file as JSON object, trying as AMGX config
Cannot read file as JSON object, trying as AMGX config
Converting config string to current config version
Parsing configuration string: exception_handling=1 ;
Using Normal MPI (Hostbuffer) communicator...
** On entry to cusparseSpMV_bufferSize() parameter number 1 (handle) had an illegal value: bad initialization or already destroyed

Caught amgx exception: CUSPARSE_STATUS_INVALID_VALUE at: /home/aminamooie/Software/AMGX2.2_cuda11_openmpi/AMGX-main/base/src/amgx_cusparse.cu:1016 Stack trace: /home/aminamooie/Software/AMGX2.2_cuda11_openmpi/build/lib/libamgxsh.so : void amgx::generic_SpMV<double, double, int>(cusparseContext, cusparseOperation_t, int, int, int, double const, double const, int const, int const, double const, double const, double, cudaDataType_t, cudaDataType_t)+0x2b0d /home/aminamooie/Software/AMGX2.2_cuda11_openmpi/build/lib/libamgxsh.so : amgx::Cusparse::bsrmv(cusparseContext, cusparseDirection_t, cusparseOperation_t, int, int, int, double const, cusparseMatDescr, double const, int const, int const, int const, int, double const, double const, double)+0xe8 /home/aminamooie/Software/AMGX2.2_cuda11_openmpi/build/lib/libamgxsh.so : void amgx::Cusparse::bsrmv_internal<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >(amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2>::VecPrec, amgx::Matrix<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> > const&, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> > const&, amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2>::VecPrec, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >&, amgx::ViewType, CUstream_st const&)+0x3c6 /home/aminamooie/Software/AMGX2.2_cuda11_openmpi/build/lib/libamgxsh.so : void amgx::Cusparse::bsrmv<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >(amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2>::VecPrec, amgx::Matrix<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >&, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >&, amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2>::VecPrec, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >&, amgx::ViewType)+0x153 /home/aminamooie/Software/AMGX2.2_cuda11_openmpi/build/lib/libamgxsh.so : amgx::Multiply_1x1<amgx::Matrix<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> > >::multiply_1x1(amgx::Matrix<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >&, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >&, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >&, amgx::ViewType)+0x39 /home/aminamooie/Software/AMGX2.2_cuda11_openmpi/build/lib/libamgxsh.so : void amgx::multiply<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >(amgx::Matrix<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, 
(AMGX_IndPrecision)2> >&, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >&, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >&, amgx::ViewType)+0x14f /home/aminamooie/Software/AMGX2.2_cuda11_openmpi/build/lib/libamgxsh.so : amgx::Solver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::compute_residual(amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> > const&, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >&)+0x5e /home/aminamooie/Software/AMGX2.2_cuda11_openmpi/build/lib/libamgxsh.so : amgx::Solver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::solve(amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >&, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >&, bool)+0x418 /home/aminamooie/Software/AMGX2.2_cuda11_openmpi/build/lib/libamgxsh.so : amgx::Solver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::solve_no_throw(amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >&, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >&, amgx::AMGX_STATUS&, bool)+0x85 /home/aminamooie/Software/AMGX2.2_cuda11_openmpi/build/lib/libamgxsh.so : amgx::AMG_Solver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::solve(amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >&, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >&, amgx::AMGX_STATUS&, bool)+0x41 /home/aminamooie/Software/AMGX2.2_cuda11_openmpi/build/lib/libamgxsh.so : amgx::AMGX_ERROR amgx::(anonymous namespace)::solve_with<(AMGX_Mode)8193, amgx::AMG_Solver, amgx::Vector>(AMGX_solver_handle_struct, AMGX_vector_handle_struct, AMGX_vector_handle_struct, amgx::Resources, bool)+0x594 /home/aminamooie/Software/AMGX2.2_cuda11_openmpi/build/lib/libamgxsh.so : AMGX_solver_solve()+0x430 ./teton_gpu : AmgXSolver::solve(double, double const*, int)+0x7a9 ./teton_gpu : PNM::linear_solver_petsc(std::vector<double, std::allocator > const&, std::vector<double, std::allocator > const&, std::vector<double, std::allocator >&, std::vector<unsigned int, std::allocator > const&, std::vector<int, std::allocator > const&, std::vector<int, std::allocator > const&, std::vector<unsigned int, std::allocator > const&, std::vector<unsigned int, std::allocator > const&, int, unsigned int, double&, int&)+0x7c2 ./teton_gpu : PNM::PressureSolver::solvePressureUnSteadyStatePetscCoInj(bool, std::pair<double, double>, std::pair<double, double>, std::pair<double, double>&, std::pair<double, double>&)+0x15eb ./teton_gpu : PNM::PNMOperation::findUSSPressField(bool, std::pair<double, double>, std::pair<double, double>)+0x110 ./teton_gpu : PNM::PNMOperation::convergePressField()+0x539 ./teton_gpu : PNM::WeakDynSimulation::run()+0xeaf 
./teton_gpu : PNM::Simulation::execute()+0x3e ./teton_gpu : Application::exec()+0x9e3 ./teton_gpu : main()+0x1b7 /lib/x86_64-linux-gnu/libc.so.6 : __libc_start_main()+0xf3 ./teton_gpu : _start()+0x2e

AMGX ERROR: file /home/aminamooie/Software/AMGX2.2_cuda11_openmpi/AMGX-main/base/src/amgx_c.cu line 2799
AMGX ERROR: CUDA kernel launch error.

For the same problem size this error did not happen on my older Turing-based workstation with CUDA 10! It also did not happen on the DGX Station when I ran the whole thing under valgrind! The error is delayed when I choose the HMIS selector instead of PMIS, for instance, and the minimum problem size at which it appears increases when I include 'aggressive levels' as in "FGMRES_CLASSICAL_AGGRESSIVE_PMIS.json" from the AmgX library's configs directory. Ultimately, opting for the aggregation-based preconditioner (i.e., AmgX_SolverOptions_AGG.info from the AmgXWrapper repo) seems to completely eliminate the error up to the biggest problem size I have! That is great news for me, but it is about 3-4 times slower than the classical method -- not my preference if it can be avoided.

I was able to reproduce the error by saving the matrix and vector as PETSc binary files (the inputs to AmgXSolver's setA(A) and solve(lhs, rhs)) right before the crash and feeding them to the solveFromFiles example of the AmgXWrapper project. This stand-alone solver gives the same illegal-handle error on both the DGX Station and the old workstation. The strange thing is that if I run it with a different number of GPUs, it works (which suggests to me, at least at first glance, that the matrix assembly and prior steps must have been fine). This goes the other way too: if I use 1 rank and 1 GPU in my simulation and save the matrix before the crash, it unexpectedly works in the stand-alone solver when run in parallel (and, as expected, fails when run with the original configuration). I just can't make sense of such irrational behavior: how can the handle get destroyed all of a sudden during the simulation, and why does this not happen when using the aggregation method or when changing runtime settings like the number of ranks/GPUs?! This error also happens when I use the CSR format directly (without any PETSc A assembly [@mattmartineau ]).
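For reference, this is roughly how I dump the system right before the crash so it can be replayed with solveFromFiles (a minimal sketch only; the dumpSystem helper and file names are illustrative, not my exact code):

#include <petscmat.h>
#include <petscvec.h>

// Write the current A and rhs in PETSc binary format so they can be replayed
// later with the solveFromFiles example.
void dumpSystem(Mat A, Vec rhs, const char *matFile, const char *rhsFile)
{
    PetscViewer viewer;

    // matrix snapshot, e.g. matFile = "A_32_2gpus.dat"
    PetscViewerBinaryOpen(PETSC_COMM_WORLD, matFile, FILE_MODE_WRITE, &viewer);
    MatView(A, viewer);
    PetscViewerDestroy(&viewer);

    // right-hand-side snapshot, e.g. rhsFile = "rhs_32_2gpus.dat"
    PetscViewerBinaryOpen(PETSC_COMM_WORLD, rhsFile, FILE_MODE_WRITE, &viewer);
    VecView(rhs, viewer);
    PetscViewerDestroy(&viewer);
}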

My library settings: OpenMPI 4.1.1 (CUDA-aware), gcc 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04), CUDA compilation tools release 11.1, V11.1.105, driver version 460.73.01, PETSc 3.15, AmgX 2.2.0, AmgXWrapper v1.5 (latest).

The attached files were generated on my workstation with 2x GeForce RTX 2080 Ti, an AMD Ryzen Threadripper 3970X 32-core processor, and 128 GB RAM. I have obtained similar results on the DGX Station.

The files A_32_2gpus.dat and rhs_32_2gpus.dat were generated by the simulation with 32 MPI ranks and 2 visible GPUs: they crash similarly within the solveFromFiles example (the config file is also attached for completeness). A typical runtime command is:

CUDA_VISIBLE_DEVICES=0,1 mpirun -n 32 ./solveFromFiles -caseName amin -mode AmgX_GPU -cfgFileName ../configs/AmgX_SolverOptions_Classical.info -matrixFileName A_32_2gpus.dat -rhsFileName rhs_32_2gpus.dat -exactFileName rhs_32_2gpus.dat -Nruns 0

Interestingly, changing the above to have CUDA_VISIBLE_DEVICES=1 (instead of '0,1') will make the solver work!

The files A_32_1gpu1.dat and rhs_32_1gpu1.dat were generated with 32 MPI ranks and 1 visible GPU: they crash similarly within the solveFromFiles example with 'CUDA_VISIBLE_DEVICES=1 [or 0, of course] mpirun -n 32' in the runtime command, but they do work with CUDA_VISIBLE_DEVICES=0,1 and mpirun -n 2 through 4 (and not with 8 and above -- strange!).

The files with 'early' in their names come from the same simulation, far before the crash happens, and they work in the stand-alone solver regardless of the MPI rank and GPU count configuration.

I really want this to work for us, and any help and insights are really appreciated.

The link to the attachment: https://uwy-my.sharepoint.com/:u:/g/personal/aamooie_uwyo_edu/EbnkiFHgb-xKqaWDfpWercgBGNAkyHeIN8-PFuYiNDyBmQ

piyueh commented 3 years ago

I have tested the files with two GPUs (the A100 40GB version) using solveFromFiles. Here are some observations:

  1. A_32_2gpus.dat
    1. Using both GPUs
      • 2, 4 MPI processes: worked fine
      • 8, 16, 32 MPI processes: gave free(): double free detected in tcache 2 error
    2. Using only one GPU (CUDA_VISIBLE_DEVICES=0 or 1)
      • 1, 2, 4, 8, 16, 32 MPI processes: all gave free(): double free detected in tcache 2 error
  2. A_32_1gpu1.dat.
    1. Using both GPUs
      • 2, 4, 16, 32 MPI processes: worked fine
      • 8 MPI processes: gave free(): double free detected in tcache 2 error
    2. Using only one GPU (CUDA_VISIBLE_DEVICES=0 or 1)
      • 1, 2, 4, 8, 16, 32 MPI processes: all worked fine

I haven't done any debugging yet, but here are some observations and thoughts:

  1. I didn't get the cusparseSpMV_bufferSize() error (the error message 4 in your list).
  2. When a run throws the free(): double free detected in tcache 2 error, the solver has actually already solved the matrix system. That is, when running with -Nruns greater than 0, you can see that the solves do happen. The free() error occurs while the program is terminating and destroying data. Practically speaking, this should be fine in most situations because the calculations are done; the program just does not terminate properly.
  3. The poisson example works fine. I'm not sure what this means, but I believe it must mean something.
  4. As you mentioned, the free() error is just one of the errors you encountered. My initial gut feeling was that these errors might be independent of each other and that solving the free() error wouldn't solve the cusparseSpMV_bufferSize() error. Now I think they may be related, but I can't reproduce the cusparseSpMV_bufferSize() error, so I'm just guessing.
  5. The other errors look like regular out-of-memory errors (error messages 2 and 3 in your list). If they are indeed just regular out-of-memory situations, then they are not bugs. When you can reproduce errors 2 and 3, maybe try to monitor the runtime memory usage? If the same problem size fits on your old GPU but runs out of memory on the newer one, maybe some unexpected jobs were using the GPUs?

As the cusparseSpMV_bufferSize() error cannot be reproduced on my side for now, I will see if I can debug the free() error.

aminamooie commented 3 years ago

Thanks a lot, Pi-Yueh, for your swift response! Unfortunately, we are getting dissimilar results. A few notes: those files were generated on the old workstation; I just re-tested them on the DGX myself to make sure. The A_32_2gpus.dat case did crash with error message 4 using 'CUDA_VISIBLE_DEVICES=0,1 mpirun -n 32', but it did work with both 'CUDA_VISIBLE_DEVICES=1 mpirun -n 32' and, surprisingly, 'CUDA_VISIBLE_DEVICES=0,1,2,3 mpirun -n 4'!

I want to note that I used to get that double-free error a while ago myself and frankly cannot pinpoint a particular thing that made it go away (hence my overall impression that this is overly sensitive to external factors!). But I can say it is highly important that you replicate my library versions/settings to replicate the error (CUDA, driver, MPI, etc., as listed previously). One more note: I changed the CMake file in /AmgXWrapper-master/AmgXWrapper-master/example/solveFromFiles to 'CMAKE_BUILD_TYPE DEBUG' instead of RELEASE (line 35), though I'm pretty sure I was getting the same error message 4 in release mode as well.

Lastly, I am attaching another series of files, this time generated earlier on the A100 with only 1 MPI rank; maybe you can try those as well (given your own GPU type). They crash in the standalone solver with mpirun -n 1 but work with n > 1 and more than 1 GPU.

Of course, for me increasing the -Nruns doesn't help (given the error that I do get).

Regarding the Poisson problem, did you mean that it works in general (irrespective of my inputs)? If so, even solveFromFiles works with the inputs I provided from 'early' time steps (regardless of runtime settings). It is almost as if the solver cares about the history or origin of the files saved right before the crash -- under what operational settings they were generated -- even though they all come from the same code platform!

I would still appreciate any further insights from you. Regards, Amin

FromDGX.zip

piyueh commented 3 years ago

Just an update of some more test results:

  1. Using the data from FromDGX.zip did not give me any error at all. However, if I change the selector in the solver configuration file from HMIS to PMIS, I get the double-free error. Still, I did not get error message 4 with any configuration.

  2. Using a V100 GPU and CUDA 10.2 to compile the code, everything worked fine. No error at all. Unfortunately, CUDA 10.2 does not seem to support the A100, so I couldn't test A100 + CUDA 10.2.

  3. Using CUDA 11.0 and 11.1, no matter whether the GPU is a V100 or an A100, I got the double-free error message.

Other dependencies: OpenMPI 4.1.1, PETSc 3.15, AmgX 2.2.0, AmgXWrapper (both v1.5 and the latest git commit), gcc 7.5, driver version 450.119.04 for the V100 and 450.102.04 for the A100.

I couldn't match the gcc and driver versions because I don't have control over those machines. However, from the current test results, it looks like the key is the CUDA version (or rather, the cuSparse version).

aminamooie commented 3 years ago

Some of these behaviors now look similar to what I described in my original, rather lengthy, post:

1) "The error gets delayed when I choose HMIS selector instead of PMIS for instance". That was why I provided the config file I was using. This is because for some reason the config file that exists for the poisson problem is based on PMIS and the one in the solveFromFiles is based on HMIS. I believe PMIS should be used for Classical method based on the AmgX reference (more compatibility I guess). That said, maybe you already have done this, but that would be helpful if you could repeat the tests on the A_32_2gpus.dat case with PMIS (and not the seemingly more stable HMIS) to see if you receive the double-free error when using 1 GPU (because my error goes away when doing that and appears when using 2 GPUs -- your original reply showed an almost opposite behavior).

2) It was my own experience as well, as mentioned originally, that switching between CUDA versions makes a world of difference (and that CUDA 10 was better at times). This is honestly part of my frustration so far: the sheer amount of sensitivity to software and hardware architectures.

3) Maybe I could try to match your gcc and driver versions.

4) After everything so far, I believe there may indeed be some similarity between my error and yours. I used to get the double-free error too, and my driver version was exactly 450! I'm not sure that's the real difference here, but I think there are connections. That said, how would one go about debugging this anyway?! I have put extensive time into it without success. I even did a memory-leak check with valgrind on the solveFromFiles example and saw all four kinds of memory leak (including definitely lost and possibly lost memory)! But I tried the same exercise with one of the PETSc examples and got memory leaks there too (which was doubly surprising). After researching that, Barry from the PETSc community said in one of his replies that a lot of those are OS-level issues that valgrind is not happy with. So I guess that doesn't help us here. Anyhow, my hope was/is to get some help here. I also wasn't sure whether I should have posted this in the AmgX GitHub repo too, since the way of reproducing it is entirely tied to the AmgXWrapper repo.

aminamooie commented 3 years ago

Also, as a follow-up on my last comment above, I just found a copy of an error message similar to yours that I was getting before. Looking at the highlighted parts, it seems amgx::handle_signals is involved, which reminds me of my current error (illegal handle...).

Here it is:

Using Normal MPI (Hostbuffer) communicator...

free(): double free detected in tcache 2

Caught signal 6 - SIGABRT (abort)

/home/aminamooie/Software/AmgX2.2_cuda11.0/build/libamgxsh.so : amgx::handle_signals(int)+0x1e3

/lib/x86_64-linux-gnu/libpthread.so.0 : ()+0x153c0

/lib/x86_64-linux-gnu/libc.so.6 : gsignal()+0xcb

/lib/x86_64-linux-gnu/libc.so.6 : abort()+0x12b

/lib/x86_64-linux-gnu/libc.so.6 : ()+0x903ee

/lib/x86_64-linux-gnu/libc.so.6 : ()+0x9847c

/lib/x86_64-linux-gnu/libc.so.6 : ()+0x9a0ed

/usr/local/cuda/lib64/libcusparse.so.11 : cusparseDestroy()+0x35

/home/aminamooie/Software/AmgX2.2_cuda11.0/build/libamgxsh.so : amgx::CSR_Multiply_Impl<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::cusparse_multiply(amgx::Matrix<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> > const&, amgx::Matrix<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> > const&, amgx::Matrix<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >&, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)2, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)2, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)2, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)2, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >)+0x702f

/home/aminamooie/Software/AmgX2.2_cuda11.0/build/libamgxsh.so : amgx::CSR_Multiply_Impl<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::multiply(amgx::Matrix<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> > const&, amgx::Matrix<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> > const&, amgx::Matrix<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >&, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)2, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)2, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)2, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)2, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >)+0xa47

/home/aminamooie/Software/AmgX2.2_cuda11.0/build/libamgxsh.so : amgx::CSR_Multiply_Impl<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::galerkin_product(amgx::Matrix<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> > const&, amgx::Matrix<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> > const&, amgx::Matrix<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> > const&, amgx::Matrix<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >&, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)2, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)2, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)2, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)2, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)2, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)2, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >)+0x109

/home/aminamooie/Software/AmgX2.2_cuda11.0/build/libamgxsh.so : amgx::CSR_Multiply<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::csr_galerkin_product(amgx::Matrix<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> > const&, amgx::Matrix<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> > const&, amgx::Matrix<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> > const&, amgx::Matrix<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >&, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)2, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)2, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)2, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)2, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)2, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >, amgx::Vector<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)2, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >, void*)+0x611

/home/aminamooie/Software/AmgX2.2_cuda11.0/build/libamgxsh.so : amgx::classical::Classical_AMG_Level<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::computeAOperator_1x1()+0x5b6

/home/aminamooie/Software/AmgX2.2_cuda11.0/build/libamgxsh.so : amgx::classical::Classical_AMG_Level_Base<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::computeAOperator()+0x55

/home/aminamooie/Software/AmgX2.2_cuda11.0/build/libamgxsh.so : amgx::classical::Classical_AMG_Level_Base<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::createCoarseMatrices()+0x215

/home/aminamooie/Software/AmgX2.2_cuda11.0/build/libamgxsh.so : amgx::AMG_Level<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> > amgx::AMG_Setup<(AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2>::setup<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >(amgx::AMG<(AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2>, amgx::AMG_Level<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >*&, int, bool)+0x25d

/home/aminamooie/Software/AmgX2.2_cuda11.0/build/libamgxsh.so : void amgx::AMG_Setup<(AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2>::setup<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2>, (AMGX_MemorySpace)1, (AMGX_MemorySpace)0>(amgx::AMG<(AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2>*, amgx::Matrix<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >&)+0xef

/home/aminamooie/Software/AmgX2.2_cuda11.0/build/libamgxsh.so : amgx::AMG<(AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2>::setup(amgx::Matrix<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >&)+0xeb

/home/aminamooie/Software/AmgX2.2_cuda11.0/build/libamgxsh.so : amgx::AlgebraicMultigrid_Solver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::solver_setup(bool)+0x67

/home/aminamooie/Software/AmgX2.2_cuda11.0/build/libamgxsh.so : amgx::Solver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::setup(amgx::Operator<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >&, bool)+0x1f3

/home/aminamooie/Software/AmgX2.2_cuda11.0/build/libamgxsh.so : amgx::PCG_Solver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::solver_setup(bool)+0x187

/home/aminamooie/Software/AmgX2.2_cuda11.0/build/libamgxsh.so : amgx::Solver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::setup(amgx::Operator<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >&, bool)+0x1f3

/home/aminamooie/Software/AmgX2.2_cuda11.0/build/libamgxsh.so : amgx::Solver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::setup_no_throw(amgx::Operator<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >&, bool)+0x80

/home/aminamooie/Software/AmgX2.2_cuda11.0/build/libamgxsh.so : amgx::AMG_Solver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::setup(amgx::Matrix<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >&)+0x60

/home/aminamooie/Software/AmgX2.2_cuda11.0/build/libamgxsh.so : amgx::AMGX_ERROR amgx::(anonymous namespace)::set_solver_with_shared<(AMGX_Mode)8193, amgx::AMG_Solver, amgx::Matrix>(AMGX_solver_handle_struct, AMGX_matrix_handle_struct, amgx::Resources, amgx::AMGX_ERROR (amgx::AMG_Solver<amgx::TemplateMode<(AMGX_Mode)8193>::Type>::)(std::shared_ptr<amgx::Matrix<amgx::TemplateMode<(AMGX_Mode)8193>::Type> >))+0x3eb

/home/aminamooie/Software/AmgX2.2_cuda11.0/build/libamgxsh.so : AMGX_solver_setup()+0x474

./teton_gpu : AmgXSolver::setA(int, int, int, int const, int const, double const, int const)+0x22b

./teton_gpu : PNM::linear_solver_petsc(std::vector<double, std::allocator > const&, std::vector<double, std::allocator > const&, std::vector<double, std::allocator >&, std::vector<unsigned int, std::allocator > const&, std::vector<int, std::allocator > const&, std::vector<int, std::allocator > const&, std::vector<unsigned int, std::allocator > const&, std::vector<unsigned int, std::allocator > const&, int, unsigned int, double&, int&)+0x5a4

./teton_gpu : PNM::PressureSolver::solvePressureUnSteadyStatePetscCoInj(bool, std::pair<double, double>, std::pair<double, double>, std::pair<double, double>&, std::pair<double, double>&)+0x13f6

./teton_gpu : PNM::PNMOperation::findUSSPressField(bool, std::pair<double, double>, std::pair<double, double>)+0x110

./teton_gpu : PNM::PNMOperation::convergePressField()+0x539

./teton_gpu : PNM::WeakDynSimulation::run()+0xe9d

./teton_gpu : PNM::Simulation::execute()+0x3e

./teton_gpu : Application::exec()+0x9e1

./teton_gpu : main()+0x1b3

/lib/x86_64-linux-gnu/libc.so.6 : __libc_start_main()+0xf3

./teton_gpu : _start()+0x2e

piyueh commented 3 years ago

The final message is the same (i.e., the double-free error), but the two come from different locations.

Your double-free error comes from AMGX's sparse-matrix multiplication. I think the error probably happens when the code executes this line: https://github.com/NVIDIA/AMGX/blob/77f91a94c05edbf58349bad447bbface7207c2b4/base/src/csr_multiply.cu#L506

However, for solveFromFiles, the double-free error comes from the AMGX Resources destructor: https://github.com/NVIDIA/AMGX/blob/77f91a94c05edbf58349bad447bbface7207c2b4/base/src/resources.cu#L167

The major difference is that your double-free error happens during the simulation, so it crashes the simulation. The double-free error from solveFromFiles happens at the end of the program, so it doesn't affect the solve.

My wild guess (i.e., no proof) is that something changed in cuSparse 11.0 and AMGX did not change its code accordingly -- probably something related to creating a cuSparse handle. In older cuSparse versions, each created handle was a brand-new handle, whereas in cuSparse 11.0 creating a handle may just hand back a reference/pointer to an existing one. But this is just my wild guess.

I agree this is frustrating. I think the problem comes from AMGX or even cuSparse. But to reproduce it we have to use AmgXWrapper because the data come from PETSc. To open an issue on AMGX's repo, I think we first have to find a way to prove that the issue is not from AmgXWrapper -- or, even better, provide a reproducible case that needs neither PETSc nor AmgXWrapper. This is what I'm trying to do now.

piyueh commented 3 years ago

Update: AMGX uses a different cusparse_multiply for CUDA 10.2 and CUDA 11.1. When compiling with CUDA 10.2, the cusparse_multiply used is defined at: https://github.com/NVIDIA/AMGX/blob/77f91a94c05edbf58349bad447bbface7207c2b4/base/src/csr_multiply.cu#L513-L602

When compiling with CUDA >= 11.0, the cusparse_multiply used is defined at: https://github.com/NVIDIA/AMGX/blob/77f91a94c05edbf58349bad447bbface7207c2b4/base/src/csr_multiply.cu#L408-L509

You can see that the version of cusparse_multiply used with CUDA 10.2 does not destroy the cuSparse handle at the end of the function, while the version for CUDA >= 11.0 does destroy the handle.
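To illustrate the hazard I have in mind (a toy sketch only, not AMGX code): if two parts of the code end up holding the same cuSPARSE handle and one of them destroys it early, the other later touches an already-destroyed handle, which is consistent with both the double-free abort and the "bad initialization or already destroyed" message from cusparseSpMV_bufferSize().

#include <cusparse.h>
#include <cstdio>

int main()
{
    cusparseHandle_t handle;
    cusparseCreate(&handle);

    // Imagine another component keeps a copy of the same handle.
    cusparseHandle_t alias = handle;

    // One code path tears the handle down early, like the CUDA >= 11 branch of
    // cusparse_multiply does at the end of the function.
    cusparseDestroy(handle);

    // The other owner later touches the same, already-destroyed handle.
    // This is deliberately wrong: depending on the cuSPARSE version it returns
    // an error status, aborts with a double free, or corrupts allocator state.
    cusparseStatus_t st = cusparseDestroy(alias);
    std::printf("second destroy returned %d\n", (int)st);
    return 0;
}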

aminamooie commented 3 years ago

These are some wonderful insights!! It makes a lot of sense! I know this was not meant to be a solution, but I removed -DCUSPARSE_USE_GENERIC_SPGEMM from the AmgX CMakeLists under IF(CUDA_VERSION_MAJOR MATCHES 11), and that temporarily fixed a thing or two in the standalone solve (where it had been crashing for me with the illegal-handle error). However, running the full simulation with this change still resulted in a new crash, as follows:

Thrust failure: transform: failed to synchronize: cudaErrorInvalidAddressSpace: operation not supported on global/shared address space
File and line number are not available for this exception.
AMGX ERROR: file /home/aminamooie/AMGX-2.2/AMGX-main/base/src/amgx_c.cu line   2733
AMGX ERROR: Thrust failure.

I completely agree with you that this should now be brought to the AmgX developers' attention in a sensible way. It would of course be more than appreciated if you could achieve that, but I will try my best too. Please keep me posted as you have been! Thanks a lot.

aminamooie commented 3 years ago

One side thought: if you remember, with the aggregation method everything went fine across all problem sizes. And I believe that if you try that config, even your error will disappear (?). But I can't understand how this is possible if cuSparse and AMGX presumably have that handle problem -- in other words, how does it not affect the aggregation method? My own naive guess so far is that whatever makes the solver 'slower' makes it more robust (superficially at least), like using valgrind, HMIS, or aggressive levels; aggregation is indeed the slowest and thereby much less prone to failure!

piyueh commented 3 years ago

Update: I got error message 4 with CUDA 11.3.0 and A_32_2gpus.dat.

piyueh commented 3 years ago

@aminamooie I was able to create a MatrixMarket file from A_32_2gpus.dat and feed it to AMGX's examples amgx_capi and amgx_mpi_capi. I got error message 4 with CUDA 11.3.0. I created an issue at AMGX: https://github.com/NVIDIA/AMGX/issues/148
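For anyone who wants to repeat the conversion, it is roughly the following (a minimal sketch, assuming PETSc's PETSC_VIEWER_ASCII_MATRIXMARKET output format; run it with a single MPI rank, and the output file name is just illustrative):

#include <petscmat.h>

int main(int argc, char **argv)
{
    PetscInitialize(&argc, &argv, nullptr, nullptr);

    Mat A;
    PetscViewer in, out;

    // Load the matrix from the PETSc binary file attached to this issue.
    PetscViewerBinaryOpen(PETSC_COMM_WORLD, "A_32_2gpus.dat", FILE_MODE_READ, &in);
    MatCreate(PETSC_COMM_WORLD, &A);
    MatLoad(A, in);
    PetscViewerDestroy(&in);

    // Re-export it as a MatrixMarket ASCII file that amgx_capi can read.
    PetscViewerASCIIOpen(PETSC_COMM_WORLD, "A_32_2gpus.mtx", &out);
    PetscViewerPushFormat(out, PETSC_VIEWER_ASCII_MATRIXMARKET);
    MatView(A, out);
    PetscViewerPopFormat(out);
    PetscViewerDestroy(&out);

    MatDestroy(&A);
    PetscFinalize();
    return 0;
}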

aminamooie commented 3 years ago

That is wonderful progress! I was planning to do exactly that and was hoping for this exact finding. I'm so glad you did it yourself, and I appreciate your time and effort. Let's see what happens. Fingers crossed.