Closed pledac closed 1 week ago
The issue seems to come from the cuSPARSE library, because with CUDA > 11.2 but using only the 11.2 version of libcusparse, it works. So putting libcusparse.so.11.3.1.68 (as libcusparse.so.11) alongside libamgxsh.so in the same directory is a quick fix for me for now, as I really need Classical AMG.
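For reference, a sketch of this pinning workaround as shell commands. All paths here are hypothetical and depend on where CUDA 11.2 and AMGX are installed on your system:

```shell
# Sketch of the libcusparse pinning workaround -- /opt/cuda-11.2 and
# /path/to/amgx/lib are placeholder paths, adjust to your installation.

# Option 1: copy the CUDA 11.2 cuSPARSE next to the AMGX shared library,
# under the soname the dynamic loader resolves:
cp /opt/cuda-11.2/lib64/libcusparse.so.11.3.1.68 /path/to/amgx/lib/libcusparse.so.11

# Option 2: make the loader prefer the 11.2 library without copying anything:
export LD_LIBRARY_PATH=/opt/cuda-11.2/lib64:$LD_LIBRARY_PATH

# Verify which cuSPARSE the AMGX library actually resolves:
ldd /path/to/amgx/lib/libamgxsh.so | grep cusparse
```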
Did you happen to try PMIS instead of HMIS?
I would be interested in hearing about your use case. Would you be happy to start a private email thread?
Thanks Matt for the reply, I'm ready to discuss privately and share my use case. In the meantime, I will try PMIS with Classical AMG. I forgot to say that I tried 2.2.0, 2.3.0 and the very latest main AmgX version, without managing to fix this issue.
Did you happen to try PMIS instead of HMIS?
PMIS behaves the same as HMIS for this issue.
Unhappily, I can't reproduce the issue with AmgX
You mean that solving the same matrix with the same solver configuration using one of the examples yields a different result? First, it would be great to confirm that the matrix is partitioned as expected in the way you upload it to AMGX (i.e. compared to the AMGX example). At the same time, it would be great to try to simplify the solver config to the point where the result matches the output of the version with cuSPARSE 11.2 - this will help narrow down where something might have gone wrong (e.g. reduce the number of levels to 2, try changing smoothers or other solver parameters).
I mean that: a) the issue (C-AMG with multi-GPU on CUDA > 11.2) happens in my code with every kind of matrix; b) I can't reproduce the issue when providing a matrix with the same config to solveFromFiles (the AmgXWrapper tool), which calls AmgX.
-> So I guess something is slightly different in the way my code is built with AmgX/CUDA > 11.2 compared to the build of AmgX or AmgXWrapper alone, as neither AmgX nor AmgXWrapper shows the same issue in my tests. So I have changed the title to clarify.
I will try to reduce the solver config, thanks.
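As a starting point for that reduction, here is a sketch of a stripped-down config, assuming the usual CG + Classical AMG setup from this thread. Parameter names follow the style of the AMGX sample configs; the exact values (tolerance, iteration counts) are placeholders to adjust:

```json
{
    "config_version": 2,
    "solver": {
        "preconditioner": {
            "solver": "AMG",
            "algorithm": "CLASSICAL",
            "selector": "PMIS",
            "smoother": "BLOCK_JACOBI",
            "max_levels": 2,
            "cycle": "V",
            "presweeps": 1,
            "postsweeps": 1
        },
        "solver": "PCG",
        "max_iters": 100,
        "tolerance": 1e-8,
        "monitor_residual": 1,
        "print_solve_stats": 1
    }
}
```

Capping max_levels at 2 isolates the coarsening/redistribution step; from there, swapping the selector or smoother one at a time should show which component diverges between the two cuSPARSE versions.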
What alarms me is that, if I understand correctly, changing the cuSPARSE library changes the behaviour.
I can't reproduce the issue when providing a matrix with the same config to solveFromFiles (the AmgXWrapper tool), which calls AmgX.
Other than the config, it's important to match how the matrices are distributed across ranks - this likely triggers different code paths. Do you use AmgXWrapper in your code too? What API do you use to upload the matrix to the GPU?
What alarms me is that, if I understand correctly, changing the cuSPARSE library changes the behaviour.
Yes, using the CUDA 11.2 cuSPARSE (via LD_LIBRARY_PATH) CHANGES the behaviour in my case, I confirm. It is my only solution for the moment.
I can't reproduce the issue when providing a matrix with the same config to solveFromFiles (the AmgXWrapper tool), which calls AmgX.
Other than the config, it's important to match how the matrices are distributed across ranks - this likely triggers different code paths. Do you use AmgXWrapper in your code too? What API do you use to upload the matrix to the GPU?
Yes, I am using AmgXWrapper in my code with the new API from Matt to upload the matrix in CSR format:
...
// Extract raw CSR arrays from the PETSc objects
petscToCSR(MatricePetsc_, SolutionPetsc_, SecondMembrePetsc_);
// Upload this rank's local CSR block to AmgX
SolveurAmgX_.setA(nRowsGlobal, nRowsLocal, nNz, rowOffsets, colIndices, values, nullptr);
...
SolveurAmgX_.solve(lhs, rhs, nRowsLocal);
To be sure, I just ran the Poisson AmgXWrapper test with the AmgX CSR API again, on 2 GPUs with CUDA 11.4 and my config file (with Classical AMG), and it works fine.
So there is something wrong in my code which only causes an issue for cuSPARSE > 11.2... It has been bugging me for more than a year :-(
I will think about your sentence: "it's important to match how the matrices are distributed across ranks - this likely triggers different code paths".
Just to note - it's possible that the cuSPARSE internal implementation also changed. If it introduced a regression or a bug, it would be great to try to catch it.
Only two things from cuSPARSE are really used in AMGX (and only conditionally): SpMV and SpMM.
To remove SpMV from suspicion - if it's impossible to export the matrix - one thing you can try is comparing a standalone SpMV on your matrix specifically, using the AMGX_matrix_vector_multiply API (example here: https://github.com/NVIDIA/AMGX/blob/main/examples/amgx_spmv_test.c, but replace https://github.com/NVIDIA/AMGX/blob/main/examples/amgx_spmv_test.c#L264 with CLASSICAL to build classical solver-like redistribution). There is no interface to that function in AmgXWrapper, but it can be added there.
Then you can compare CUDA <= 11.2 vs CUDA > 11.2. If the results differ, then it's something we can work on further with the cuSPARSE team.
If everything seems normal, we can think about what could be wrong with SpMM, but debugging that will need more effort.
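The A/B comparison could look something like the sketch below. The CUDA paths, the launcher, and the example's command-line arguments are all assumptions to adapt to your setup:

```shell
# Sketch of the suggested SpMV A/B comparison -- paths, launcher and
# arguments are hypothetical; adapt to your installation and matrix file.

# Run the AMGX SpMV example against the CUDA 11.2 cuSPARSE...
LD_LIBRARY_PATH=/opt/cuda-11.2/lib64:$LD_LIBRARY_PATH \
    mpirun -n 2 ./amgx_spmv_test > spmv_cuda112.log

# ...and against the newer cuSPARSE that shows the problem.
LD_LIBRARY_PATH=/opt/cuda-11.4/lib64:$LD_LIBRARY_PATH \
    mpirun -n 2 ./amgx_spmv_test > spmv_cuda114.log

# A difference here points at SpMV; identical output shifts suspicion to SpMM.
diff spmv_cuda112.log spmv_cuda114.log
```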
Thanks, I will have a look and experiment with amgx_spmv_test.c as soon as I get some time, and will report back.
Just to say, the issue has gone away in my code. And I can't say why :-)
We have been using AmgX (through AmgXWrapper) for 2 years now, but we are facing an annoying issue. Our code runs fine with a CG solver and an Aggregated or Classical AMG preconditioner, in parallel with one or more GPUs.
But with CUDA versions > 11.2, whereas Aggregated AMG still works, Classical AMG fails to converge with n GPUs (n > 1):
Here is the config file used:
Unhappily, I can't reproduce the issue with the AmgX or AmgXWrapper samples.
Has anyone else noticed this issue?
Thanks