b-grimaud opened this issue 2 years ago
I don't see much in that error to tell us what's happening. Can you post a little more for me to take a look at?
Also, the windows special case for gpu is handled here: https://github.com/bodono/scs-python/blob/master/setup.py#L206
First, is the `CUDA_PATH` env variable set? If so, you should make sure that the include and lib directories are laid out as SCS expects (otherwise you can edit the code to point to the right place; if it's a general fix I would be happy to accept a PR).
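For instance, a quick sketch to check that layout from Python (the `include` and `lib/x64` subdirectory names below follow the conventional Windows CUDA toolkit layout, not necessarily the exact paths setup.py uses):

```python
import os

def cuda_layout(cuda_path):
    """Report which of the expected CUDA subdirectories exist.

    'include' and 'lib/x64' follow the usual Windows CUDA toolkit layout;
    check scs-python's setup.py for the exact paths it expects.
    """
    if cuda_path is None:
        return {}
    return {
        sub: os.path.isdir(os.path.join(cuda_path, *sub.split("/")))
        for sub in ("include", "lib/x64")
    }

print(cuda_layout(os.environ.get("CUDA_PATH")))
```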
By the way, it is often the case that the GPU version is not actually faster than the vanilla direct version, so bear that in mind.
So, the dirty way to fix this was to point directly to an environment install of CUDA, as `CUDA_PATH` otherwise points to the regular Windows install. I don't know if there's a way to make it work otherwise, because both `os.environ['CUDA_PATH']` and `os.getenv('CUDA_PATH')` point to `C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v7.5`.
I manually edited `include_dirs` and `library_dirs` to point to `'C:\\Users\\_user_\\anaconda3\\envs\\_myenv_\\include'` and `'C:\\Users\\_user_\\anaconda3\\envs\\_myenv_\\Lib\\x64'`.
I then encountered a `fatal error LNK1158: cannot run 'rc.exe'` error, which I solved by following this, and that was good enough to get SCS installed in this environment.
Now, trying to solve with `gpu=True`, I first encountered `NotImplementedError: GPU direct solver not yet available, pass use_indirect=True`, which was indeed solved by passing that argument.
And now I get `ImportError: DLL load failed while importing _scs_gpu: The specified module could not be found.` There is no further info on which DLL might be missing, but the install might not be as complete as I thought. Solving on CPU works just fine otherwise.
I assume this is still related to the compiling process, but I can move it to a new issue if needed.
I will definitely benchmark CPU and GPU performance once it is working; I'll keep you updated.
It sounds like you got the paths right for the install, so you probably need to add the paths where the CUDA binaries live to the PATH variable (or whatever the equivalent is for Windows), e.g.
set PATH=C:\Users\_user_\anaconda3\envs\_myenv_\Lib\x64;%PATH%
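One caveat worth noting: on Python 3.8+ for Windows, extension-module DLL dependencies are no longer resolved through `PATH`, so if editing `PATH` alone doesn't help, the directory may need to be registered explicitly before importing. A sketch, reusing the hypothetical path above:

```python
import os
import sys

# Hypothetical CUDA DLL location; adjust to your environment.
CUDA_DLL_DIR = r"C:\Users\_user_\anaconda3\envs\_myenv_\Lib\x64"

def register_dll_dir(path):
    """On Windows (Python 3.8+), add a directory to the DLL search path.

    Returns the handle from os.add_dll_directory, or None when not
    applicable (non-Windows platform, or the directory is missing).
    """
    if sys.platform == "win32" and os.path.isdir(path):
        return os.add_dll_directory(path)
    return None

handle = register_dll_dir(CUDA_DLL_DIR)
# then: import scs  # the _scs_gpu extension can now locate its CUDA DLLs
```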
That did the trick! I added `C:\Users\_user_\anaconda3\envs\_myenv_\Lib\x64` to the (USER, not SYSTEM) PATH variable.
The solver now works on some problems, but crashes on others:
===============================================================================
CVXPY
v1.1.17
===============================================================================
(CVXPY) Dec 03 08:52:03 AM: Your problem has 500 variables, 1 constraints, and 0 parameters.
(CVXPY) Dec 03 08:52:03 AM: It is compliant with the following grammars: DCP, DQCP
(CVXPY) Dec 03 08:52:03 AM: (If you need to solve this problem multiple times, but with different data, consider using parameters.)
(CVXPY) Dec 03 08:52:03 AM: CVXPY will first compile your problem; then, it will invoke a numerical solver to obtain a solution.
-------------------------------------------------------------------------------
Compilation
-------------------------------------------------------------------------------
(CVXPY) Dec 03 08:52:03 AM: Compiling problem (target solver=SCS).
(CVXPY) Dec 03 08:52:03 AM: Reduction chain: Dcp2Cone -> CvxAttr2Constr -> ConeMatrixStuffing -> SCS
(CVXPY) Dec 03 08:52:03 AM: Applying reduction Dcp2Cone
(CVXPY) Dec 03 08:52:03 AM: Applying reduction CvxAttr2Constr
(CVXPY) Dec 03 08:52:03 AM: Applying reduction ConeMatrixStuffing
(CVXPY) Dec 03 08:52:03 AM: Applying reduction SCS
(CVXPY) Dec 03 08:52:03 AM: Finished problem compilation (took 3.290e-02 seconds).
-------------------------------------------------------------------------------
Numerical solver
-------------------------------------------------------------------------------
(CVXPY) Dec 03 08:52:03 AM: Invoking solver SCS to obtain a solution.
------------------------------------------------------------------
SCS v3.0.0 - Splitting Conic Solver
(c) Brendan O'Donoghue, Stanford University, 2012
------------------------------------------------------------------
problem: variables n: 998, constraints m: 1497
cones: l: linear vars: 993
q: soc vars: 504, qsize: 2
settings: eps_abs: 1.0e-05, eps_rel: 1.0e-05, eps_infeas: 1.0e-07
alpha: 1.50, scale: 1.00e-01, adaptive_scale: 1
max_iters: 10000, normalize: 1, warm_start: 0
acceleration_lookback: 0, acceleration_interval: 0
lin-sys: sparse-indirect GPU
nnz(A): 4473, nnz(P): 0
** On entry to cusparseCreate(): CUDA context cannot be initialized
** On entry to cusparseCreateCsr() parameter number 5 (csrRowOffsets) had an illegal value: NULL pointer
** On entry to cusparseCreateDnVec() parameter number 3 (values) had an illegal value: NULL pointer
** On entry to cusparseCreateDnVec() parameter number 3 (values) had an illegal value: NULL pointer
** On entry to cusparseCreateDnVec() parameter number 3 (values) had an illegal value: NULL pointer
** On entry to cusparseCreateCsr() parameter number 5 (csrRowOffsets) had an illegal value: NULL pointer
scs/linsys/gpu/indirect\private.c:346:scs_init_lin_sys_work
ERROR_CUDA (*): an illegal memory access was encountered
ERROR: init_lin_sys_work failure
Failure:could not initialize work
I recursively try to solve problems of varying complexity, using data of varying sizes, so I would understand if GPU support is better suited to solving single, larger problems. I also recently found out about the `warm_start` argument in CVXPY; I don't know if that could apply here?
Great, glad it's (kind of) working for you!
It sounds like the solver has a GPU memory leak if this is happening after some number of solves. Is it easy enough to send me the script that runs this?
Sure! The script itself involves several modules, but the actual problem-solving part is as follows:
def acceleration_minimization_norm1(measure, sigma0, px, nn=0):
    """
    Parameters
    ----------
    measure : array (n, 2)
        experimental data (noisy)
    sigma0 : int
        estimated precision of localization (in nanometers)
    px : int
        pixel size (in micrometers)
    nn : int, optional
        number of data points discarded at the extremities of the solution

    Returns
    -------
    solution : array (n-2*nn, 2)
        filtered solution minimizing the norm 1 of the acceleration, with the
        difference between measured data and solution no greater than the
        theoretical noise.
    """
    measure = px * measure
    n = len(measure)
    variable = cp.Variable((n, 2))
    objective = cp.Minimize(
        cp.atoms.norm1(variable[2:, 0] + variable[:-2, 0] - 2 * variable[1:-1, 0])
        + cp.atoms.norm1(variable[2:, 1] + variable[:-2, 1] - 2 * variable[1:-1, 1])
    )
    constraints = [cp.atoms.norm(variable - measure, 'fro') ** 2 <= n * sigma0 ** 2 * 10 ** -6]
    prob = cp.Problem(objective, constraints)
    prob.solve(solver='SCS', verbose=True, gpu=True, use_indirect=True, max_iters=10000)
    solution = variable.value
    if nn == 0:
        return solution
    return solution[nn:n - nn]
That function is then called as part of another module that loops over CSV files containing data. Thanks a lot for the help!
I've profiled the code quite deeply now and I don't see a memory leak anywhere. Does it always crash on the same problem? Does it crash if the problem is called outside of the loop?
Also, could you cd into the `scs` directory (inside `scs-python`) and tell me what commit hash you are at, using `git log`?
Here's the full output of `git log`:
commit 807a79e6a36079d11da4db1dff54aeb56b1beb21 (HEAD -> master, tag: 3.0.0, origin/master, origin/HEAD)
Author: Brendan O'Donoghue <bodonoghue85@gmail.com>
Date: Sun Oct 10 13:56:44 2021 +0100
pull in gpu fixes
commit 6a7bfb43307efcae40c75a713b95a8fd93136ba2
Author: Brendan O'Donoghue <bodonoghue85@gmail.com>
Date: Sun Oct 3 23:58:15 2021 +0100
fix seg fault
commit a52260ee635977dd59653eed35e610a46db55bd3
Author: Brendan O'Donoghue <bodonoghue85@gmail.com>
Date: Sun Oct 3 23:43:44 2021 +0100
update to latest scs
commit da854af76e04d0dcb7a56de876e1451c0749d968
Author: Brendan O'Donoghue <bodonoghue85@gmail.com>
Date: Sun Oct 3 13:55:10 2021 +0100
update badge link
commit b9972fe8400e6cdee5e85be251ac8029102db2ec
Author: Brendan O'Donoghue <bodonoghue85@gmail.com>
Date: Sun Oct 3 13:52:35 2021 +0100
update to latest scs
Running problems independently, outside of the loop, seems to prevent such crashes. Some of them did cause similar crashes after solving the problem (not before) if I let them run for a very high number of iterations, but I haven't been able to reliably reproduce this situation.
Looking at the verbose output in detail, it doesn't look like the problems are actually getting solved: some metrics stay the same even if I push the maximum iterations and let it run for a while, and CVXPY always ends up declaring the solution "unbounded".
------------------------------------------------------------------
iter | pri res | dua res | gap | obj | scale | time (s)
------------------------------------------------------------------
0| 3.16e+04 4.93e+02 1.50e+06 -8.08e+05 1.00e-01 2.22e-01
250| 8.59e+16 0.00e+00 3.47e-01 1.73e-01 1.00e+06 6.51e-01
500| 8.59e+16 0.00e+00 5.20e-01 2.60e-01 1.00e+06 1.07e+00
750| 8.59e+16 0.00e+00 3.87e-01 1.94e-01 1.00e+06 1.50e+00
1000| 8.59e+16 0.00e+00 3.97e-01 1.98e-01 1.00e+06 1.92e+00
|
90000| 8.59e+16 0.00e+00 5.56e-01 2.78e-01 1.00e+06 1.52e+02
90250| 8.59e+16 0.00e+00 4.44e-01 2.22e-01 1.00e+06 1.52e+02
90500| 8.59e+16 0.00e+00 5.06e-01 2.53e-01 1.00e+06 1.53e+02
|
99000| 8.59e+16 0.00e+00 4.35e-01 2.18e-01 1.00e+06 1.65e+02
99250| 8.59e+16 0.00e+00 5.24e-01 2.62e-01 1.00e+06 1.66e+02
99500| 8.59e+16 0.00e+00 2.32e-02 1.16e-02 1.00e+06 1.66e+02
99750| 8.59e+16 0.00e+00 5.52e-01 2.76e-01 1.00e+06 1.66e+02
100000| 8.59e+16 0.00e+00 1.59e-01 -7.94e-02 1.00e+06 1.67e+02
------------------------------------------------------------------
status: unbounded (inaccurate - reached max_iters)
timings: total: 1.68e+02s = setup: 9.60e-01s + solve: 1.67e+02s
lin-sys: 1.02e+02s, cones: 1.95e+01s, accel: 0.00e+00s
------------------------------------------------------------------
objective = -inf (inaccurate)
------------------------------------------------------------------
Or, on some problems:
-------------------------------------------------------------------------------
Summary
-------------------------------------------------------------------------------
(CVXPY) Dec 15 05:27:32 PM: Problem status: optimal_inaccurate
(CVXPY) Dec 15 05:27:32 PM: Optimal value: 0.000e+00
(CVXPY) Dec 15 05:27:32 PM: Compilation took 3.780e-02 seconds
(CVXPY) Dec 15 05:27:32 PM: Solver (including time spent in interface) took 4.099e+02 seconds
The problems that result in "unbounded" rather than just "inaccurate" are also noticeably slower to reach the same number of iterations.
Thanks for sending this. I ran your code for about a week continuously on randomly generated data on my own GPU machine and was unable to reproduce this. However, examining your output, it looks like the data types are getting confused, e.g. the GPU is expecting a particular integer or floating-point width and is being passed something different.
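A way to inspect those widths from the Python side (a sketch; which widths the compiled GPU extension actually assumes depends on the build flags):

```python
import numpy as np
import scipy.sparse as sp

# SCS takes the constraint matrix in CSC form; the extension is compiled
# for fixed float and integer-index widths. If the CUDA build expects,
# say, 32-bit indices and gets 64-bit ones, the GPU reads garbage.
A = sp.random(6, 4, density=0.5, format="csc", dtype=np.float64)
print("data dtype:   ", A.data.dtype)     # float width handed to SCS
print("indices dtype:", A.indices.dtype)  # index width must match the build
print("indptr dtype: ", A.indptr.dtype)
```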
For one of these problem instances where it takes a very long time to solve, could you pass the argument `write_data_filename=tmp` to the solver and then email me the dumped `tmp` file? It contains all the data that SCS needs to solve the problem.
I'll try running it with different data on my side and see how it goes!
I'll send the `tmp` file by mail; I let it run for 10,000 iterations to shorten the wait time.
Running the data you sent me using my GPU I get:
Reading data from /usr/local/google/home/bodonoghue/Downloads/tmp
------------------------------------------------------------------
SCS v3.0.0 - Splitting Conic Solver
(c) Brendan O'Donoghue, Stanford University, 2012
------------------------------------------------------------------
problem: variables n: 41586, constraints m: 62379
cones: l: linear vars: 41581
q: soc vars: 20798, qsize: 2
settings: eps_abs: 1.0e-05, eps_rel: 1.0e-05, eps_infeas: 1.0e-07
alpha: 1.50, scale: 1.00e-01, adaptive_scale: 1
max_iters: 10000, normalize: 1, warm_start: 0
acceleration_lookback: 10, acceleration_interval: 10
lin-sys: sparse-indirect GPU
nnz(A): 187119, nnz(P): 0
------------------------------------------------------------------
iter | pri res | dua res | gap | obj | scale | time (s)
------------------------------------------------------------------
0| 1.73e+02 1.00e+00 6.58e+05 -3.29e+05 1.00e-01 2.82e-01
250| 1.20e+00 3.74e-02 3.95e+01 2.24e+02 1.00e-01 8.04e+00
500| 5.62e-01 1.82e-02 2.11e+01 3.01e+02 1.00e-01 1.61e+01
750| 3.60e-01 4.69e-03 7.87e+00 3.08e+02 1.00e-01 2.41e+01
1000| 2.68e-01 3.36e-03 1.25e+01 3.14e+02 1.00e-01 3.18e+01
1250| 2.13e-01 1.91e-03 1.54e+01 3.19e+02 1.00e-01 3.98e+01
1500| 1.75e-01 1.63e-03 1.07e+01 3.27e+02 1.00e-01 4.83e+01
1750| 1.52e-01 1.55e-03 1.08e+01 3.30e+02 1.00e-01 5.66e+01
2000| 2.38e-01 1.53e-02 6.32e+00 3.36e+02 1.00e-01 6.48e+01
2250| 1.14e-01 4.21e-03 3.65e+00 3.39e+02 1.00e-01 7.29e+01
2500| 1.03e-01 1.07e-03 1.12e+01 3.37e+02 1.00e-01 8.11e+01
2750| 9.39e-02 9.07e-04 7.41e+00 3.40e+02 1.00e-01 8.92e+01
3000| 9.06e-02 9.33e-04 7.44e+00 3.40e+02 1.00e-01 9.72e+01
3250| 8.05e-02 9.27e-04 7.37e+00 3.41e+02 1.00e-01 1.05e+02
3500| 6.97e-02 9.29e-04 3.62e+00 3.44e+02 1.00e-01 1.14e+02
3750| 6.31e-02 6.35e-04 3.77e+00 3.45e+02 1.00e-01 1.22e+02
4000| 5.75e-02 6.22e-04 3.42e+00 3.46e+02 1.00e-01 1.30e+02
4250| 5.13e-02 5.53e-04 4.41e+00 3.46e+02 1.00e-01 1.39e+02
4500| 4.67e-02 4.88e-04 3.52e+00 3.47e+02 1.00e-01 1.47e+02
4750| 4.29e-02 4.99e-04 2.90e+00 3.48e+02 1.00e-01 1.55e+02
5000| 3.91e-02 5.19e-04 3.93e+00 3.47e+02 1.00e-01 1.64e+02
5250| 3.57e-02 4.60e-04 2.54e+00 3.49e+02 1.00e-01 1.72e+02
5500| 8.90e-02 5.66e-03 2.69e+00 3.49e+02 1.00e-01 1.80e+02
5750| 2.93e-02 1.90e-03 2.50e+00 3.49e+02 1.00e-01 1.89e+02
6000| 2.69e-02 1.34e-03 2.74e+00 3.49e+02 1.00e-01 1.97e+02
6250| 2.49e-02 5.10e-04 1.95e+00 3.50e+02 1.00e-01 2.05e+02
6500| 2.32e-02 2.78e-04 7.21e-01 3.50e+02 1.00e-01 2.13e+02
6750| 2.33e-01 1.49e-02 1.38e+00 3.50e+02 1.00e-01 2.21e+02
7000| 1.98e-02 2.44e-04 8.66e-01 3.51e+02 1.00e-01 2.30e+02
7250| 1.86e-02 2.20e-04 8.51e-01 3.51e+02 1.00e-01 2.38e+02
7500| 1.72e-02 2.37e-04 1.76e+00 3.50e+02 1.00e-01 2.47e+02
7750| 1.61e-02 1.76e-04 1.55e+00 3.50e+02 1.00e-01 2.55e+02
8000| 1.48e-02 1.65e-04 1.33e+00 3.51e+02 1.00e-01 2.63e+02
8250| 1.39e-02 1.57e-04 6.87e-01 3.51e+02 1.00e-01 2.71e+02
8500| 1.30e-02 1.61e-04 6.60e-01 3.51e+02 1.00e-01 2.79e+02
8750| 1.28e-02 1.92e-04 9.15e-01 3.51e+02 1.00e-01 2.87e+02
9000| 1.23e-02 1.61e-04 3.27e-01 3.51e+02 1.00e-01 2.95e+02
9250| 1.16e-02 1.59e-04 1.04e+00 3.51e+02 1.00e-01 3.03e+02
9500| 1.08e-02 1.40e-04 7.23e-01 3.51e+02 1.00e-01 3.11e+02
9750| 1.01e-02 1.42e-04 8.37e-01 3.51e+02 1.00e-01 3.20e+02
10000| 9.33e-03 1.40e-04 6.07e-01 3.51e+02 1.00e-01 3.28e+02
------------------------------------------------------------------
status: solved (inaccurate - reached max_iters)
timings: total: 3.29e+02s = setup: 7.37e-01s + solve: 3.28e+02s
lin-sys: 3.06e+02s, cones: 4.48e+00s, accel: 3.43e+00s
------------------------------------------------------------------
objective = 351.298981 (inaccurate)
------------------------------------------------------------------
In other words, it's clearly different from what you're getting and appears to be working correctly. My guess is that something is wrong in the types we assume CUDA is using, but only for some versions of CUDA or some GPUs; see a similar issue here: https://github.com/bodono/scs-python/issues/54.
I would recommend you stick to the CPU direct version for now.
Specifications
Description
I am trying to make use of a GPU to speed up SCS, but unfortunately the GPU-equipped machine I have access to is shared, and I have to install it on Windows.
It seems that Visual Studio C++ and the Windows 10 SDK are required to compile, but apparently that doesn't work. The only answer I could find related to that suggested removing Visual Studio entirely, which, unsurprisingly, doesn't work. Building from source without any options (`python setup.py install`) seems to work, so the issue might be GPU related.
How to reproduce
As instructed in the docs:
Additional information
I understand that SCS probably hasn't been tested or used on Windows that much, especially for GPU use. I am asking in case someone did manage to compile from source, with GPU, outside of Linux. The environment I'm using is Python 3.8.12 with `cudatoolkit 10.1.243` and `cudnn 7.6.5`, installed as part of `tensorflow-gpu`. CUDA works fine with ML uses in that environment.
Output
The entire output is (very) verbose, but here's the final part:
error: Command "C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -DPYTHON -DCTRLC=1 -DCOPYAMATRIX -DGPU_TRANSPOSE_MAT=1 -DPY_GPU -DINDIRECT=1 -Iscs/include -Iscs/linsys -IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v7.5/include -Iscs/linsys/gpu/ -Iscs/linsys/gpu/indirect -IC:\Users\M T\anaconda3\envs\scs_gpu\lib\site-packages\numpy\core\include -IC:\Users\M T\anaconda3\envs\scs_gpu\include -IC:\Users\M T\anaconda3\envs\scs_gpu\include -IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\INCLUDE -IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\ATLMFC\INCLUDE -IC:\Program Files (x86)\Windows Kits\10\include\10.0.20348.0\ucrt -IC:\Program Files (x86)\Windows Kits\10\include\10.0.20348.0\shared -IC:\Program Files (x86)\Windows Kits\10\include\10.0.20348.0\um -IC:\Program Files (x86)\Windows Kits\10\include\10.0.20348.0\winrt -IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\INCLUDE -IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\ATLMFC\INCLUDE -IC:\Program Files (x86)\Windows Kits\10\include\10.0.20348.0\ucrt -IC:\Program Files (x86)\Windows Kits\10\include\10.0.20348.0\shared -IC:\Program Files (x86)\Windows Kits\10\include\10.0.20348.0\um -IC:\Program Files (x86)\Windows Kits\10\include\10.0.20348.0\winrt /Tcscs/linsys/gpu\gpu.c /Fobuild\temp.win-amd64-3.8\Release\scs/linsys/gpu\gpu.obj -O3" failed with exit status 2