cvxgrp / scs

Splitting Conic Solver
MIT License

Expected speed Question #250

Closed. caglarari closed this issue 1 year ago.

caglarari commented 1 year ago

Hi, I was reading the long SCS paper https://web.stanford.edu/~boyd/papers/pdf/scs_long.pdf and on page 14, it says SCS direct took 21.9 sec for a problem with p=10000 variables & q=2000 measurements.

On page 13, it says "All the experiments were carried out on a system with 32 2.2GHz cores and 512Gb of RAM, running Linux. (The single-threaded versions, of course, do not make use of the multiple cores.)"

As far as I understand SCS direct is single-threaded.

Yet on my old laptop (Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz, 16.0 GB RAM, Windows 10), solving the same problem via cvxpy or using SCS directly (single-threaded) takes at least 289 sec (and more than 400 sec for bigger lambda).

What am I missing? Why is there such a speed difference (22 sec vs. 289 sec)?

Here is the code I used for cvxpy:

import cvxpy as cp
import numpy as np

def generate_data(m=100, n=20, sigma=5, density=0.2):
    """Generates data matrix X and observations Y."""
    np.random.seed(1)
    beta_star = np.random.randn(n)
    idxs = np.random.choice(range(n), int((1 - density) * n), replace=False)
    for idx in idxs:
        beta_star[idx] = 0
    X = np.random.randn(m, n)
    Y = X.dot(beta_star) + np.random.normal(0, sigma, size=m)
    return X, Y, beta_star

m = 2000
n = 10000
sigma = 5
density = 0.2

Xlas, Ylas, _ = generate_data(m, n, sigma)
Xlas_train = Xlas
Ylas_train = Ylas

beta = cp.Variable(n)
lambd = cp.Parameter(nonneg=True)
obj = cp.norm2(Xlas_train @ beta - Ylas_train)**2 + lambd * cp.norm1(beta)

problem = cp.Problem(cp.Minimize(obj))
lambd.value = 1.0
rets = problem.solve(solver=cp.SCS, verbose=True)

Here are the cvxpy & SCS outputs:

                                 CVXPY                                     
                                 v1.3.1                                    

===============================================================================
(CVXPY) Apr 10 01:57:38 AM: Your problem has 10000 variables, 0 constraints, and 1 parameters.
(CVXPY) Apr 10 01:57:38 AM: It is compliant with the following grammars: DCP, DQCP
(CVXPY) Apr 10 01:57:38 AM: CVXPY will first compile your problem; then, it will invoke a numerical solver to obtain a solution.

                              Compilation                                  

(CVXPY) Apr 10 01:57:38 AM: Compiling problem (target solver=SCS).
(CVXPY) Apr 10 01:57:38 AM: Reduction chain: Dcp2Cone -> CvxAttr2Constr -> ConeMatrixStuffing -> SCS
(CVXPY) Apr 10 01:57:38 AM: Applying reduction Dcp2Cone
(CVXPY) Apr 10 01:57:38 AM: Applying reduction CvxAttr2Constr
(CVXPY) Apr 10 01:57:38 AM: Applying reduction ConeMatrixStuffing
(CVXPY) Apr 10 01:57:49 AM: Applying reduction SCS
(CVXPY) Apr 10 01:57:55 AM: Finished problem compilation (took 1.676e+01 seconds).
(CVXPY) Apr 10 01:57:55 AM: (Subsequent compilations of this problem, using the same arguments, should take less time.)

                            Numerical solver                               

(CVXPY) Apr 10 01:57:55 AM: Invoking solver SCS to obtain a solution.

       SCS v3.2.3 - Splitting Conic Solver
(c) Brendan O'Donoghue, Stanford University, 2012

problem:  variables n: 20001, constraints m: 22001
cones:    l: linear vars: 20000
          q: soc vars: 2001, qsize: 1
settings: eps_abs: 1.0e-05, eps_rel: 1.0e-05, eps_infeas: 1.0e-07
          alpha: 1.50, scale: 1.00e-01, adaptive_scale: 1
          max_iters: 100000, normalize: 1, rho_x: 1.00e-06
          acceleration_lookback: 10, acceleration_interval: 10
lin-sys:  sparse-direct-amd-qdldl
          nnz(A): 20040001, nnz(P): 1

iter | pri res | dua res | gap | obj | scale | time (s)

 0| 1.30e+03  1.00e+02  1.19e+09 -5.95e+08  1.00e-01  3.54e+01 

  250| 1.19e-01  3.35e-01  1.38e-01  1.09e+05  5.66e+00  1.04e+02
  500| 2.41e-02  5.32e-02  1.09e-01  1.09e+05  5.66e+00  1.20e+02
  750| 8.54e-02  4.21e-01  2.31e-01  1.09e+05  5.66e+00  1.35e+02
 1000| 5.91e-03  1.40e-02  2.02e-02  1.09e+05  5.66e+00  1.50e+02
 1250| 4.61e-03  8.46e-03  3.02e-02  1.09e+05  5.66e+00  1.64e+02
 1500| 2.91e-03  6.22e-03  9.72e-03  1.09e+05  5.66e+00  1.80e+02
 1750| 2.14e-03  4.57e-03  1.10e-02  1.09e+05  5.66e+00  1.95e+02
 2000| 1.58e-03  3.99e-03  5.45e-03  1.09e+05  5.66e+00  2.10e+02
 2250| 1.41e-03  2.75e-03  5.20e-03  1.09e+05  5.66e+00  2.25e+02
 2500| 1.08e-03  2.28e-03  4.50e-03  1.09e+05  5.66e+00  2.40e+02
 2750| 7.56e-04  1.60e-03  3.76e-04  1.09e+05  5.66e+00  2.55e+02
 3000| 5.41e-04  1.27e-03  1.71e-03  1.09e+05  5.66e+00  2.69e+02
 3250| 4.90e-04  1.02e-03  1.47e-03  1.09e+05  5.66e+00  2.84e+02
 3325| 4.61e-04  1.01e-03  2.18e-03  1.09e+05  5.66e+00  2.89e+02

status:  solved
timings: total: 2.89e+02s = setup: 3.52e+01s + solve: 2.53e+02s
         lin-sys: 1.91e+02s, cones: 4.05e-01s, accel: 6.92e-01s

objective = 108698.330314


                                Summary                                    

(CVXPY) Apr 10 02:02:44 AM: Problem status: optimal
(CVXPY) Apr 10 02:02:44 AM: Optimal value: 1.087e+05
(CVXPY) Apr 10 02:02:44 AM: Compilation took 1.676e+01 seconds
(CVXPY) Apr 10 02:02:44 AM: Solver (including time spent in interface) took 2.889e+02 seconds

Thanks, Caglar


bodono commented 1 year ago

The main thing is probably the number of non-zeros: the example in the paper has 3.8e6, but yours has 20e6, about 5x more, which will make it at least 5x slower, and probably more like 10x slower since the sparsity is unstructured. By the way, all the examples in the paper were run in MATLAB and are available here.
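One quick way to verify that count is to inspect the stuffed conic data cvxpy hands to SCS. A minimal sketch, reusing the `problem` object from the Python code above (get_problem_data compiles the problem without solving it):

import cvxpy as cp

# data["A"] is the stuffed constraint matrix (a SciPy sparse matrix),
# so its .nnz is the nnz(A) figure printed in the SCS header.
data, chain, inverse_data = problem.get_problem_data(cp.SCS)
print(data["A"].shape, data["A"].nnz)  # nnz(A): 20040001 in the log above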

caglarari commented 1 year ago

Thanks a lot.

I have observed another speed-difference issue. When I use SCS directly, it is significantly slower than using SCS via cvx. I've observed the same thing with cvxpy too: when I tested SCS via cvxpy using the SCS example codes, it was significantly faster than using SCS directly.

For instance, for l1-regularized logistic regression with 100 features & 10000 samples, SCS takes only 3.76 secs via cvx, but 16.5 secs (SCS direct) and 19.2 secs (SCS indirect) when I use it directly. Similarly, for 1000 features & 100000 samples, SCS via cvx takes 460 secs, while without cvx it takes 2220 secs (SCS direct) and 3050 secs (SCS indirect).

Could you tell me what causes this difference in speed & how it can be fixed?

Regards, Caglar

Here is the MATLAB code:

save_results = false;
run_cvx = true;
cvx_use_solver = 'scs';
run_scs_direct = true;
run_scs_indirect = true;

sizes = [100 10000; 1000 100000];
sze_str{1} = 'small';
sze_str{2} = 'med';
density = 0.1;

for i=1:size(sizes,1)
disp(sprintf('Solving %s l1 regularized logistic regression.',sze_str{i}))
clearvars -except i sizes sze_str direct_data indirect_data save_results density run_scs_direct run_scs_indirect run_cvx cvx_use_solver

str = ['data/l1logreg_' sze_str{i}];
randn('seed',sum(str));rand('seed',sum(str))

p = sizes(i,1); % features
q = sizes(i,2); % total samples
w_true = sprandn(p,1,0.2);

X_tmp = 3*sprandn(p, q, density);
ips = -w_true'*X_tmp;
ps = (exp(ips)./(1 + exp(ips)))';
labels = 2*(rand(q,1) < ps) - 1;

X_pos = X_tmp(:,labels==1);
X_neg = X_tmp(:,labels==-1);

X = [X_pos -X_neg]; % include labels with data

lam = min(0.01*norm(X*ones(q,1),'inf')/2, 10); % too large gives bad results

clear X_tmp ips ps labels;
%%
c = [zeros(p,1);lam*ones(p,1);ones(q,1);zeros(q,1);zeros(q,1)];
b = [zeros(p,1);zeros(p,1);ones(q,1)];

Anz = nnz(X) + 6*q + 4*p;

%At = zeros(2*p + 3*q,2*p + q + 6*q);
At = sparse([],[],[],2*p + 3*q,2*p + q + 6*q,Anz);
At(:,1:2*p+q) = [speye(p) -speye(p) sparse(p,q) sparse(p,q) sparse(p,q);
    -speye(p) -speye(p) sparse(p,q) sparse(p,q) sparse(p,q);
    sparse(q,p) sparse(q,p) sparse(q,q) speye(q) speye(q)]';
idx = 2*p+q;
for j=1:q
    b = [b;[0;1;0]];
    M1 = sparse(q,3);
    M1(j,1) = 1;
    M2 = sparse(q,3);
    M2(j,3) = -1;
    At(:,idx+1:idx+3) = [sparse(p,3); sparse(p,3); M1; M2; sparse(q,3)];
    idx = idx + 3;
end
for j=1:q
    b = [b;[0;1;0]];
    M1 = sparse(q,3);
    M1(j,1) = 1;
    M2 = sparse(q,3);
    M2(j,3) = -1;
    At(:,idx+1:idx+3) = [[-X(:,j) sparse(p,2)]; sparse(p,3); M1 ; sparse(q,3); M2];
    idx = idx + 3;
end
A = sparse(At');
data.A = A;
data.b = b;
data.c = c;

K.f = 0;   % note: SCS >= 3 renames this cone field to K.z ('f' is deprecated)
K.l = p+p+q;
K.q = [];
K.s = [];
K.ep = 2*q;
K.ed = 0;

params.verbose = 1;
params.scale = 5;
params.cg_rate = 1.5;

%write_scs_data_sparse(data,K,params,str)

if (run_scs_direct)
    if (save_results);
        [direct_data.output{i}, xd,yd,sd,infod] = evalc('scs_direct(data,K,params);');
        direct_data.output{i}
    else
        [xd,yd,sd,infod]=scs_direct(data,K,params);
    end
    direct_data.x{i} = xd;
    direct_data.y{i} = yd;
    direct_data.s{i} = sd;
    direct_data.info{i} = infod;
    if (save_results);
        save('data/l1logreg_direct', 'direct_data');
    end
end

if (run_scs_indirect)
    if (save_results);
        [indirect_data.output{i},xi,yi,si,infoi] = evalc('scs_indirect(data,K,params);');
        indirect_data.output{i}
    else
        [xi,yi,si,infoi]=scs_indirect(data,K,params);
    end
    indirect_data.x{i} = xi;
    indirect_data.y{i} = yi;
    indirect_data.s{i} = si;
    indirect_data.info{i} = infoi;
    if (save_results);
        save('data/l1logreg_indirect', 'indirect_data');
    end
end

if run_cvx
    try
        tic
        cvx_begin
        cvx_solver(cvx_use_solver)
        variable w(p)
        minimize(sum(log_sum_exp([zeros(1,q);w'*full(X)])) + lam * norm(w,1))
        if (save_results)
            cvx.output{i} = evalc('cvx_end')
        else
            cvx_end
        end
        toc

    catch err
        err
        cvx.err{i} = err;
    end

    if (save_results); save(sprintf('data/l1logreg_cvx_%s', cvx_use_solver), 'cvx'); end;

end

end
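(A note on the cone counts in this script, stated as my reading of the construction above rather than anything from the thread: K.ep = 2*q because each logistic term log(1 + exp(u_j)) = logsumexp(0, u_j) is modeled with two exponential cones. The epigraph t_j >= logsumexp(0, u_j) is equivalent to exp(-t_j) + exp(u_j - t_j) <= 1, and each exponential is represented by one cone {(x, y, z) : y*exp(x/y) <= z, y > 0} with y = 1, which is exactly where the repeated [0; 1; 0] blocks appended to b come from.)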

These are the SCS outputs:

Solving small l1 regularized logistic regression.

SCS deprecation warning: The 'f' field in the cone struct has been replaced by 'z' to better reflect the Zero cone. Please replace usage of 'f' with 'z'. If both 'f' and 'z' are set then we sum the two fields to get the final zero cone size.

       SCS v3.2.3 - Splitting Conic Solver
(c) Brendan O'Donoghue, Stanford University, 2012

problem:  variables n: 30200, constraints m: 70200
cones:    l: linear vars: 10200
          e: exp vars: 60000, dual exp vars: 0
settings: eps_abs: 1.0e-04, eps_rel: 1.0e-04, eps_infeas: 1.0e-07
          alpha: 1.50, scale: 5.00e+00, adaptive_scale: 1
          max_iters: 100000, normalize: 1, rho_x: 1.00e-06
          acceleration_lookback: 10, acceleration_interval: 10
lin-sys:  sparse-direct-amd-qdldl
          nnz(A): 155635, nnz(P): 0

iter | pri res | dua res | gap | obj | scale | time (s)

 0| 2.00e+00  1.24e+03  6.67e+04  2.93e+04  5.00e+00  5.70e-01 

  250| 6.57e-03  2.93e-03  3.96e-03  3.88e+03  3.17e-01  3.14e+00
  500| 5.44e-03  2.47e-03  3.75e-03  3.88e+03  3.17e-01  5.89e+00
  750| 4.51e-03  2.04e-03  3.32e-03  3.88e+03  3.17e-01  8.66e+00
 1000| 3.56e-03  1.52e-03  1.77e-03  3.88e+03  3.17e-01  1.13e+01
 1250| 2.89e-03  1.24e-03  1.41e-03  3.88e+03  3.17e-01  1.39e+01
 1500| 2.35e-03  1.01e-03  1.26e-03  3.88e+03  3.17e-01  1.65e+01

status:  solved
timings: total: 1.65e+01s = setup: 5.55e-01s + solve: 1.60e+01s
         lin-sys: 3.15e+00s, cones: 1.18e+01s, accel: 2.08e-01s

objective = 3876.951963

SCS deprecation warning: The 'f' field in the cone struct has been replaced by 'z' to better reflect the Zero cone. Please replace usage of 'f' with 'z'. If both 'f' and 'z' are set then we sum the two fields to get the final zero cone size.

       SCS v3.2.3 - Splitting Conic Solver
(c) Brendan O'Donoghue, Stanford University, 2012

problem:  variables n: 30200, constraints m: 70200
cones:    l: linear vars: 10200
          e: exp vars: 60000, dual exp vars: 0
settings: eps_abs: 1.0e-04, eps_rel: 1.0e-04, eps_infeas: 1.0e-07
          alpha: 1.50, scale: 5.00e+00, adaptive_scale: 1
          max_iters: 100000, normalize: 1, rho_x: 1.00e-06
          acceleration_lookback: 10, acceleration_interval: 10
lin-sys:  sparse-indirect-scs
          nnz(A): 155635, nnz(P): 0

iter | pri res | dua res | gap | obj | scale | time (s)

 0| 2.00e+00  1.24e+03  6.67e+04  2.93e+04  5.00e+00  1.66e-01 

  250| 6.66e-03  3.39e-03  7.24e-03  3.88e+03  3.17e-01  3.43e+00
  500| 5.46e-03  2.80e-03  5.07e-03  3.88e+03  3.17e-01  6.56e+00
  750| 4.51e-03  2.09e-03  7.12e-03  3.88e+03  3.17e-01  9.63e+00
 1000| 3.57e-03  1.54e-03  9.56e-03  3.88e+03  3.17e-01  1.28e+01
 1250| 2.90e-03  1.40e-03  1.06e-02  3.88e+03  3.17e-01  1.58e+01
 1500| 2.36e-03  1.07e-03  1.06e-03  3.88e+03  3.17e-01  1.89e+01
 1525| 2.33e-03  9.96e-04  1.01e-03  3.88e+03  3.17e-01  1.92e+01

status:  solved
timings: total: 1.92e+01s = setup: 1.29e-01s + solve: 1.91e+01s
         lin-sys: 6.07e+00s, cones: 1.20e+01s, accel: 1.80e-01s

objective = 3876.952527

CVX Warning: Models involving "log_sum_exp" or other functions in the log, exp, and entropy family are solved using an experimental successive approximation method. This method is slower and less reliable than the method CVX employs for other models. Please see the section of the user's guide entitled The successive approximation method for more details about the approach, and for instructions on how to suppress this warning message in the future.

Calling SCS 3.2.3: 60200 variables, 40000 equality constraints


       SCS v3.2.3 - Splitting Conic Solver
(c) Brendan O'Donoghue, Stanford University, 2012

problem:  variables n: 40000, constraints m: 60200
cones:    l: linear vars: 200
          e: exp vars: 0, dual exp vars: 60000
settings: eps_abs: 1.0e-04, eps_rel: 1.0e-04, eps_infeas: 1.0e-07
          alpha: 1.50, scale: 1.00e-01, adaptive_scale: 1
          max_iters: 10000, normalize: 1, rho_x: 1.00e-06
          acceleration_lookback: 10, acceleration_interval: 10
lin-sys:  sparse-direct-amd-qdldl
          nnz(A): 250470, nnz(P): 0

iter | pri res | dua res | gap | obj | scale | time (s)

 0| 3.33e+01  1.80e+00  4.53e+05 -2.28e+05  1.00e-01  4.97e-01 

  250| 3.27e-02  1.48e-03  8.04e-01 -3.88e+03  1.69e+00  2.99e+00
  325| 7.77e-06  1.09e-06  1.62e-05 -3.88e+03  1.69e+00  3.76e+00

status:  solved
timings: total: 3.76e+00s = setup: 4.81e-01s + solve: 3.28e+00s
         lin-sys: 7.73e-01s, cones: 2.27e+00s, accel: 3.67e-02s

objective = -3876.966569


Status: Failed
Optimal value (cvx_optval): NaN

Solving med l1 regularized logistic regression.

SCS deprecation warning: The 'f' field in the cone struct has been replaced by 'z' to better reflect the Zero cone. Please replace usage of 'f' with 'z'. If both 'f' and 'z' are set then we sum the two fields to get the final zero cone size.

       SCS v3.2.3 - Splitting Conic Solver
(c) Brendan O'Donoghue, Stanford University, 2012

problem:  variables n: 302000, constraints m: 702000
cones:    l: linear vars: 102000
          e: exp vars: 600000, dual exp vars: 0
settings: eps_abs: 1.0e-04, eps_rel: 1.0e-04, eps_infeas: 1.0e-07
          alpha: 1.50, scale: 5.00e+00, adaptive_scale: 1
          max_iters: 100000, normalize: 1, rho_x: 1.00e-06
          acceleration_lookback: 10, acceleration_interval: 10
lin-sys:  sparse-direct-amd-qdldl
          nnz(A): 10141605, nnz(P): 0

iter | pri res | dua res | gap | obj | scale | time (s)

 0| 2.12e+00  1.22e+03  6.35e+05  2.71e+05  5.00e+00  4.79e+02 

  250| 3.64e-02  2.73e-01  2.28e-02  1.15e+04  1.42e-02  5.19e+02
  500| 2.56e-02  2.13e-04  1.62e-04  1.15e+04  2.45e-03  5.54e+02
  750| 2.56e-02  1.87e-04  3.80e-05  1.15e+04  8.49e-03  5.91e+02
 1000| 2.53e-02  4.83e-04  1.29e-03  1.15e+04  4.08e-02  6.29e+02
 1250| 2.51e-02  4.64e-04  1.42e-03  1.15e+04  4.08e-02  6.61e+02
 1500| 2.13e-02  1.76e-03  8.56e-04  1.15e+04  4.08e-02  6.93e+02
 1750| 2.11e-02  3.94e-04  1.25e-03  1.15e+04  4.08e-02  7.25e+02
 2000| 2.09e-02  3.54e-04  1.15e-03  1.15e+04  4.08e-02  7.57e+02
 2250| 2.06e-02  3.55e-04  1.16e-03  1.15e+04  4.08e-02  7.90e+02
 2500| 2.04e-02  3.67e-03  5.34e-04  1.15e+04  4.08e-02  8.22e+02
 2750| 2.02e-02  3.26e-04  1.09e-03  1.15e+04  4.08e-02  8.54e+02
 3000| 1.45e-02  3.21e-04  1.08e-03  1.15e+04  4.08e-02  8.86e+02
 3250| 1.37e-02  2.91e-04  9.91e-04  1.15e+04  4.08e-02  9.18e+02
 3500| 5.25e+02  2.06e+01  3.90e+00  1.16e+04  4.08e-02  9.50e+02
 3750| 1.33e-02  6.97e-04  6.36e-04  1.15e+04  4.08e-02  9.82e+02
 4000| 1.32e-02  2.75e-04  8.98e-04  1.15e+04  4.08e-02  1.01e+03
 4250| 1.31e-02  7.76e-04  1.05e-03  1.15e+04  4.08e-02  1.05e+03
 4500| 1.30e-02  2.67e-04  9.69e-04  1.15e+04  4.08e-02  1.08e+03
 4750| 1.28e-02  6.69e-04  8.67e-04  1.15e+04  4.08e-02  1.11e+03
 5000| 1.27e-02  6.05e-04  9.04e-04  1.15e+04  4.08e-02  1.14e+03
 5250| 1.26e-02  2.32e-03  4.00e-04  1.15e+04  4.08e-02  1.17e+03
 5500| 1.09e-02  3.45e-04  7.63e-04  1.15e+04  4.08e-02  1.21e+03
 5750| 1.08e-02  2.22e-04  8.18e-04  1.15e+04  4.08e-02  1.25e+03
 6000| 1.07e-02  2.12e-04  7.88e-04  1.15e+04  4.08e-02  1.28e+03
 6250| 3.83e+02  1.50e+01  2.28e+00  1.16e+04  4.08e-02  1.32e+03
 6500| 1.05e-02  2.03e-04  7.19e-04  1.15e+04  4.08e-02  1.35e+03
 6750| 1.04e-02  1.80e-04  5.97e-04  1.15e+04  4.08e-02  1.38e+03
 7000| 9.70e-03  1.80e-04  6.17e-04  1.15e+04  4.08e-02  1.41e+03
 7250| 9.60e-03  1.74e-04  6.17e-04  1.15e+04  4.08e-02  1.44e+03
 7500| 9.51e-03  1.75e-04  6.35e-04  1.15e+04  4.08e-02  1.47e+03
 7750| 9.42e-03  1.56e-04  5.80e-04  1.15e+04  4.08e-02  1.51e+03
 8000| 9.33e-03  1.67e-04  6.13e-04  1.15e+04  4.08e-02  1.54e+03
 8250| 9.24e-03  1.60e-04  5.82e-04  1.15e+04  4.08e-02  1.57e+03
 8500| 9.15e-03  1.67e-04  6.21e-04  1.15e+04  4.08e-02  1.61e+03
 8750| 9.06e-03  1.66e-04  6.15e-04  1.15e+04  4.08e-02  1.64e+03
 9000| 2.90e+02  1.13e+01  1.34e+00  1.15e+04  4.08e-02  1.68e+03
 9250| 8.88e-03  1.55e-04  5.83e-04  1.15e+04  4.08e-02  1.71e+03
 9500| 8.79e-03  1.53e-04  5.91e-04  1.15e+04  4.08e-02  1.74e+03
 9750| 8.71e-03  1.54e-04  6.01e-04  1.15e+04  4.08e-02  1.78e+03
10000| 8.62e-03  1.47e-04  5.77e-04  1.15e+04  4.08e-02  1.81e+03
10250| 8.54e-03  1.39e-04  5.43e-04  1.15e+04  4.08e-02  1.85e+03
10500| 8.45e-03  1.38e-04  5.41e-04  1.15e+04  4.08e-02  1.88e+03
10750| 8.37e-03  1.09e-03  3.14e-04  1.15e+04  4.08e-02  1.92e+03
11000| 8.29e-03  1.35e-04  5.30e-04  1.15e+04  4.08e-02  1.95e+03
11250| 8.21e-03  1.32e-04  5.23e-04  1.15e+04  4.08e-02  1.99e+03
11500| 8.13e-03  1.29e-04  5.05e-04  1.15e+04  4.08e-02  2.02e+03
11750| 2.93e+02  1.14e+01  8.78e-01  1.15e+04  4.08e-02  2.05e+03
12000| 7.97e-03  1.24e-04  5.01e-04  1.15e+04  4.08e-02  2.09e+03
12250| 7.89e-03  1.12e-04  4.57e-04  1.15e+04  4.08e-02  2.12e+03
12500| 7.82e-03  1.07e-04  4.19e-04  1.15e+04  4.08e-02  2.16e+03
12750| 7.74e-03  1.05e-04  4.26e-04  1.15e+04  4.08e-02  2.19e+03
12950| 6.29e-03  1.09e-03  3.34e-04  1.15e+04  4.08e-02  2.22e+03

status:  solved
timings: total: 2.22e+03s = setup: 4.79e+02s + solve: 1.74e+03s
         lin-sys: 6.69e+02s, cones: 9.30e+02s, accel: 1.85e+01s

objective = 11541.159071

SCS deprecation warning: The 'f' field in the cone struct has been replaced by 'z' to better reflect the Zero cone. Please replace usage of 'f' with 'z'. If both 'f' and 'z' are set then we sum the two fields to get the final zero cone size.

       SCS v3.2.3 - Splitting Conic Solver
(c) Brendan O'Donoghue, Stanford University, 2012

problem:  variables n: 302000, constraints m: 702000
cones:    l: linear vars: 102000
          e: exp vars: 600000, dual exp vars: 0
settings: eps_abs: 1.0e-04, eps_rel: 1.0e-04, eps_infeas: 1.0e-07
          alpha: 1.50, scale: 5.00e+00, adaptive_scale: 1
          max_iters: 100000, normalize: 1, rho_x: 1.00e-06
          acceleration_lookback: 10, acceleration_interval: 10
lin-sys:  sparse-indirect-scs
          nnz(A): 10141605, nnz(P): 0

iter | pri res | dua res | gap | obj | scale | time (s)

 0| 2.12e+00  1.22e+03  6.35e+05  2.71e+05  5.00e+00  5.89e+00 

  250| 6.28e-02  3.00e-01  3.65e+00  1.15e+04  1.43e-02  8.88e+01
  500| 4.26e-02  1.98e-03  8.00e-02  1.15e+04  1.43e-02  1.52e+02
  750| 9.66e+02  1.33e+01  4.54e+00  1.16e+04  1.43e-02  2.18e+02
 1000| 2.54e-02  2.87e-04  4.40e-03  1.15e+04  1.43e-02  2.83e+02
 1250| 2.53e-02  5.37e-05  8.35e-04  1.15e+04  1.43e-02  3.46e+02
 1500| 2.52e-02  8.89e-05  2.60e-03  1.15e+04  1.43e-02  4.11e+02
 1750| 2.51e-02  5.30e-05  5.94e-04  1.15e+04  1.43e-02  4.76e+02
 2000| 2.51e-02  1.43e-04  2.96e-03  1.15e+04  1.43e-02  5.43e+02
 2250| 2.50e-02  4.33e-05  1.45e-03  1.15e+04  1.43e-02  6.03e+02
 2500| 2.49e-02  5.16e-04  2.02e-02  1.15e+04  1.43e-02  6.64e+02
 2750| 2.13e-02  5.40e-05  2.15e-04  1.15e+04  1.43e-02  7.24e+02
 3000| 2.13e-02  4.94e-04  9.82e-03  1.15e+04  4.55e-02  7.83e+02
 3250| 2.08e-02  4.48e-04  5.29e-03  1.15e+04  4.55e-02  8.42e+02
 3500| 2.05e-02  4.04e-04  3.54e-04  1.15e+04  4.55e-02  8.98e+02
 3750| 2.03e-02  3.73e-04  3.63e-04  1.15e+04  4.55e-02  9.55e+02
 4000| 1.45e-02  1.23e-03  8.97e-03  1.15e+04  4.55e-02  1.01e+03
 4250| 1.37e-02  4.28e-04  1.37e-03  1.15e+04  4.55e-02  1.07e+03
 4500| 1.35e-02  3.62e-04  1.70e-03  1.15e+04  4.55e-02  1.12e+03
 4750| 1.59e+02  6.96e+00  3.86e+00  1.15e+04  4.55e-02  1.18e+03
 5000| 1.32e-02  3.26e-04  1.23e-04  1.15e+04  4.55e-02  1.24e+03
 5250| 1.30e-02  3.23e-04  7.89e-03  1.15e+04  4.55e-02  1.29e+03
 5500| 1.29e-02  3.23e-04  1.77e-04  1.15e+04  4.55e-02  1.35e+03
 5750| 1.28e-02  3.12e-04  2.32e-03  1.15e+04  4.55e-02  1.41e+03
 6000| 1.27e-02  3.29e-04  1.98e-03  1.15e+04  4.55e-02  1.46e+03
 6250| 1.10e-02  2.77e-04  6.06e-03  1.15e+04  4.55e-02  1.52e+03
 6500| 1.08e-02  2.92e-04  4.23e-03  1.15e+04  4.55e-02  1.57e+03
 6750| 1.07e-02  2.79e-04  8.65e-04  1.15e+04  4.55e-02  1.63e+03
 7000| 1.06e-02  2.96e-04  1.88e-03  1.15e+04  4.55e-02  1.69e+03
 7250| 1.05e-02  2.46e-04  6.39e-05  1.15e+04  4.55e-02  1.74e+03
 7500| 1.87e+02  8.27e+00  4.86e+00  1.15e+04  4.55e-02  1.80e+03
 7750| 9.65e-03  2.03e-04  3.32e-03  1.15e+04  4.55e-02  1.86e+03
 8000| 9.55e-03  2.09e-04  4.01e-03  1.15e+04  4.55e-02  1.91e+03
 8250| 9.45e-03  2.39e-04  2.61e-03  1.15e+04  4.55e-02  1.97e+03
 8500| 9.36e-03  2.00e-04  4.50e-03  1.15e+04  4.55e-02  2.03e+03
 8750| 9.24e-03  2.14e-04  1.68e-03  1.15e+04  4.55e-02  2.09e+03
 9000| 9.14e-03  2.01e-04  2.35e-03  1.15e+04  4.55e-02  2.14e+03
 9250| 9.04e-03  2.09e-04  1.35e-02  1.15e+04  4.55e-02  2.20e+03
 9500| 8.94e-03  1.93e-04  1.24e-04  1.15e+04  4.55e-02  2.25e+03
 9750| 8.86e-03  1.98e-04  7.05e-03  1.15e+04  4.55e-02  2.31e+03
10000| 8.75e-03  1.89e-04  4.47e-04  1.15e+04  4.55e-02  2.37e+03
10250| 5.76e+02  2.51e+01  6.78e+00  1.16e+04  4.55e-02  2.43e+03
10500| 8.56e-03  1.76e-04  6.58e-04  1.15e+04  4.55e-02  2.48e+03
10750| 8.47e-03  1.64e-04  2.80e-04  1.15e+04  4.55e-02  2.54e+03
11000| 8.37e-03  1.64e-04  5.24e-05  1.15e+04  4.55e-02  2.60e+03
11250| 8.29e-03  1.62e-04  8.43e-04  1.15e+04  4.55e-02  2.65e+03
11500| 8.19e-03  1.57e-04  4.27e-04  1.15e+04  4.55e-02  2.71e+03
11750| 8.10e-03  1.55e-04  1.79e-03  1.15e+04  4.55e-02  2.77e+03
12000| 8.02e-03  1.52e-04  8.22e-04  1.15e+04  4.55e-02  2.82e+03
12250| 7.93e-03  3.15e-04  7.74e-03  1.15e+04  4.55e-02  2.88e+03
12500| 7.89e-03  1.60e-04  4.53e-03  1.15e+04  4.55e-02  2.94e+03
12750| 7.75e-03  1.25e-04  4.97e-04  1.15e+04  4.55e-02  3.00e+03
12975| 6.29e-03  1.36e-04  1.37e-03  1.15e+04  4.55e-02  3.05e+03

status:  solved
timings: total: 3.05e+03s = setup: 4.68e+00s + solve: 3.04e+03s
         lin-sys: 2.02e+03s, cones: 9.02e+02s, accel: 1.78e+01s

objective = 11541.158291

Calling SCS 3.2.3: 602000 variables, 400000 equality constraints


       SCS v3.2.3 - Splitting Conic Solver
(c) Brendan O'Donoghue, Stanford University, 2012

problem:  variables n: 400000, constraints m: 602000
cones:    l: linear vars: 2000
          e: exp vars: 0, dual exp vars: 600000
settings: eps_abs: 1.0e-04, eps_rel: 1.0e-04, eps_infeas: 1.0e-07
          alpha: 1.50, scale: 1.00e-01, adaptive_scale: 1
          max_iters: 10000, normalize: 1, rho_x: 1.00e-06
          acceleration_lookback: 10, acceleration_interval: 10
lin-sys:  sparse-direct-amd-qdldl
          nnz(A): 19675210, nnz(P): 0

iter | pri res | dua res | gap | obj | scale | time (s)

 0| 7.23e+01  1.80e+00  4.17e+06 -2.10e+06  1.00e-01  3.60e+02 

  250| 2.24e-01  8.09e-03  2.43e+01 -1.17e+04  3.55e-01  4.08e+02
  500| 1.01e-03  4.62e-04  3.51e-03 -1.15e+04  3.55e-01  4.48e+02
  575| 4.18e-05  7.36e-06  4.08e-04 -1.15e+04  3.55e-01  4.60e+02

status:  solved
timings: total: 4.60e+02s = setup: 3.59e+02s + solve: 1.01e+02s
         lin-sys: 4.79e+01s, cones: 3.87e+01s, accel: 8.37e-01s

objective = -11542.301774


Status: Failed
Optimal value (cvx_optval): NaN

bodono commented 1 year ago

I can't really see what's going on there, but cvx/cvxpy may convert the problem in a different way than doing it by hand, and different formulations (even of the same problem) can have different solve characteristics.
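For a like-for-like comparison, one option is to let cvxpy compile the problem and then hand exactly that data to SCS yourself. A sketch (the dims attribute names and the "P" key are assumptions that may vary across cvxpy versions):

import cvxpy as cp
import scs

# Compile with cvxpy, then call SCS directly on the identical stuffed
# data, so any remaining speed gap is the formulation, not the interface.
data, chain, inv = problem.get_problem_data(cp.SCS)
dims = data["dims"]
cone = dict(z=dims.zero, l=dims.nonneg, q=dims.soc, s=dims.psd, ep=dims.exp)
scs_data = dict(A=data["A"], b=data["b"], c=data["c"])
if "P" in data:  # quadratic objective term, if present
    scs_data["P"] = data["P"]
solver = scs.SCS(scs_data, cone, eps_abs=1e-4, eps_rel=1e-4)
sol = solver.solve()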

caglarari commented 1 year ago

I see. Thanks.

The reason I asked about how to better use the SCS interface (directly, without cvx) is that I was wondering whether it is possible to make SCS faster for specific applications by using a different sparse linear system solver.

For instance, it is possible to solve an l2-regularized logistic regression problem with 500 features & 60,000 samples in under 0.30 secs using a truncated Newton with PCG MATLAB code, while it takes 96.2 secs to solve the same problem with SCS via cvx.

Regards, Caglar

SCS Output

Calling SCS 3.2.3: 360502 variables, 240001 equality constraints


       SCS v3.2.3 - Splitting Conic Solver
(c) Brendan O'Donoghue, Stanford University, 2012

problem:  variables n: 240001, constraints m: 360502
cones:    q: soc vars: 502, qsize: 1
          e: exp vars: 0, dual exp vars: 360000
settings: eps_abs: 1.0e-04, eps_rel: 1.0e-04, eps_infeas: 1.0e-07
          alpha: 1.50, scale: 1.00e-01, adaptive_scale: 1
          max_iters: 10000, normalize: 1, rho_x: 1.00e-06
          acceleration_lookback: 10, acceleration_interval: 10
lin-sys:  sparse-direct-amd-qdldl
          nnz(A): 3216368, nnz(P): 0

iter | pri res | dua res | gap | obj | scale | time (s)

 0| 3.39e+03  1.73e+00  2.55e+06 -1.27e+06  1.00e-01  7.33e+01 

  250| 1.98e-03  1.41e-03  5.45e-02 -1.19e+00  1.07e+02  8.33e+01
  500| 1.04e-04  9.52e-04  1.66e-04 -6.65e-01  1.07e+02  9.32e+01
  550| 5.89e-05  1.36e-04  1.11e-04 -6.65e-01  1.07e+02  9.64e+01

status:  solved
timings: total: 9.64e+01s = setup: 7.32e+01s + solve: 2.32e+01s
         lin-sys: 1.16e+01s, cones: 7.56e+00s, accel: 4.53e-01s

objective = -0.665310


Status: Failed
Optimal value (cvx_optval): NaN

Truncated Newton with PCG output:

iter      primal obj   stepsize    norg(g)  p_flg      p_res  p_itr
   0    6.931472e-01        Inf   3.52e-01      0   0.00e+00      0
   1    6.652097e-01   1.00e+00   2.29e-04      0   7.39e-05      2
   2    6.652097e-01   1.00e+00   1.81e-08      0   7.90e-05      2
   3    6.652097e-01   1.00e+00   1.18e-14      0   6.47e-07      3
Absolute tolerance reached.
500/500
Elapsed time is 0.294279 seconds.

MATLAB code for l2-regularized truncated Newton with PCG:

function [w,status,history] = l2_logreg(X,b,lambda,method,ptol,pmaxi)
%
% l2-regularized Logistic Regression Problem Solver
%
% L2_LOGREG solves problems of the following form:
%
%   minimize (1/m)*sum_i log(1+exp(-b_i*(x_i*w))) + sum_j lambda_j*w_j^2,
%
% where the variable is w and the problem data are x_i, b_i and lambda_j.
%
% INPUT
%
%  X      : mxn matrix; input data. each row corresponds to each feature
%  b      : m vector; class label
%  lambda : positive n-vector; regularization parameter
%
%  method : string; search direction method type
%           'cg'   : conjugate gradients method
%           'pcg'  : preconditioned conjugate gradients method
%           'exact': exact method (default value)
%  ptol   : scalar; pcg relative tolerance. if empty, use adaptive rule.
%  pmaxi  : scalar; pcg maximum iteration. if empty, use default value (500).
%
% OUTPUT
%
%  w      : n vector; classifier
%  status : scalar; +1: success, -1: maxiter exceeded
%  history:
%           row 1) phi
%           row 2) norm(gradient of phi)
%           row 3) cumulative cg iterations
%
% USAGE EXAMPLE
%
%  [w,status] = l2_logreg(X,b,lambda,'pcg');
%

% Written by Kwangmoo Koh <deneb1@stanford.edu>

%------------------------------------------------------------
%       INITIALIZE
%------------------------------------------------------------

% NEWTON PARAMETERS
MAX_TNT_ITER = 200;  % maximum (truncated) Newton iteration
ABSTOL = 1e-8;       % terminates when the norm of gradient < ABSTOL

% LINE SEARCH PARAMETERS
ALPHA = 0.01;        % minimum fraction of decrease in norm(gradient)
BETA = 0.5;          % stepsize decrease factor
MAX_LS_ITER = 100;   % maximum backtracking line search iteration

[m,n] = size(X);     % problem size: m examples, n features

if(isempty(pmaxi)) pcgmaxi = 500; else pcgmaxi = pmaxi; end
if(isempty(ptol )) pcgtol = 1e-4; else pcgtol = ptol; end

% INITIALIZE
pobj = Inf; s = inf; pitr = 0; pflg = 0; prelres = 0; pcgiter = 0;
history = [];

w = zeros(n,1); dw = zeros(n,1);

A = sparse(1:m,1:m,b)*X;   % fold the labels into the data matrix
%if (strcmp(method,'cg')||strcmp(method,'pcg')) A2 = A.^2; end
A2 = A.^2;

disp(sprintf('%s %15s %10s %10s %6s %10s %6s',...
    'iter','primal obj','stepsize','norg(g)','p_flg','p_res','p_itr'));

%------------------------------------------------------------
%       MAIN LOOP
%------------------------------------------------------------

for ntiter = 0:MAX_TNT_ITER

Aw = A*w;
expAw = exp(Aw);
expmAw = exp(-Aw);
g = -1/m./(1+expAw);          % gradient weights
h = 1/m./(2+expAw+expmAw);    % Hessian weights
gradphi = A'*g + 2*lambda.*w;
normg = norm(gradphi);

phi = sum(log(1+expmAw))/m + sum(lambda.*w.*w);
disp(sprintf('%4d %15.6e %10.2e %10.2e %6d %10.2e %6d',...
            ntiter,phi,s,normg,pflg,prelres,pitr));
history = [history [phi; normg; pcgiter]];

%------------------------------------------------------------
%   STOPPING CRITERION
%------------------------------------------------------------
if (norm(gradphi) < ABSTOL) 
    status = 1;
    disp('Absolute tolerance reached.');
    disp(sprintf('%d/%d',sum(abs((A2'*h)./(2*lambda))<0.5),n));
    return;
end
%------------------------------------------------------------
%       CALCULATE NEWTON STEP
%------------------------------------------------------------
switch lower(method)
    case 'pcg'
    if (isempty(ptol)) pcgtol = min(0.1,norm(gradphi)); end
    [dw, pflg, prelres, pitr, presvec] = ...
        pcg(@AXfunc,-gradphi,pcgtol,pcgmaxi,@Mfunc,[],[],...
            A,h,2*lambda,1./(A2'*h+2*lambda));
    if (pitr == 0) pitr = pcgmaxi; end

    case 'cg'
    if (isempty(ptol)) pcgtol = min(0.1,norm(gradphi)); end
    [dw, pflg, prelres, pitr, presvec] = ...
        pcg(@AXfunc,-gradphi,pcgtol,pcgmaxi,[],[],[],...
            A,h,2*lambda,[]);
    if (pitr == 0) pitr = pcgmaxi; end

    otherwise % exact method
    hessphi = A'*sparse(1:m,1:m,h)*A+2*sparse(1:n,1:n,lambda);
    dw = -hessphi\gradphi;
end
pcgiter = pcgiter+pitr;
%------------------------------------------------------------
%   BACKTRACKING LINE SEARCH
%------------------------------------------------------------
s = 1;
for lsiter = 1:MAX_LS_ITER
    new_w = w+s*dw;
    newgradphi = A'*(-1/m./(1+exp(A*new_w)))+2*lambda.*new_w;
    if (norm(newgradphi)<=(1-ALPHA*s)*normg) break; end
    s = BETA*s;
end
if (lsiter == MAX_LS_ITER) break; end
w = new_w;

end
status = -1;
end

%------------------------------------------------------------
%       CALL BACK FUNCTIONS FOR PCG
%------------------------------------------------------------
function y = AXfunc(x,A,h,d,p)
y = A'*(h.*(A*x)) + d.*x;
end

function y = Mfunc(x,A,h,d,p)
y = x.*p;
end

bodono commented 1 year ago

Rolling your own code for something like l2-squared logistic regression will definitely be faster than a general conic solver. The issue arises when you have non-differentiable terms or constraints, etc.; then you need something more general like SCS. By the way, yes, swapping out the linear solver can make a big difference; SCS supports MKL as an alternative backend in Python, C, etc. Still, it surely won't be faster than your own specialized solver for something like l2-regularized logistic regression.

caglarari commented 1 year ago

Thanks.

SCS definitely looks faster & more scalable than SDPT3 & SeDuMi. That's great. Thanks for SCS.

Sure, I am not expecting any general-purpose solver to be as fast as an application-specific algorithm, but something only 10x or 20x slower would have been great :)

By the way, no matter what the application is, I often use truncated Newton with a simple logarithmic barrier function for the inequalities. So I always solve the same differentiable objective for a sequence of barrier parameters t. It takes at most 40 to 80 iterations in total. The only thing that changes from application to application is the preconditioner (at worst, the diagonal of the Hessian) for conjugate gradient.

I am trying to figure out whether a first-order algorithm like SCS can be made faster than a second-order algorithm like truncated Newton for large-scale (inequality-constrained) applications by using a different linear system solver, etc.

Thanks again, Regards, Caglar

bodono commented 1 year ago

What you're describing is very similar (at least in spirit) to an interior point method. These second-order methods perform very well for small-ish problems, but eventually a first-order method like SCS will be faster. If you run these logistic regression problems through ECOS and compare to SCS, you will see that ECOS is faster for smaller problems but for larger problems SCS tends to win out.

caglarari commented 1 year ago

Yes, I am using the simplest interior point method described in Prof. Boyd's Convex Optimization book.

You are right, SCS is both faster & more scalable than ECOS, SeDuMi & SDPT3. So among general-purpose solvers, it seems like first-order algorithms like SCS win for larger problems.

The thing is, though, once I replaced the factor-solve part in that simple interior point algorithm with PCG, it was way, way faster (& more robust) than any other first-order algorithm I tried.

I did experiment with gradient descent & mirror descent. Gradient descent was slower than mirror descent for inequality-constrained convex problems where I used the logarithmic barrier function. This was expected, since I had nonlinear convex terms like the negative logarithm, the quadratic-over-linear function, etc. in the inequality constraints.

Mirror descent was expected to work better, since the gradient of a nonlinear convex function gives us an expression in terms of the dual variable (for instance, grad_s(-log s) = -1/s).

However, once I started to use PCG in place of the factor-solve in that simple interior point algorithm, the interior point algorithm was way, way faster for both small and large problems. It is very scalable, too. I don't even need to form the Hessian; I just need to write a function that multiplies the Hessian by a vector. So I am not even sure it still counts as a second-order method, as long as a small number of PCG steps is used to compute the Newton step.
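To make the matrix-free Newton step concrete, here is a minimal Python sketch of one truncated-Newton step for the same l2-regularized logistic objective as the MATLAB code above (scalar lam instead of a vector lambda; purely illustrative, not SCS code):

import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def newton_step_pcg(A, w, lam, m):
    # phi(w) = (1/m)*sum(log(1+exp(-A@w))) + lam*||w||^2, with the labels
    # already folded into A (A = diag(b) @ X), as in the MATLAB code above.
    Aw = A @ w
    g = -1.0 / m / (1.0 + np.exp(Aw))               # gradient weights
    h = 1.0 / m / (2.0 + np.exp(Aw) + np.exp(-Aw))  # Hessian weights
    grad = A.T @ g + 2.0 * lam * w

    n = w.size
    # Matrix-free Hessian-vector product: H v = A'*(h .* (A v)) + 2*lam*v,
    # so the Hessian itself is never formed.
    H = LinearOperator((n, n),
                       matvec=lambda v: A.T @ (h * (A @ v)) + 2.0 * lam * v)
    # Diagonal (Jacobi) preconditioner: diag(H) = (A.^2)' * h + 2*lam.
    dH = np.asarray(A.multiply(A).T @ h).ravel() + 2.0 * lam  # A sparse
    M = LinearOperator((n, n), matvec=lambda v: v / dH)

    dw, info = cg(H, -grad, M=M, maxiter=500)  # truncated: few PCG steps
    return dw, grad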

Now I am not aware of anything else I can add to make mirror descent (or gradient descent) faster. So that simple interior point algorithm with PCG seems like hands down the better choice.

However, your conic solver is also a first-order algorithm & it can also make use of PCG (the indirect method for solving the linear system). That's what I was wondering: could SCS be made to work faster for specific applications by using a different indirect linear system solver?

Thanks, Regards, Caglar

bodono commented 1 year ago

If using PCG to solve the linear system works for you in IPMs then great, you should use it! There has been a lot of work in that direction going back to at least the 90s by Jacek Gondzio. In my experience however, PCG tends not to work that well with IPMs. The reason being that the condition number of the linear system being solved blows up as you get close to the solution, so the number of CG steps required to solve it goes way up, and you need quite high accuracy solves for IPMs to work so it ends up being more expensive than just a standard factor-solve. I think it's likely that there are some tricks to control the conditioning that we haven't discovered yet though.
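(For intuition, assuming a standard log barrier: the barrier term -(1/t) * sum_i log(s_i) contributes Hessian blocks like diag(1/s_i^2), and the slacks s_i of the active constraints go to zero as you approach the solution, so the condition number grows like 1/min_i s_i^2.)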

As for SCS - it comes with an indirect (PCG) linear system solver built in and, unlike IPMs, the condition number is unchanged as you get close to the solution, ADMM generally tolerates some error in the solves, and tricks like pre-conditioning and warm-start work well. In other words the indirect solver is viable in a first-order method like SCS. That being said, the direct method (especially with MKL backend) just tends to be faster until you get to huge problem sizes, so that's the default.
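As a small illustration of the warm-start point, a sketch based on my reading of the SCS 3.x Python interface (treat the exact keyword names as an assumption; data and cone are dicts as in the earlier snippet, and b_new is a hypothetical perturbed right-hand side):

import scs

solver = scs.SCS(data, cone, eps_abs=1e-4, eps_rel=1e-4)
sol = solver.solve()  # cold start

# Re-solve a perturbed problem, warm-started at the previous solution;
# ADMM tolerates inexact solves, so much of the work is reused.
solver.update(b=b_new)
sol2 = solver.solve(warm_start=True, x=sol["x"], y=sol["y"], s=sol["s"])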

caglarari commented 1 year ago

I see. Thanks. I was actually wondering why I don't see truncated-Newton interior point method papers anymore; I guess it is due to the high-accuracy requirement. Before 2010 there were some, like https://web.stanford.edu/~boyd/papers/pdf/ls_num.pdf and https://web.stanford.edu/~boyd/papers/pdf/l1_logistic_reg.pdf

I personally do not really have to deal with equality constraints & I use about the simplest possible interior point algorithm, so I guess that's why it's easy to make PCG work. Also, I think for many applications (at least the ones I am working on), it would be enough if the barrier parameter t were fixed to something like 1000 (1/t = 0.001). So there isn't really any need for high accuracy.

After all, we have regularization parameters, so we probably move along a similar path as we gradually increase the regularization parameters anyway. I actually did try that, as described in this paper https://web.stanford.edu/~boyd/papers/pdf/fast_mpc.pdf, but it took more iterations (as they mentioned), so I went back to gradually increasing t until a certain accuracy is achieved or the maximum number of iterations is reached.

SCS works great. Thanks. I guess it would be better to implement a simple conic solver, get the (optimization) problem data from cvxpy for example applications, & compare it with both SCS & simple truncated Newton methods to get a better understanding of the algorithm.

Thanks again, Regards, Caglar
