cvxgrp / scs

Splitting Conic Solver
MIT License
553 stars 136 forks source link

SCS fails for l2 regularized logistic regression after certain problem size #252

Open caglarari opened 1 year ago

caglarari commented 1 year ago

Hi,

I am using SCS 3.2.3 via cvx on a windows 10 machine. I was testing SCS on simple ML problems gradually increasing the problem size to see how it performs.

I observed that after certain size (500 features, 80 000 samples), SCS starts to fail to solve l2 regularized logistic regression problems. In addition, after certain size (500 features, 40 000 samples), the final objective value SCS reports & the objective value computed using the optimum parameters returned by cvx starts to become different.

Do you know what's the issue?

Thanks, Regards, Caglar Ari

SCS Output: Calling SCS 3.2.3: 480502 variables, 320001 equality constraints


       SCS v3.2.3 - Splitting Conic Solver
(c) Brendan O'Donoghue, Stanford University, 2012

problem: variables n: 320001, constraints m: 480502 cones: q: soc vars: 502, qsize: 1 e: exp vars: 0, dual exp vars: 480000 settings: eps_abs: 1.0e-04, eps_rel: 1.0e-04, eps_infeas: 1.0e-07 alpha: 1.50, scale: 1.00e-01, adaptive_scale: 1 max_iters: 10000, normalize: 1, rho_x: 1.00e-06 acceleration_lookback: 10, acceleration_interval: 10 lin-sys: sparse-direct-amd-qdldl nnz(A): 40480002, nnz(P): 0

iter | pri res | dua res | gap | obj | scale | time (s)

 0| 7.66e+03  7.01e+00  3.25e+06 -1.63e+06  1.00e-01  3.48e+01 

250| 5.05e-03 5.70e-03 1.20e-01 -1.74e+00 2.26e+02 1.12e+02 500| 1.20e-03 2.55e-03 2.26e-02 -6.75e-01 2.26e+02 1.55e+02 750| 1.54e-02 5.85e-01 1.58e-02 1.34e-03 2.26e+02 2.01e+02 1000| 1.68e-03 1.35e-03 2.80e-02 -1.21e-01 2.26e+02 2.56e+02 1250| 1.56e-03 1.36e-03 2.83e-02 -2.85e-02 2.26e+02 3.11e+02 1500| 1.52e-03 1.41e-03 2.94e-02 6.00e-02 2.26e+02 3.65e+02 1750| 1.55e-03 1.51e-03 3.14e-02 1.52e-01 2.26e+02 4.18e+02 2000| 1.64e-03 1.68e-03 3.48e-02 2.56e-01 2.26e+02 4.74e+02 2250| 1.83e-03 1.95e-03 4.03e-02 3.86e-01 2.26e+02 5.22e+02 2500| 2.17e-03 2.39e-03 4.96e-02 5.66e-01 2.26e+02 5.68e+02 2750| 2.82e-03 3.22e-03 6.67e-02 8.56e-01 2.26e+02 6.16e+02 3000| 4.36e-03 5.16e-03 1.07e-01 1.48e+00 2.26e+02 6.61e+02 3250| 1.14e-02 1.40e-02 2.89e-01 4.16e+00 2.26e+02 7.05e+02 3425| 2.62e+09 1.78e+08 1.23e+17 3.23e+17 2.26e+02 7.34e+02

status: infeasible timings: total: 7.34e+02s = setup: 3.43e+01s + solve: 7.00e+02s lin-sys: 4.73e+02s, cones: 1.50e+02s, accel: 4.11e+00s

objective = inf

Here is the matlab code I used rand('state',0) randn('state',0)

sizes = [500 80000]; p = sizes(1,1); % number of features q = sizes(1,2); % number of samples

%Generate some data w_true = randn(p,1);
X_tmp = 3randn(p, q); X_tmp(p,:)=1; %set the last feature to 1 % ips = -w_true'X_tmp; ps = (exp(ips)./(1 + exp(ips)))'; labels = 2*(rand(q,1) < ps) - 1;

X_pos = X_tmp(:,labels==1); X_neg = X_tmp(:,labels==-1);

X = [X_pos -X_neg]; % include labels with data clear X_tmp ips ps labels;

Xcvx=[X_pos';X_neg']; kN1=size(X_pos,2); kN2=size(X_neg,2); bcvx=[ones(kN1,1);zeros(kN2,1)]; w_init=w_true; lam=1.0; fobj_ref=(-bcvx'full(Xcvx)w_init+sum(log_sum_exp([zeros(1,q);w_init'full(Xcvx)'])))/q+ lamsum_square(w_init);

cvx_use_solver = 'scs'; lam=1.0; %Use SCS via cvx for l2 reg logistic cvx_begin cvx_precision high cvx_solver(cvx_use_solver) variable w(p) minimize((-bcvx'full(Xcvx)w+sum(log_sum_exp([zeros(1,q);w'full(Xcvx)'])))/q+ lamsum_square(w)) cvx_end % fopt_cvx=((-bcvx'full(Xcvx)w+sum(log_sum_exp([zeros(1,q);w'full(Xcvx)'])))/q)+ lamsum_square(w); fprintf(1,"Cvx objective value: %f\n", fopt_cvx)

%Use truncated Newton for comparison Xh=[X_pos';X_neg']; kN1=size(X_pos,2); kN2=size(X_neg,2); Xh=full(Xh); bh=[ones(kN1,1);-ones(kN2,1)]; lambda=1.0; method='pcg'; ptol=1e-4; pmaxi=1000; tic [w1,status1,history1] = l2_logreg(Xh,bh,lambda,method,ptol,pmaxi); toc

%Compute the objective used in the truncated Newton code [mx,nx] = size(Xh); A = sparse(1:mx,1:mx,bh)Xh; Aw = Aw1; expAw = exp(Aw); expmAw = exp(-Aw); fobj_tnt=sum(log(1+expmAw))/mx+sum(lam.w1.w1);

%Compare the objective values fprintf(1,"fobj_cvx: %f\n",fopt_cvx) fprintf(1,"fobj_tnt: %f\n",fobj_tnt)

function [w,status,history] = l2_logreg(X,b,lambda,method,ptol,pmaxi) % % l2-regularized Logistic Regression Problem Solver % % L2_LOGREG solves problems of the following form: % % minimize (1/m) sum_i log(1+exp(-b_i(x_iw))) + sum_j lambda_iw_j^2, % % where variable is w and problem data are x_i, b_i and lambda_i. % % INPUT % % X : mxn matrix; input data. each row corresponds to each feature % b : m vector; class label % lambda : positive n-vector; regularization parameter % % method : string; search direction method type % 'cg' : conjugate gradients method, 'pcg' % 'pcg' : preconditioned conjugate gradients method % 'exact': exact method (default value) % ptol : scalar; pcg relative tolerance. if empty, use adaptive rule. % pmaxi : scalar: pcg maximum iteration. if empty, use default value (500). % % OUTPUT % % w : n vector; classifier % status : scalar; +1: success, -1: maxiter exceeded % history : % row 1) phi % row 2) norm(gradient of phi) % row 3) cumulative cg iterations % % USAGE EXAMPLE % % [w,status] = l2_logreg(X,b,lambda,'pcg'); %

% Written by Kwangmoo Koh deneb1@stanford.edu

%------------------------------------------------------------ % INITIALIZE %------------------------------------------------------------

% NEWTON PARAMETERS MAX_TNT_ITER = 200; % maximum (truncated) Newton iteration ABSTOL = 1e-8; % terminates when the norm of gradient < ABSTOL

% LINE SEARCH PARAMETERS ALPHA = 0.01; % minimum fraction of decrease in norm(gradient) BETA = 0.5; % stepsize decrease factor MAX_LS_ITER = 100; % maximum backtracking line search iteration

[m,n] = size(X); % problem size: m examples, n features

if(isempty(pmaxi)) pcgmaxi = 500; else pcgmaxi = pmaxi; end if(isempty(ptol )) pcgtol = 1e-4; else pcgtol = ptol; end

% INITIALIZE pobj = Inf; s = inf; pitr = 0 ; pflg = 0 ; prelres = 0; pcgiter = 0; history = [];

w = zeros(n,1); dw = zeros(n,1);

A = sparse(1:m,1:m,b)*X; %if (strcmp(method,'cg')||strcmp(method,'pcg')) A2 = A.^2; end A2 = A.^2;

disp(sprintf('%s %15s %10s %10s %6s %10s %6s',... 'iter','primal obj','stepsize','norg(g)','p_flg','p_res','p_itr'));

%------------------------------------------------------------ % MAIN LOOP %------------------------------------------------------------

for ntiter = 0:MAX_TNT_ITER Aw = Aw; expAw = exp(Aw); expmAw = exp(-Aw); g = -1/m./(1+expAw); h = 1/m./(2+expAw+expmAw); gradphi = A'g+2lambda.w; normg = norm(gradphi);

phi = sum(log(1+expmAw))/m+sum(lambda.*w.*w);
disp(sprintf('%4d %15.6e %10.2e %10.2e %6d %10.2e %6d',...
            ntiter,phi,s,normg,pflg,prelres,pitr));
history = [history [phi; normg; pcgiter]];

%------------------------------------------------------------
%   STOPPING CRITERION
%------------------------------------------------------------
if (norm(gradphi) < ABSTOL) 
    status = 1;
    disp('Absolute tolerance reached.');
    disp(sprintf('%d/%d',sum(abs((A2'*h)./(2*lambda))<0.5),n));
    return;
end
%------------------------------------------------------------
%       CALCULATE NEWTON STEP
%------------------------------------------------------------
switch lower(method)
    case 'pcg'
    if (isempty(ptol)) pcgtol = min(0.1,norm(gradphi)); end
    [dw, pflg, prelres, pitr, presvec] = ...
        pcg(@AXfunc,-gradphi,pcgtol,pcgmaxi,@Mfunc,[],[],...
            A,h,2*lambda,1./(A2'*h+2*lambda));
    if (pitr == 0) pitr = pcgmaxi; end

    case 'cg'
    if (isempty(ptol)) pcgtol = min(0.1,norm(gradphi)); end
    [dw, pflg, prelres, pitr, presvec] = ...
        pcg(@AXfunc,-gradphi,pcgtol,pcgmaxi,[],[],[],...
            A,h,2*lambda,[]);
    if (pitr == 0) pitr = pcgmaxi; end

    otherwise % exact method
    hessphi = A'*sparse(1:m,1:m,h)*A+2*sparse(1:n,1:n,lambda);
    dw = -hessphi\gradphi;
end
pcgiter = pcgiter+pitr;
%------------------------------------------------------------
%   BACKTRACKING LINE SEARCH
%------------------------------------------------------------
s = 1;
for lsiter = 1:MAX_LS_ITER
    new_w = w+s*dw;
    newgradphi = A'*(-1/m./(1+exp(A*new_w)))+2*lambda.*new_w;
    if (norm(newgradphi)<=(1-ALPHA*s)*normg) break; end
    s = BETA*s;
end
if (lsiter == MAX_LS_ITER) break; end
w = new_w;

end status = -1; end %------------------------------------------------------------ % CALL BACK FUNCTIONS FOR PCG %------------------------------------------------------------ function y = AXfunc(x,A,h,d,p) y = A'(h.(Ax))+d.x; end

function y = Mfunc(x,A,h,d,p) y = x.*p; end

bodono commented 1 year ago

This is probably just requires tuning eps_abs and eps_rel for more/less accuracy, and reducing eps_infeas to prevent erroneously returning infeasible, which appears to be what has happened. SCS has found a certificate of infeasibility for the conic formulation of the problem to the desired accuracy, so tighten the infeasibility tolerance to prevent that if it's not what you want.

I'm not sure if CVX is really being updated much these days, but if you switch to CVXPY (or just call SCS directly) then you can use the quadratic version of the solver which will be able to handle the l2 regularization without using a second-order cone, which will likely be faster and more reliable.