When doing nested BIC search, do we need to update both U and V? (The current code holds V constant while optimizing over U, and vice versa.) Formally, in the BIC expression, `v` is a function of `u_hat`, so it seems weird not to update `v`. On the flip side, we're just optimizing a (constrained) regression. My concern is that if we're just solving the penalized regression problem and `Xv` is a sub-unit vector, then we can always choose `u` to be exactly `Xv`, which gives zero residual variance and stumbles into a `log(0)` problem. (Or at least, that's what I got out of a quick skim. Please correct me if I'm wrong @Banana1530.)
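To make the worry concrete, here is the BIC as I read it off the code below, together with the degeneracy I'm worried about when `u` is only constrained to the unit ball:

```latex
\mathrm{BIC}(\lambda_u)
  = \log\!\left(\frac{\lVert X\hat{v} - \hat{u}\rVert_2^2}{n}\right)
  + \frac{\log n}{n}\,\lVert \hat{u}\rVert_0,
\qquad
\lVert X\hat{v}\rVert_2 \le 1
  \;\Rightarrow\;
  \hat{u} = X\hat{v} \text{ is feasible, and the first term is } \log 0 = -\infty.
```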
For well-initialized V this doesn't make a big difference, so we can usually get away with it: in particular, if V is well initialized (maybe because we're already using a full grid, or the tuning parameters are small enough that the SVD is very close to correct), then `V_init` and `V_stationary` are close, so `BIC(U, V_init)` and `BIC(U, V_stationary)` will be similar. That seems like too strong an assumption to make in practice, though.
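Spelling out the continuity argument I'm leaning on (my notation, not from the code): the BIC is continuous in `v` wherever the residual is nonzero, so

```latex
\hat{v}_{\mathrm{init}} \approx \hat{v}_{\mathrm{stat}}
  \;\Longrightarrow\;
  \mathrm{BIC}(\hat{u}, \hat{v}_{\mathrm{init}})
    \approx \mathrm{BIC}(\hat{u}, \hat{v}_{\mathrm{stat}}),
```

i.e. skipping the `v`-update only costs us as much as the initialization is off.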
The experiments below show what I mean. If we set `UPDATE_V = 0` but leave `INITIALIZE_SVD = 1`, we are saved by the fact that `v_star` and `V1` are pretty close; but if we set both flags to 0, corresponding to a random initialization of `v_hat` with no updates, things go badly. If we set `INITIALIZE_SVD = 0` but keep `UPDATE_V = 1`, we do better, but not nearly as well as with the SVD initialization. (Running 1000 replicates, I see the same pattern.)
```matlab
UPDATE_V = 1;        % Set to 0 to hold V fixed at its initialization value
INITIALIZE_SVD = 1;  % Set to 0 to initialize u_hat, v_hat to random unit vectors
n = 50; p = 25; s = 5;                   % Dimensions and sparsity level
u_star = [ones(s, 1); zeros(n - s, 1)];  % True sparse factors
v_star = [zeros(p - s, 1); ones(s, 1)];
d = 3;                                   % Signal strength
N = randn(n, p);                         % Noise
S = d * u_star * v_star';                % Rank-one signal
X = S + N;
[U, ~, V] = svd(X);
st = @(x, lambda) sign(x) .* max(abs(x) - lambda, 0);  % Soft-thresholding operator
% Initialize SFPCA at the leading singular vector pair
U1 = U(:, 1); V1 = V(:, 1);
% SFPCA search on lambda_U
% lambda_v is held at 0 throughout (no sparsity penalty on V)
lambda_u_range = linspace(0, 5, 51);
n_lambda_u = numel(lambda_u_range);
sigma_hat_holder = zeros(size(lambda_u_range));
bic_holder = zeros(size(lambda_u_range));
df_holder = zeros(size(lambda_u_range));
for lu_ix = 1:n_lambda_u
    lu = lambda_u_range(lu_ix);
    % Quick and dirty SFPCA - only doing sparsity in U
    if INITIALIZE_SVD
        u_hat = U1;
        v_hat = V1;
    else
        u_hat = randn(n, 1); u_hat = u_hat / norm(u_hat);
        v_hat = randn(p, 1); v_hat = v_hat / norm(v_hat);
    end
    % Alternate u- and v-updates until joint convergence
    u_hat_old = u_hat + 5000; v_hat_old = v_hat + 5000;
    while norm(u_hat - u_hat_old) + norm(v_hat - v_hat_old) > 1e-6
        u_hat_old = u_hat; v_hat_old = v_hat;
        u_hat = st(X * v_hat, lu);
        if norm(u_hat) == 0, break; end  % Everything thresholded away
        u_hat = u_hat / norm(u_hat);
        if UPDATE_V
            v_hat = st(X' * u_hat, 0);   % lambda_v = 0: no sparsity in V
            v_hat = v_hat / norm(v_hat);
        end
    end
    sigma_hat_sq = mean((X * v_hat - u_hat).^2);  % Residual variance, RSS / n
    sigma_hat_holder(lu_ix) = sigma_hat_sq;
    df_holder(lu_ix) = sum(u_hat ~= 0);
    % BIC = log(RSS / n) + log(n)/n * df; sigma_hat_sq is already RSS / n
    bic_holder(lu_ix) = log(sigma_hat_sq) + log(n) / n * sum(u_hat ~= 0);
end
[min_bic, min_bic_ind] = min(bic_holder);
lambda_u_optimal = lambda_u_range(min_bic_ind);
% Refit at the selected lambda_u - quick and dirty SFPCA, sparsity in U only
if INITIALIZE_SVD
    u_hat = U1;
    v_hat = V1;
else
    u_hat = randn(n, 1); u_hat = u_hat / norm(u_hat);
    v_hat = randn(p, 1); v_hat = v_hat / norm(v_hat);
end
u_hat_old = u_hat + 5000; v_hat_old = v_hat + 5000;
while norm(u_hat - u_hat_old) + norm(v_hat - v_hat_old) > 1e-6
    u_hat_old = u_hat; v_hat_old = v_hat;
    u_hat = st(X * v_hat, lambda_u_optimal);
    if norm(u_hat) == 0, break; end  % Everything thresholded away
    u_hat = u_hat / norm(u_hat);
    if UPDATE_V
        v_hat = st(X' * u_hat, 0);   % lambda_v = 0: no sparsity in V
        v_hat = v_hat / norm(v_hat);
    end
end
snr = max(svd(X)) / max(svd(N));         % Signal-to-noise ratio
u_hat_supp = u_hat ~= 0;                 % Estimated support of u
u_star_supp = u_star ~= 0;               % True support
u_star_nonsupp = u_star == 0;
tpr = mean(u_hat_supp(u_star_supp));     % True positive rate
fpr = mean(u_hat_supp(u_star_nonsupp));  % False positive rate
```
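For reference, a minimal sketch of the harness behind the 1000-replicate numbers; `run_once` is a hypothetical wrapper around the script above (returning `tpr` and `fpr` for given flag settings), not something in the current code:

```matlab
% Hypothetical harness: run_once is assumed to wrap the experiment above as
% [tpr, fpr] = run_once(UPDATE_V, INITIALIZE_SVD); it is not in the repo.
n_reps = 1000;
tprs = zeros(n_reps, 1); fprs = zeros(n_reps, 1);
for rep = 1:n_reps
    [tprs(rep), fprs(rep)] = run_once(1, 1);  % e.g. UPDATE_V = 1, INITIALIZE_SVD = 1
end
fprintf('Mean TPR: %.3f, mean FPR: %.3f\n', mean(tprs), mean(fprs));
```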
My hunch is that the `UPDATE_V = 0, INITIALIZE_SVD = 1` case would suffer more in harder settings than this one: the angle between `V1(S + N)` and `v_star` isn't very large for this problem.
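A quick way to check that angle under the setup above (my addition, not part of the experiment):

```matlab
% Angle between the top right singular vector of X = S + N and v_star
cos_angle = abs(V1' * (v_star / norm(v_star)));
angle_deg = acosd(min(cos_angle, 1));  % Clamp to guard against rounding past 1
fprintf('Angle between V1 and v_star: %.1f degrees\n', angle_deg);
```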