When doing nested BIC search, do we need to update both U and V? (The current code holds V constant while optimizing over U, and vice versa.) Formally, in the BIC expression, `v` is a function of `u_hat`, so it seems weird not to update `v`. On the flip side, we're just optimizing a (constrained) regression. My concern is that if we're just solving the penalized regression problem and `Xv` is a sub-unit vector, then we can always choose `u` to be exactly `Xv`, which gives zero residual variance and stumbles into a `log(0)` problem. (Or at least, that's what I got out of a quick skim. Please correct me if I'm wrong @Banana1530.)
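To make the worry concrete, here is the BIC as I read it off the code below, together with the degeneracy I'm worried about when `u` is only constrained to the unit ball:

```latex
\mathrm{BIC}(\lambda_u)
  = \log\!\left(\frac{\lVert X\hat{v} - \hat{u}\rVert_2^2}{n}\right)
  + \frac{\log n}{n}\,\lVert \hat{u}\rVert_0,
\qquad
\lVert X\hat{v}\rVert_2 \le 1
  \;\Rightarrow\;
  \hat{u} = X\hat{v} \text{ is feasible, and the first term is } \log 0 = -\infty.
```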
For well-initialized V this doesn't make a big difference, so we can usually get away with it: in particular, if V is well initialized (maybe because we're already using a full grid, or the tuning parameters are small enough that the SVD is very close to correct), then `V_init` and `V_stationary` are close, so `BIC(U, V_init)` and `BIC(U, V_stationary)` will be similar. That seems like too strong an assumption to make in practice, though.
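Spelling out the continuity argument I'm leaning on (my notation, not from the code): the BIC is continuous in `v` wherever the residual is nonzero, so

```latex
\hat{v}_{\mathrm{init}} \approx \hat{v}_{\mathrm{stat}}
  \;\Longrightarrow\;
  \mathrm{BIC}(\hat{u}, \hat{v}_{\mathrm{init}})
    \approx \mathrm{BIC}(\hat{u}, \hat{v}_{\mathrm{stat}}),
```

i.e. skipping the `v`-update only costs us as much as the initialization is off.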
The experiments below show what I mean. If we set `UPDATE_V = 0` but leave `INITIALIZE_SVD = 1`, we are saved by the fact that `v_star` and `V1` are pretty close; but if we set both flags to 0, corresponding to a random initialization of `v_hat` with no updates, things go badly. If we set `INITIALIZE_SVD = 0` but keep `UPDATE_V = 1`, we do better, but not nearly as well as with the SVD initialization. (Running 1000 replicates, I see the same pattern.)
```matlab
UPDATE_V = 1;        % Set to 0 to hold V fixed at its initialization value
INITIALIZE_SVD = 1;  % Set to 0 to initialize u_hat, v_hat to random unit vectors
n = 50; p = 25; s = 5;                   % Dimensions and sparsity level
u_star = [ones(s, 1); zeros(n - s, 1)];  % True sparse factors
v_star = [zeros(p - s, 1); ones(s, 1)];
d = 3;                                   % Signal strength
N = randn(n, p);                         % Noise
S = d * u_star * v_star';                % Rank-one signal
X = S + N;
[U, ~, V] = svd(X);
st = @(x, lambda) sign(x) .* max(abs(x) - lambda, 0);  % Soft-thresholding operator
% Initialize SFPCA at the leading singular vector pair
U1 = U(:, 1); V1 = V(:, 1);
% SFPCA search on lambda_U
% lambda_v is held at 0 throughout (no sparsity penalty on V)
lambda_u_range = linspace(0, 5, 51);
n_lambda_u = numel(lambda_u_range);
sigma_hat_holder = zeros(size(lambda_u_range));
bic_holder = zeros(size(lambda_u_range));
df_holder = zeros(size(lambda_u_range));
for lu_ix = 1:n_lambda_u
    lu = lambda_u_range(lu_ix);
    % Quick and dirty SFPCA - only doing sparsity in U
    if INITIALIZE_SVD
        u_hat = U1;
        v_hat = V1;
    else
        u_hat = randn(n, 1); u_hat = u_hat / norm(u_hat);
        v_hat = randn(p, 1); v_hat = v_hat / norm(v_hat);
    end
    % Alternate u- and v-updates until joint convergence
    u_hat_old = u_hat + 5000; v_hat_old = v_hat + 5000;
    while norm(u_hat - u_hat_old) + norm(v_hat - v_hat_old) > 1e-6
        u_hat_old = u_hat; v_hat_old = v_hat;
        u_hat = st(X * v_hat, lu);
        if norm(u_hat) == 0, break; end  % Everything thresholded away
        u_hat = u_hat / norm(u_hat);
        if UPDATE_V
            v_hat = st(X' * u_hat, 0);   % lambda_v = 0: no sparsity in V
            v_hat = v_hat / norm(v_hat);
        end
    end
    sigma_hat_sq = mean((X * v_hat - u_hat).^2);  % Residual variance, RSS / n
    sigma_hat_holder(lu_ix) = sigma_hat_sq;
    df_holder(lu_ix) = sum(u_hat ~= 0);
    % BIC = log(RSS / n) + log(n)/n * df; sigma_hat_sq is already RSS / n
    bic_holder(lu_ix) = log(sigma_hat_sq) + log(n) / n * sum(u_hat ~= 0);
end
[min_bic, min_bic_ind] = min(bic_holder);
lambda_u_optimal = lambda_u_range(min_bic_ind);
% Refit at the selected lambda_u - quick and dirty SFPCA, sparsity in U only
if INITIALIZE_SVD
    u_hat = U1;
    v_hat = V1;
else
    u_hat = randn(n, 1); u_hat = u_hat / norm(u_hat);
    v_hat = randn(p, 1); v_hat = v_hat / norm(v_hat);
end
u_hat_old = u_hat + 5000; v_hat_old = v_hat + 5000;
while norm(u_hat - u_hat_old) + norm(v_hat - v_hat_old) > 1e-6
    u_hat_old = u_hat; v_hat_old = v_hat;
    u_hat = st(X * v_hat, lambda_u_optimal);
    if norm(u_hat) == 0, break; end  % Everything thresholded away
    u_hat = u_hat / norm(u_hat);
    if UPDATE_V
        v_hat = st(X' * u_hat, 0);   % lambda_v = 0: no sparsity in V
        v_hat = v_hat / norm(v_hat);
    end
end
snr = max(svd(X)) / max(svd(N));         % Signal-to-noise ratio
u_hat_supp = u_hat ~= 0;                 % Estimated support of u
u_star_supp = u_star ~= 0;               % True support
u_star_nonsupp = u_star == 0;
tpr = mean(u_hat_supp(u_star_supp));     % True positive rate
fpr = mean(u_hat_supp(u_star_nonsupp));  % False positive rate
```
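For reference, a minimal sketch of the harness behind the 1000-replicate numbers; `run_once` is a hypothetical wrapper around the script above (returning `tpr` and `fpr` for given flag settings), not something in the current code:

```matlab
% Hypothetical harness: run_once is assumed to wrap the experiment above as
% [tpr, fpr] = run_once(UPDATE_V, INITIALIZE_SVD); it is not in the repo.
n_reps = 1000;
tprs = zeros(n_reps, 1); fprs = zeros(n_reps, 1);
for rep = 1:n_reps
    [tprs(rep), fprs(rep)] = run_once(1, 1);  % e.g. UPDATE_V = 1, INITIALIZE_SVD = 1
end
fprintf('Mean TPR: %.3f, mean FPR: %.3f\n', mean(tprs), mean(fprs));
```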
My hunch is that the `UPDATE_V = 0, INITIALIZE_SVD = 1` case would suffer more in harder settings than this one: the angle between `V1(S + N)` and `v_star` isn't very large for this problem.
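A quick way to check that angle under the setup above (my addition, not part of the experiment):

```matlab
% Angle between the top right singular vector of X = S + N and v_star
cos_angle = abs(V1' * (v_star / norm(v_star)));
angle_deg = acosd(min(cos_angle, 1));  % Clamp to guard against rounding past 1
fprintf('Angle between V1 and v_star: %.1f degrees\n', angle_deg);
```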