Or this:

But setting lamT just a little higher makes things OK?

And setting lamT a little lower looks strange, but legal:
Since the Theta regularization is the variable here, is the problem in our Theta coordinate descent?

Seems likely, since I tried setting the slack parameter in the backtracking to 1e-9. This means that any accepted Lambda step is guaranteed to reduce the loss...
Here is the new descent check:
# predicted decrease: linearization of the smooth part along the Newton
# direction, plus the exact change in the off-diagonal l1 penalty
rhs = np.trace(np.dot(self.grad_wrt_Lam(fixed, vary), newton_lambda)) + \
      self.lamL * self.l1_norm_off_diag(self.Lam + newton_lambda) - \
      self.lamL * self.l1_norm_off_diag(self.Lam)

# actual change in the penalized negative log likelihood at step size alpha
lhs = self.l1_neg_log_likelihood_wrt_Lam(self.Lam + alpha * newton_lambda, fixed, vary) - \
      self.l1_neg_log_likelihood_wrt_Lam(self.Lam, fixed, vary)

# accept the step only if the actual change is at most a slack-scaled
# fraction of the predicted decrease
lhs <= alpha * self.slack * rhs
With slack = 1e-9, we are essentially checking

lhs <= 0

which means
self.l1_neg_log_likelihood_wrt_Lam(self.Lam + alpha * newton_lambda, fixed, vary) <= \
    self.l1_neg_log_likelihood_wrt_Lam(self.Lam, fixed, vary)
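For reference, here is a minimal self-contained sketch of the kind of backtracking loop this check lives in. The names (backtrack, f, grad, beta) are stand-ins, not the classes from this repo, and the rhs here uses only the smooth Armijo term, whereas the check above also adds the change in the off-diagonal l1 penalty:

import numpy as np

def backtrack(f, grad, x, direction, slack=1e-9, beta=0.5, max_iter=50):
    # Backtracking line search with a sufficient-decrease check.
    # With slack ~ 1e-9 the condition degenerates to "the step must not
    # increase f", exactly as argued above.
    rhs = np.dot(grad(x), direction)  # predicted change; < 0 for a descent direction
    alpha = 1.0
    for _ in range(max_iter):
        lhs = f(x + alpha * direction) - f(x)  # actual change in the loss
        if lhs <= alpha * slack * rhs:
            return alpha  # accept the step
        alpha *= beta  # shrink the step and retry
    return 0.0  # no acceptable step found

# toy quadratic: f(x) = ||x||^2, so -x is a descent (Newton) direction
f = lambda x: np.dot(x, x)
grad = lambda x: 2 * x
x0 = np.array([3.0, -2.0])
print(backtrack(f, grad, x0, -x0))  # 1.0: the full step already decreases f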
On toy problems the optimization works very well for all of the cases posted above. I tried it on random 50x50 cluster graphs.
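For context, a test problem like that can be generated along the lines below; the block structure, sizes, and diagonal boost are arbitrary choices for illustration, not necessarily what the actual tests used:

import numpy as np

def random_cluster_precision(n_blocks=5, block_size=10, seed=0):
    # Block-diagonal ("cluster graph") precision matrix: dense SPD blocks,
    # zeros between clusters. 5 blocks of size 10 gives a 50x50 matrix.
    rng = np.random.RandomState(seed)
    p = n_blocks * block_size
    theta = np.zeros((p, p))
    for b in range(n_blocks):
        A = rng.randn(block_size, block_size)
        block = np.dot(A, A.T) + block_size * np.eye(block_size)  # make the block SPD
        s = b * block_size
        theta[s:s + block_size, s:s + block_size] = block
    return theta

theta = random_cluster_precision()
print(theta.shape)                                   # (50, 50)
print(bool(np.all(np.linalg.eigvalsh(theta) > 0)))   # True: positive definite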
This plot shows that even in the problem where our overall loss increases, the loss always decreases after each Lambda update.

So the problem cannot be in the backtracking logic.
The equations must still be wrong: on real problems we still get increasing cost.
Also, the optimization is currently very sensitive to the regularization values; changing them slightly can produce very different results.
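One way to quantify that sensitivity is to sweep a penalty on a log grid and count how many edges each setting recovers. Everything below is a toy stand-in (fit_model here just soft-thresholds a regularized empirical inverse covariance; it is not this repo's estimator):

import numpy as np

def fit_model(X, lamL):
    # Toy stand-in for the real estimator: invert a regularized empirical
    # covariance and soft-threshold the result by lamL.
    S = np.cov(X, rowvar=False) + 1e-3 * np.eye(X.shape[1])
    Lam = np.linalg.inv(S)
    return np.sign(Lam) * np.maximum(np.abs(Lam) - lamL, 0.0)

def n_edges(Lam, tol=1e-6):
    # Count nonzero off-diagonal entries (each edge counted once).
    off = Lam - np.diag(np.diag(Lam))
    return int((np.abs(off) > tol).sum() // 2)

rng = np.random.RandomState(0)
X = rng.randn(200, 20)
for lamL in np.logspace(-2, 0, 5):
    print("lamL=%.3g  edges=%d" % (lamL, n_edges(fit_model(X, lamL))))

On real data, large jumps in the recovered support between adjacent grid points would confirm the sensitivity described above.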