Open jsoules opened 4 months ago
I discovered this while writing the momentum parameter adaptation in A-NMD. It had convergence issues when using the proxy `Z - L` instead of `np.maximum(0.0, L) - X`.
I added a notebook that displays the issue on an example from our tests. The following zoomed-in diagram shows the proxy loss `Z - L` on the x-axis and `np.maximum(0.0, L) - X` on the y-axis. The left plot is without momentum; the right one uses a high momentum parameter.
Based on this, I suspect the issue is not due to the positive values, but due to the negative ones.
Which of these best describes your feature request:
Describe how the new feature would improve the library:

As pointed out in PR #18, most of the kernel methods compute the loss for each iteration by taking the norm of the difference between the utility matrix `Z` and the low-rank reconstruction `L`. This serves as a proxy for the true optimization target, which is the loss between the post-ReLU low-rank candidate matrix `max(0.0, L)` and the input sparse matrix `X`. The proxy is fine in most cases, because by construction `Z`'s only positive values are those of the original sparse matrix `X`.

However, momentum-based methods (such as the Aggressive Momentum method implemented in #18) risk breaking that property: the momentum effect may create positive values in `Z` that do not match `X`.

This raises the question of whether it would be desirable in principle to compute the loss based on the actual reconstruction instead of the proxy, even when the momentum is not an issue.
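To make the difference concrete, here is a minimal sketch on a hand-built 2x2 example. It assumes the Frobenius norm and assumes `Z` is built by keeping `X` where `X > 0` and taking the non-positive part of `L` elsewhere; the helper names (`build_Z`, `proxy_loss`, `true_loss`) and the extrapolation step `Z + beta * (Z - Z_prev)` are illustrative, not the library's actual code.

```python
import numpy as np


def build_Z(X, L):
    # Assumed construction: X on its support, min(0, L) elsewhere,
    # so the only positive entries of Z are those of X.
    return np.where(X > 0, X, np.minimum(0.0, L))


def proxy_loss(Z, L):
    # Proxy: Frobenius norm of Z - L (norm choice is an assumption).
    return np.linalg.norm(Z - L)


def true_loss(X, L):
    # Target: Frobenius norm of the post-ReLU reconstruction minus X.
    return np.linalg.norm(np.maximum(0.0, L) - X)


X = np.array([[1.0, 0.0], [0.0, 2.0]])
L_prev = np.array([[0.5, -1.0], [-0.5, 1.5]])  # previous low-rank iterate
L = np.array([[0.8, 0.3], [-0.2, 1.9]])        # current low-rank iterate

Z_prev = build_Z(X, L_prev)
Z = build_Z(X, L)

# Hypothetical momentum step: extrapolate Z along its last update.
beta = 0.5
Z_mom = Z + beta * (Z - Z_prev)

# Entries where momentum made Z positive although X is zero there.
spurious = (Z_mom > 0) & (X == 0)

print("proxy loss, no momentum:  ", proxy_loss(Z, L))
print("proxy loss, with momentum:", proxy_loss(Z_mom, L))
print("true loss:                ", true_loss(X, L))
print("spurious positive entries:\n", spurious)
```

In this example the proxy and the true loss agree without momentum, while the extrapolated `Z_mom` gains a positive entry at a position where `X` is zero and the proxy no longer matches the true loss.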
Describe the solution you'd like

If we decide to make this change, it would essentially involve promoting the `model_free_util.reconstruct_X_from_L` function introduced in PR #18 to a more general location, and rewriting the `compute_loss` function in `loss_util.py` so that it always applies the reconstruction and uses the original sparse matrix, rather than the proxy.

Describe alternatives you've considered

The clear alternative is to leave the existing code unchanged, and continue using the proxy `Z - L` where it's currently used.

Additional context

Before making a change of this nature more generally, I would like to discuss it with scientific stakeholders, to see if they agree there's a motivation to make the change; and also to profile the cost of the change under various matrix sizes, to get a more quantitative sense of the performance impact under various scenarios.
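As a starting point for that discussion, the following is a rough sketch of what a reconstruction-based loss and a simple timing comparison might look like. The function names are hypothetical stand-ins (the real signatures of `reconstruct_X_from_L` and `compute_loss` are not shown in this issue), `X` is treated as a dense array for simplicity, and the Frobenius norm and the construction of `Z` are assumptions.

```python
import timeit

import numpy as np


def reconstruct_from_low_rank(L):
    # Hypothetical stand-in for reconstruct_X_from_L: elementwise ReLU.
    return np.maximum(0.0, L)


def compute_loss_proxy(Z, L):
    # Current approach as described above: norm of Z - L.
    return np.linalg.norm(Z - L)


def compute_loss_reconstructed(X, L):
    # Proposed approach: norm of the reconstruction minus the original X.
    return np.linalg.norm(reconstruct_from_low_rank(L) - X)


def time_both(n_rows, n_cols, rank=8, repeats=5, seed=0):
    # Returns mean seconds per call as (proxy, reconstructed) on random data.
    rng = np.random.default_rng(seed)
    X = np.maximum(0.0, rng.standard_normal((n_rows, n_cols)))
    L = rng.standard_normal((n_rows, rank)) @ rng.standard_normal((rank, n_cols))
    Z = np.where(X > 0, X, np.minimum(0.0, L))  # assumed construction of Z
    t_proxy = timeit.timeit(lambda: compute_loss_proxy(Z, L), number=repeats)
    t_recon = timeit.timeit(lambda: compute_loss_reconstructed(X, L), number=repeats)
    return t_proxy / repeats, t_recon / repeats


if __name__ == "__main__":
    for shape in [(200, 200), (1000, 1000)]:
        t_proxy, t_recon = time_both(*shape)
        print(f"{shape}: proxy {t_proxy:.2e} s, reconstructed {t_recon:.2e} s")
```

A real profile would need to use the library's actual sparse representation of `X` and representative matrix sizes, since the extra cost of the reconstruction may depend heavily on both.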