memeplex opened this issue 3 years ago
I've commented about this in https://gitter.im/VowpalWabbit/community also.
Is there evidence that this is helpful in practice? I'm not sure whether disabling these updates is helpful in practice. For sure, the logic behind the updates does not apply, but there is a more heuristic value to a gradient update that never overshoots, which in practice is pretty useful.
There are two existing ways to disable importance weight aware updates: by not passing importance weights in the label and by explicitly turning them off in the update rule.
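To make the overshooting point concrete, here is a minimal sketch (toy code, not VW internals), assuming half squared loss and a linear model; the closed-form step below comes from integrating the update ODE along x in the spirit of the paper, so the exact expression, function names, and constants are assumptions rather than quotes:

```python
import numpy as np

def weighted_grad_step(w, x, y, h, eta):
    """Naive update: multiply the gradient of half squared loss by the
    importance weight h. With a large h * eta the step overshoots y badly."""
    p = w @ x
    return w - eta * h * (p - y) * x

def importance_aware_step(w, x, y, h, eta):
    """Closed-form importance-aware update for half squared loss on a linear
    model, obtained by integrating the update ODE along the fixed direction x
    (following the approach of arXiv:1011.1576). The prediction moves toward y
    but never crosses it, no matter how large h is."""
    p = w @ x
    xx = x @ x
    return w + x * (y - p) / xx * (1.0 - np.exp(-h * eta * xx))

w = np.zeros(3)
x = np.array([1.0, 2.0, 0.5])
y, h, eta = 1.0, 50.0, 0.1          # a large importance weight

print("naive weighted step prediction:   ", weighted_grad_step(w, x, y, h, eta) @ x)
print("importance-aware step prediction: ", importance_aware_step(w, x, y, h, eta) @ x)
```

With h = 50 the plain weighted step pushes the prediction far past the label, while the closed-form step approaches it from below.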
> Is there evidence that this is helpful in practice?
I don't think there is. Your opinion on this is probably the world's most authoritative one. I'm just trying to raise awareness that this may be an issue, just in case it was an oversight, because I was unable to find any discussion of it in the issues/PRs that introduced lrq. But yes, in practice and as a heuristic it might still work fine.
Short description
I'm not 100% sure about my reasoning here, but in Importance Weight Aware Updates (https://arxiv.org/pdf/1011.1576.pdf) it's stated (paraphrasing) that for a linear model the prediction is p = w·x, so the gradient of the loss l(p, y) with respect to w is l'(p, y)·x, i.e. always a scalar multiple of the input x, and therefore the cumulative effect of repeatedly updating on the same example stays along the direction x and can be collapsed into a single closed-form update that accounts for the importance weight.
But AFAICS this is not true for factorization machine models, which are not linear in the parameters, since they involve products of weights. At first sight I don't see an immediate way to adapt the derivations in that paper to a factorization machine model.
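Here's a small sketch of why I think the closed form doesn't carry over directly (toy code with a generic 2-way FM parameterization; the function names and the specific FM form are my assumptions, not VW's --lrq internals): for a linear model the gradient of the prediction with respect to the weights is always x, regardless of the current weights, while for the FM factors it depends on the current factor matrix, so repeated updates on one example are no longer confined to a single fixed direction.

```python
import numpy as np

rng = np.random.default_rng(0)

def fm_predict(w, V, x):
    """2-way factorization machine: w.x + sum_{i<j} <V[i], V[j]> x_i x_j,
    computed with the usual O(nk) identity."""
    s = V.T @ x                                      # shape (k,)
    return w @ x + 0.5 * (s @ s - np.sum((x ** 2) @ (V ** 2)))

def fm_grad_V(V, x):
    """Gradient of the FM prediction with respect to the factor matrix V:
    d p / d V[i, f] = x_i * (s_f - V[i, f] * x_i). It depends on V itself,
    unlike the linear part, whose gradient with respect to w is just x."""
    s = V.T @ x
    return np.outer(x, s) - (x ** 2)[:, None] * V

x = rng.normal(size=5)
w = rng.normal(size=5)
V1 = rng.normal(size=(5, 3))
V2 = V1 + 0.5 * rng.normal(size=(5, 3))              # "V after a few updates"

g1 = fm_grad_V(V1, x).ravel()
g2 = fm_grad_V(V2, x).ravel()
cos = g1 @ g2 / (np.linalg.norm(g1) * np.linalg.norm(g2))
print("prediction at V1:", fm_predict(w, V1, x))
print("cosine between gradient directions at V1 and V2:", cos)   # != 1 in general

# For a linear model, d p / d w = x at any w, so repeated updates on one example
# move along the single fixed direction x; that is what lets the paper collapse
# h updates into one closed-form step. For the FM factors the direction itself
# changes as V moves, so the same one-dimensional reduction does not go through.
```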
How this suggestion will help you/others
I usually work with very unbalanced samples from the adtech industry, so I heavily downsample negative examples. What I get is an estimate of the (original estimate of the real) loss, with inverse probability weighting, Horvitz–Thompson style. If the weights are integers, both

w1 l1 + ... + wn ln

and

(l1 + ... + l1) + ... + (ln + ... + ln)

(each li repeated wi times) are equivalent, and I could do SGD on either one, but I expect somewhat slow convergence in both cases. This is why importance weight aware updates are important to me. But now I'm moving to factorization machine models and I'm afraid that these updates may not be compatible with them. At least the "multiplying gradients by weights" approach seems correct, even if it is not the most efficient one.
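To make the setup concrete, here is a toy sketch of the downsampling plus Horvitz-Thompson weighting and of the "multiply the gradient by the weight" fallback (assumptions: logistic loss, a plain linear model for simplicity, made-up keep probability and names; none of this is VW code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced binary data: roughly 1% positives.
n = 200_000
X = rng.normal(size=(n, 4))
y = (rng.random(n) < 0.01).astype(float)

# Keep every positive, keep each negative with probability p_keep, and give the
# kept negatives inverse-probability (Horvitz-Thompson) weights 1 / p_keep.
p_keep = 0.05
keep = (y == 1) | (rng.random(n) < p_keep)
Xs, ys = X[keep], y[keep]
wts = np.where(ys == 1, 1.0, 1.0 / p_keep)

def logloss(z, y):
    """Per-example logistic loss with labels in {0, 1}."""
    return np.log1p(np.exp(-z * (2 * y - 1)))

theta = 0.1 * rng.normal(size=4)
full_total = logloss(X @ theta, y).sum()
ht_total = (wts * logloss(Xs @ theta, ys)).sum()
print("full-sample total loss:         ", full_total)
print("weighted downsampled total (HT):", ht_total)   # unbiased estimate of the above

def weighted_sgd_step(theta, x, y, w, eta=0.05):
    """Fallback 'multiply the gradient by the importance weight' update for
    logistic loss. Correct in expectation, but a large w * eta can make the
    step overshoot, which is what importance-aware updates avoid for linear
    models."""
    p = 1.0 / (1.0 + np.exp(-(theta @ x)))
    grad = (p - y) * x                 # gradient of logistic loss w.r.t. theta
    return theta - eta * w * grad

theta = weighted_sgd_step(theta, Xs[0], ys[0], wts[0])
```

The weighted total over the downsampled set is an unbiased estimate of the full-sample total, which is why the weights have to enter the updates one way or another.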