VowpalWabbit / vowpal_wabbit

Vowpal Wabbit is a machine learning system which pushes the frontier of machine learning with techniques such as online, hashing, allreduce, reductions, learning2search, active, and interactive learning.
https://vowpalwabbit.org

Disable weight aware updates for lrq by default #3220

Open · memeplex opened this issue 3 years ago

memeplex commented 3 years ago

Short description

I'm not 100% sure about my reasoning here, but in "Importance Weight Aware Updates" (https://arxiv.org/pdf/1011.1576.pdf) it's stated that:

> In this paper we focus on linear models i.e. p = <w, x> where w is a vector of weights

and therefore:

> all gradients of a given example point to the same direction and only differ in magnitude.

But AFAICS this is not true for factorization machine models, which are not linear in the parameters, since they involve products of weights. At first sight I don't see an immediate way to adapt the development in that paper to a factorization machine model.
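To make the contrast concrete, here is a rough sketch (my notation, not the paper's or VW's internals). For a linear model the gradient always points along the example's feature vector:

$$
p = \langle w, x \rangle, \qquad \nabla_w\, \ell(p, y) = \ell'(p, y)\, x,
$$

so changing the importance weight only rescales the step along the fixed direction $x$. With an lrq-style interaction term the prediction involves products of parameters, roughly

$$
p = \langle w, x \rangle + \sum_{i,j} \langle l_i, r_j \rangle\, x_i x_j, \qquad \nabla_{l_i}\, \ell(p, y) = \ell'(p, y) \sum_j x_i x_j\, r_j,
$$

and the direction of $\nabla_{l_i}$ depends on the $r_j$, which themselves move during the update, so the gradients of a given example no longer share a direction.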

How this suggestion will help you/others

I usually work with very unbalanced samples from the adtech industry, so I heavily downsample negative examples. What I get is an estimate of the (original estimate of the real) loss, with inverse probability weighting, Horvitz–Thompson style. If the weights are integers, both w1 l1 + ... + wn ln and (l1 + ... + l1) + ... + (ln + ... + ln), with each li repeated wi times, are equivalent, and I could do SGD on one or the other, but I expect somewhat slow convergence in both cases.
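For concreteness, this is roughly what my data looks like: negatives downsampled at, say, rate 1/10 get importance weight 10 through VW's standard label weight field (`label [weight] | features`; the feature names here are made up):

```text
1 1 |f user_7 ad_3
-1 10 |f user_2 ad_9
```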

This is why weight aware updates are important to me. But now I'm moving to factorization machine models, and I'm afraid that these updates may not be compatible with them. At least the "multiply the gradient by the weight" approach seems correct, even if not the most efficient.
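To be explicit, by "multiply the gradient by the weight" I mean the plain scaled SGD step (my notation; $h$ is the importance weight),

$$
\theta \leftarrow \theta - \eta\, h\, \ell'(p, y)\, \nabla_\theta\, p,
$$

which remains a valid use of the weight even for models that are not linear in the parameters, but can badly overshoot for large $h$, which is exactly what the importance weight aware updates were designed to avoid.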

memeplex commented 3 years ago

I've commented about this in https://gitter.im/VowpalWabbit/community also.

JohnLangford commented 3 years ago

Is there evidence that this is helpful in practice? I'm not sure whether disabling things helps. For sure, the logic behind the updates does not apply, but there is a more heuristic value associated with a gradient update never overshooting, which in practice is pretty useful.

There are two existing ways to disable importance weight aware updates: by not passing importance weights in the label and by explicitly turning them off in the update rule.
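Concretely, for the second option: VW defaults to `--adaptive --normalized --invariant`, and if you pass any of these flags explicitly, only the ones you pass are used. So something like the following should leave the invariant (importance weight aware) part off (the dataset name is illustrative):

```sh
# Default run: adaptive, normalized and invariant updates are all on.
vw --lrq ab4 -d train.dat

# Explicitly requesting only adaptive and normalized updates should
# disable the invariant (importance weight aware) part.
vw --lrq ab4 --adaptive --normalized -d train.dat
```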

memeplex commented 3 years ago

> Is there evidence that this is helpful in practice?

I don't think there is. Your opinion on this is probably the world's most authoritative one. I'm just trying to raise awareness that this may be an issue, in case it was an oversight, because I was unable to find any discussion in the issues/PRs that introduced lrq. But yes, in practice and as a heuristic it might still work fine.