memeplex opened this issue 3 years ago
I've commented about this in https://gitter.im/VowpalWabbit/community also.
Is there evidence that this is helpful in practice? I'm not sure whether disabling these updates is helpful in practice. For sure, the logic behind the updates does not apply, but there is a more heuristic value to a gradient update that never overshoots, which in practice is pretty useful.
There are two existing ways to disable importance weight aware updates: by not passing importance weights in the label and by explicitly turning them off in the update rule.
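To make the overshooting point concrete, here is a minimal sketch (toy code, not VW internals), assuming half squared loss and a linear model; the closed-form step below comes from integrating the update ODE along x in the spirit of the paper, so the exact expression, function names, and constants are assumptions rather than quotes:

```python
import numpy as np

def weighted_grad_step(w, x, y, h, eta):
    """Naive update: multiply the gradient of half squared loss by the
    importance weight h. With a large h * eta the step overshoots y badly."""
    p = w @ x
    return w - eta * h * (p - y) * x

def importance_aware_step(w, x, y, h, eta):
    """Closed-form importance-aware update for half squared loss on a linear
    model, obtained by integrating the update ODE along the fixed direction x
    (following the approach of arXiv:1011.1576). The prediction moves toward y
    but never crosses it, no matter how large h is."""
    p = w @ x
    xx = x @ x
    return w + x * (y - p) / xx * (1.0 - np.exp(-h * eta * xx))

w = np.zeros(3)
x = np.array([1.0, 2.0, 0.5])
y, h, eta = 1.0, 50.0, 0.1          # a large importance weight

print("naive weighted step prediction:   ", weighted_grad_step(w, x, y, h, eta) @ x)
print("importance-aware step prediction: ", importance_aware_step(w, x, y, h, eta) @ x)
```

With h = 50 the plain weighted step pushes the prediction far past the label, while the closed-form step approaches it from below.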
> Is there evidence that this is helpful in practice?
I don't think there is. Your opinion on this is probably the world's most authoritative one. I'm just trying to raise awareness that this may be an issue, just in case it was an oversight, because I was unable to find any discussion of it in the issues/PRs that introduced lrq. But yes, in practice and as a heuristic it might still work fine.
Short description
I'm not 100% sure about my reasoning here, but in Importance Weight Aware Updates (https://arxiv.org/pdf/1011.1576.pdf) it's stated (paraphrasing) that for a linear model the prediction is p = w·x, so the gradient of the loss l(p, y) with respect to w is l'(p, y)·x, i.e. always a scalar multiple of the input x, and therefore the cumulative effect of repeatedly updating on the same example stays along the direction x and can be collapsed into a single closed-form update that accounts for the importance weight.
But AFAICS this is not true for factorization machine models, which are not linear in the parameters, since they involve products of weights. At first sight I don't see an immediate way to adapt the derivations in that paper to a factorization machine model.
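Here's a small sketch of why I think the closed form doesn't carry over directly (toy code with a generic 2-way FM parameterization; the function names and the specific FM form are my assumptions, not VW's --lrq internals): for a linear model the gradient of the prediction with respect to the weights is always x, regardless of the current weights, while for the FM factors it depends on the current factor matrix, so repeated updates on one example are no longer confined to a single fixed direction.

```python
import numpy as np

rng = np.random.default_rng(0)

def fm_predict(w, V, x):
    """2-way factorization machine: w.x + sum_{i<j} <V[i], V[j]> x_i x_j,
    computed with the usual O(nk) identity."""
    s = V.T @ x                                      # shape (k,)
    return w @ x + 0.5 * (s @ s - np.sum((x ** 2) @ (V ** 2)))

def fm_grad_V(V, x):
    """Gradient of the FM prediction with respect to the factor matrix V:
    d p / d V[i, f] = x_i * (s_f - V[i, f] * x_i). It depends on V itself,
    unlike the linear part, whose gradient with respect to w is just x."""
    s = V.T @ x
    return np.outer(x, s) - (x ** 2)[:, None] * V

x = rng.normal(size=5)
w = rng.normal(size=5)
V1 = rng.normal(size=(5, 3))
V2 = V1 + 0.5 * rng.normal(size=(5, 3))              # "V after a few updates"

g1 = fm_grad_V(V1, x).ravel()
g2 = fm_grad_V(V2, x).ravel()
cos = g1 @ g2 / (np.linalg.norm(g1) * np.linalg.norm(g2))
print("prediction at V1:", fm_predict(w, V1, x))
print("cosine between gradient directions at V1 and V2:", cos)   # != 1 in general

# For a linear model, d p / d w = x at any w, so repeated updates on one example
# move along the single fixed direction x; that is what lets the paper collapse
# h updates into one closed-form step. For the FM factors the direction itself
# changes as V moves, so the same one-dimensional reduction does not go through.
```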
How this suggestion will help you/others
I usually work with very unbalanced samples from the adtech industry, so I heavily downsample negative examples. What I get is an estimate of the (original estimate of the real) loss, with inverse probability weighting, Horvitz–Thompson style. If the weights are integers, both

w1 l1 + ... + wn ln

and

(l1 + ... + l1) + ... + (ln + ... + ln)

(each li repeated wi times) are equivalent, and I could do SGD on either one, but I expect somewhat slow convergence in both cases. This is why importance weight aware updates are important to me. But now I'm moving to factorization machine models and I'm afraid that these updates may not be compatible with them. At least the "multiplying gradients by weights" approach seems correct, even if it is not the most efficient one.
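To make the setup concrete, here is a toy sketch of the downsampling plus Horvitz-Thompson weighting and of the "multiply the gradient by the weight" fallback (assumptions: logistic loss, a plain linear model for simplicity, made-up keep probability and names; none of this is VW code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced binary data: roughly 1% positives.
n = 200_000
X = rng.normal(size=(n, 4))
y = (rng.random(n) < 0.01).astype(float)

# Keep every positive, keep each negative with probability p_keep, and give the
# kept negatives inverse-probability (Horvitz-Thompson) weights 1 / p_keep.
p_keep = 0.05
keep = (y == 1) | (rng.random(n) < p_keep)
Xs, ys = X[keep], y[keep]
wts = np.where(ys == 1, 1.0, 1.0 / p_keep)

def logloss(z, y):
    """Per-example logistic loss with labels in {0, 1}."""
    return np.log1p(np.exp(-z * (2 * y - 1)))

theta = 0.1 * rng.normal(size=4)
full_total = logloss(X @ theta, y).sum()
ht_total = (wts * logloss(Xs @ theta, ys)).sum()
print("full-sample total loss:         ", full_total)
print("weighted downsampled total (HT):", ht_total)   # unbiased estimate of the above

def weighted_sgd_step(theta, x, y, w, eta=0.05):
    """Fallback 'multiply the gradient by the importance weight' update for
    logistic loss. Correct in expectation, but a large w * eta can make the
    step overshoot, which is what importance-aware updates avoid for linear
    models."""
    p = 1.0 / (1.0 + np.exp(-(theta @ x)))
    grad = (p - y) * x                 # gradient of logistic loss w.r.t. theta
    return theta - eta * w * grad

theta = weighted_sgd_step(theta, Xs[0], ys[0], wts[0])
```

The weighted total over the downsampled set is an unbiased estimate of the full-sample total, which is why the weights have to enter the updates one way or another.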