It looks like you are trying to combine -i and --feature_mask which wasn't really thought about. It's a reasonable thing to do, but I'm not sure the code has a consistent or easily described semantics for that at present (feel free to send a patch for one).
The purpose of feature_mask is to make it so l1 regularization can be used more effectively in a 2-pass manner. For example, if I do:
vw rcv1.train.raw.txt.gz --binary --readable_model foo.text --l1 1e-6 && wc -l foo.text
...
average loss = 0.058352
...
9201 foo.text
And then do:
vw rcv1.train.raw.txt.gz --binary --readable_model foo.text --feature_mask save_model && wc -l foo.text
...
average loss = 0.053408
...
9201 foo.text
where using feature_mask controls the number of nonzero parameters after we turn off l1 regularization. We see the error rate has declined notably while preserving sparseness (which, with the sparse model changes, can now be achieved even at runtime).
Compare with single-pass training using no regularization:
~/programs/vowpal_wabbit/vowpalwabbit/vw rcv1.train.raw.txt.gz --binary --readable_model foo.text && wc -l foo.text
...
average loss = 0.057522
...
35547 foo.text
A bigger model that performs worse.
@JohnLangford is that an apples-to-apples comparison?
Note the added --binary, which I suspect is the main reason for the improvement. In my experience, adding --binary alone (w/o a feature mask) on many binary-class data-sets improves the average loss significantly, since predictions are forced to {-1, +1} before the loss is calculated.
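A toy numeric illustration of that effect (made-up numbers, assuming the default squared loss and {-1, +1} labels, not taken from the runs above):

# Toy illustration of why --binary lowers the reported average loss
# (made-up numbers; assumes the default squared loss and {-1, +1} labels).

def squared_loss(prediction, label):
    return (prediction - label) ** 2

def binary_loss(prediction, label):
    # --binary thresholds the raw prediction to -1/+1 before computing loss
    thresholded = 1.0 if prediction > 0 else -1.0
    return 0.0 if thresholded == label else 1.0

raw_prediction, label = 0.3, 1.0
print(squared_loss(raw_prediction, label))  # 0.49: penalized despite the correct sign
print(binary_loss(raw_prediction, label))   # 0.0:  counted as a correct prediction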
A related shrinkage feature I would love to have is the ability to trim the last (lowest) N weights (or the last P percent of weights) of a model, plus the ability to multiply all weights by some constant (e.g. to make them all shrink proportionally towards zero) before the final long-tail trim.
vw on-disk model files are already stored in 'dense' form in descending weight order, so this should be easy to do (I have a python script that does it after the fact). Note that --l1 by itself doesn't "hard trim" the smallest weights. I have models in production that are updated forever with new data. For models that are permanently updated, low weights (and zero weights for features never seen before) tend to monotonically increase (and accumulate) over time until they 'clog' the model and make it less effective, so real/hard trimming seems like a useful feature to have.
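For what it's worth, a minimal sketch of that kind of after-the-fact trimming/shrinking, operating on a --readable_model dump (without --invert_hash, so weight lines look like hash:weight); the file names, keep_fraction and shrink_factor are illustrative, not vw options:

# Minimal sketch of after-the-fact trimming/shrinking of a vw --readable_model
# dump (without --invert_hash, so weight lines look like "<hash>:<weight>").
# keep_fraction and shrink_factor are illustrative knobs, not vw options.
import re

WEIGHT_LINE = re.compile(r'^(\d+):([-+0-9.eE]+)\s*$')

def trim_and_shrink(in_path, out_path, keep_fraction=0.9, shrink_factor=0.8):
    header, weights = [], []
    with open(in_path) as f:
        for line in f:
            m = WEIGHT_LINE.match(line)
            if m:
                weights.append((m.group(1), float(m.group(2))))
            else:
                header.append(line)  # pass the model header through unchanged

    # drop the long tail of lowest-magnitude weights, then shrink the rest
    weights.sort(key=lambda kv: abs(kv[1]), reverse=True)
    kept = weights[:int(len(weights) * keep_fraction)]

    with open(out_path, 'w') as f:
        f.writelines(header)
        for feature_hash, w in kept:
            f.write('{}:{:g}\n'.format(feature_hash, w * shrink_factor))

# trim_and_shrink('foo.text', 'foo.trimmed.text')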
Does this make sense to you?
re: --binary, that was a cut&paste fail; --binary was used.
My experience with long running models is that a gentle amount of --l1 addresses the issue. Why wouldn't that be the case?
I have contemplated adding a mode where you learn 2 regressors: one with --l1 and another without --l1 but with a feature mask imposed by the first. This should allow you to impose a somewhat strong --l1 without impairing prediction performance.
Thanks John.
Background: my use case is a bit unusual: I use vw (very successfully, BTW) to do anomaly detection on a large scale with a very large number of "one-hot-encoded" features. Since anomaly detection is by its nature an unsupervised learning problem, I learn against a constant single label and use --noconstant. Big relative errors vs. the expected label are considered anomalies. I also use --l1.
My models are extremely sparse when I start. Over time, though, rare features appear and their weights drift up so that they sum to my constant label. Worse: model sparsity declines until hash collisions become a problem. Many anomaly-indicating features (zero weight in the current model) appear rarely, but weights always go up (as expected). This forces me to hard-trim the long tail of low-weighted features, plus shrink all weights (including the large ones) down on a regular basis, to preserve sparsity. --l1 alone doesn't seem to work well enough. Now that I think of it, I may need to use a larger --l1 value.
I think the weakness of --l1 is that it only trims new features on the fly. Ones that already have some weight in the dense, on-disk model, even when that weight is very low, stick in the model forever.
--l1 only trims while learning is enabled, but as long as learning is enabled it should gradually reduce all weights towards 0. See here: https://github.com/JohnLangford/vowpal_wabbit/blob/master/vowpalwabbit/gd.h#L130 and here: https://github.com/JohnLangford/vowpal_wabbit/blob/master/vowpalwabbit/gd.cc#L579
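For intuition, the truncation behind that first link amounts to roughly the following (a sketch of the idea, not a line-for-line port of gd.h):

# Rough sketch of the --l1 truncation idea (not a line-for-line port of gd.h):
# each weight is pulled toward 0 by an accumulated "gravity" amount, and
# weights smaller than that amount are zeroed outright.
import math

def trunc_weight(w, gravity):
    if gravity < abs(w):
        return w - math.copysign(gravity, w)
    return 0.0

print(trunc_weight(0.0005, 0.001))  # 0.0   -> tiny weights do get hard-zeroed
print(trunc_weight(0.5000, 0.001))  # 0.499 -> large weights only shrink a bit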
--l2 operates as a multiplier on the weights, making them all smaller.
For efficiency reasons, this is imposed on the prediction rather than the weights, here: https://github.com/JohnLangford/vowpal_wabbit/blob/master/vowpalwabbit/gd.cc#L344
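A sketch of that idea (illustrative names, not vw's internals): instead of rescaling every weight after each example, keep one running shrink factor and fold it into the prediction:

# Sketch of applying --l2 shrinkage to the prediction instead of the weights
# (illustrative names, not vw's internals).

class L2Shrink:
    def __init__(self):
        self.contraction = 1.0              # running product of shrink factors

    def update(self, shrink_per_example):
        # conceptually the same as multiplying every weight by (1 - shrink)
        self.contraction *= (1.0 - shrink_per_example)

    def predict(self, weights, example):
        dot = sum(weights.get(f, 0.0) * x for f, x in example.items())
        return dot * self.contraction       # equal to predicting with shrunk weights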
John, thanks so much for the pointers to the source and the explanations! It definitely looks like my --l1 was too small.
When I get more time, I'm thinking of cleaning up my "after the fact" script that trims and shrinks weights on finalized models and contributing it. It may be useful to others...
Hello,
I decided to play with the --feature_mask functionality today and took a look into the code to find a place where I could adjust it for my purposes. I was quite intrigued by this piece of code, as I was going to use both -i and --feature_mask. The wiki doesn't provide much detail on how --feature_mask works, but my expectation was that it locks every feature that is 0 in the mask from updating; thus, together with -i, it masks the initial regressor.
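In pseudo-code, the behaviour I expected (not necessarily what vw actually implements) is roughly:

# Pseudo-code for the --feature_mask semantics I expected (not necessarily
# what vw actually implements): a weight may only be updated if the same
# feature has a nonzero weight in the mask regressor.

def masked_update(weights, mask, feature, delta):
    if mask.get(feature, 0.0) != 0.0:   # feature enabled by the mask
        weights[feature] = weights.get(feature, 0.0) + delta
    # otherwise the weight stays frozen, e.g. at its -i initial_regressor value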
it masks initial regressor. So I got confused with theI missed recent changes in VW's architecture regarding sparse weights thus I spend some time trying to understand new code. Can't say I fully understand it but I still sure something going wrong with it. Then I switched to tests.
Example:
This generated an empty model and all weights are 0. Let's test it.
It blocked the regressor from learning, as expected. Let's train a new model:
And combine it with our mask:
It learns! While it isn't supposed to. I would say that --feature_mask is completely ignored in the presence of -i... but it's not: note that the average loss changes by 1e-6. That's not just a rounding error. If you play with RunTests no. 30-32:
vs
The resulting average-loss difference is significant, and I don't understand why. Perhaps that's a result of the model headers interlacing while the mask regressor's weights are ignored. So I end up with this, and I'm not even sure whether it's a bug or I just misunderstand how it's supposed to work.