VowpalWabbit / vowpal_wabbit

Vowpal Wabbit is a machine learning system which pushes the frontier of machine learning with techniques such as online, hashing, allreduce, reductions, learning2search, active, and interactive learning.
https://vowpalwabbit.org

Confused by --feature_mask a.model -i b.model #1217

Closed trufanov-nok closed 7 years ago

trufanov-nok commented 7 years ago

Hello,

I decided to play with the --feature_mask functionality today and took a look into the code to find a place where I could adjust it for my purposes. I was quite intrigued by this piece of code, since I was going to use both -i and --feature_mask. The wiki doesn't provide many details on how --feature_mask works, but my expectation was that it locks every feature that is 0 in the mask so that it is never updated; thus, together with -i, it masks the initial regressor. So I was confused by the

// Re-zero the weights, in case weights of initial regressor use different indices
all.weights.set_zero(0);
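
For reference, this is roughly the behavior I expected from --feature_mask combined with -i (a minimal Python sketch of my mental model, with made-up names, not VW's actual code):

# Minimal sketch of the semantics I expected, not VW's implementation.
# `mask` would come from the --feature_mask model, `weights` from the -i model.

def masked_sgd_update(weights, mask, features, error, learning_rate):
    """Update only the weights whose mask entry is nonzero."""
    for index, value in features:           # features: [(hash_index, feature_value), ...]
        if mask.get(index, 0.0) != 0.0:     # zero in the mask => weight stays frozen
            weights[index] = weights.get(index, 0.0) - learning_rate * error * value
    return weights

# With an all-zero mask nothing should ever be updated, so training on top of
# an -i model should leave it unchanged.
weights = {3: 0.5, 7: -0.2}                  # loaded via -i
mask = {}                                    # mask.model with all weights == 0
weights = masked_sgd_update(weights, mask, [(3, 1.0), (9, 2.0)], error=0.4, learning_rate=0.5)
assert weights == {3: 0.5, 7: -0.2}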

I had missed the recent changes in VW's architecture regarding sparse weights, so I spent some time trying to understand the new code. I can't say I fully understand it, but I'm still sure something is going wrong with it. Then I switched to tests.

Example:

$ ../vowpalwabbit/vw -d train-sets/0001.dat -f models/mask.model --redefine a:=: --ignore a --noconstant --initial_weight 0
ignoring namespaces beginning with: a 
final_regressor = models/mask.model
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = train-sets/0001.dat
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
1.000000 1.000000            1            1.0   1.0000   0.0000        0
0.500000 0.000000            2            2.0   0.0000   0.0000        0
0.250000 0.000000            4            4.0   0.0000   0.0000        0
0.250000 0.250000            8            8.0   0.0000   0.0000        0
0.312500 0.375000           16           16.0   1.0000   0.0000        0
0.343750 0.375000           32           32.0   0.0000   0.0000        0
0.359375 0.375000           64           64.0   0.0000   0.0000        0
0.414062 0.468750          128          128.0   1.0000   0.0000        0

finished run
number of examples per pass = 200
passes used = 1
weighted example sum = 200.000000
weighted label sum = 91.000000
average loss = 0.455000
best constant = 0.455000
best constant's loss = 0.247975
total feature number = 0

This generated an empty model in which all weights are 0. Let's test it.

$ ../vowpalwabbit/vw -d train-sets/0001.dat --feature_mask models/mask.model
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = train-sets/0001.dat
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
1.000000 1.000000            1            1.0   1.0000   0.0000       51
0.500000 0.000000            2            2.0   0.0000   0.0000      104
0.250000 0.000000            4            4.0   0.0000   0.0000      135
0.250000 0.250000            8            8.0   0.0000   0.0000      146
0.312500 0.375000           16           16.0   1.0000   0.0000       24
0.343750 0.375000           32           32.0   0.0000   0.0000       32
0.359375 0.375000           64           64.0   0.0000   0.0000       61
0.414062 0.468750          128          128.0   1.0000   0.0000      106

finished run
number of examples per pass = 200
passes used = 1
weighted example sum = 200.000000
weighted label sum = 91.000000
average loss = 0.455000
best constant = 0.455000
best constant's loss = 0.247975
total feature number = 15482

It blocked the regressor from learning, as expected. Let's train a new model:

$ ../vowpalwabbit/vw -d train-sets/0001.dat -f models/mask2.model
final_regressor = models/mask2.model
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = train-sets/0001.dat
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
1.000000 1.000000            1            1.0   1.0000   0.0000       51
0.513618 0.027236            2            2.0   0.0000   0.1650      104
0.263121 0.012624            4            4.0   0.0000   0.0569      135
0.237739 0.212356            8            8.0   0.0000   0.2024      146
0.242021 0.246303           16           16.0   1.0000   0.3249       24
0.235878 0.229736           32           32.0   0.0000   0.2256       32
0.230921 0.225964           64           64.0   0.0000   0.1601       61
0.223511 0.216101          128          128.0   1.0000   0.8308      106

finished run
number of examples per pass = 200
passes used = 1
weighted example sum = 200.000000
weighted label sum = 91.000000
average loss = 0.195760
best constant = 0.455000
best constant's loss = 0.247975
total feature number = 15482

And combine it with our mask:

$ ../vowpalwabbit/vw -d train-sets/0001.dat --feature_mask models/mask.model -i models/mask2.model
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = train-sets/0001.dat
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
0.000000 0.000000            1            1.0   1.0000   1.0000       51
0.135919 0.271838            2            2.0   0.0000   0.5214      104
0.103277 0.070636            4            4.0   0.0000   0.1259      135
0.067830 0.032383            8            8.0   0.0000   0.0000      146
0.048773 0.029715           16           16.0   1.0000   0.8732       24
0.033231 0.017688           32           32.0   0.0000   0.1573       32
0.028492 0.023754           64           64.0   0.0000   0.0538       61
0.024258 0.020024          128          128.0   1.0000   0.8495      106

finished run
number of examples per pass = 200
passes used = 1
weighted example sum = 200.000000
weighted label sum = 91.000000
average loss = 0.021872
best constant = 0.455000
best constant's loss = 0.247975
total feature number = 15482

It learns, even though it isn't supposed to. I would say that --feature_mask is completely ignored in the presence of -i... but it's not:

$ ../vowpalwabbit/vw -d train-sets/0001.dat -i models/mask2.model    
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = train-sets/0001.dat
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
0.000000 0.000000            1            1.0   1.0000   1.0000       51
0.135919 0.271838            2            2.0   0.0000   0.5214      104
0.103277 0.070636            4            4.0   0.0000   0.1259      135
0.067830 0.032383            8            8.0   0.0000   0.0000      146
0.048773 0.029715           16           16.0   1.0000   0.8732       24
0.033231 0.017688           32           32.0   0.0000   0.1573       32
0.028492 0.023754           64           64.0   0.0000   0.0538       61
0.024258 0.020024          128          128.0   1.0000   0.8495      106

finished run
number of examples per pass = 200
passes used = 1
weighted example sum = 200.000000
weighted label sum = 91.000000
average loss = 0.021871
best constant = 0.455000
best constant's loss = 0.247975
total feature number = 15482

Note that the average loss changed by 1e-6. It's not just a rounding mistake. If you play with RunTests no. 30-32:

../vowpalwabbit/vw -d train-sets/0001.dat -f models/mask.model --invert_hash mask.predict --l1 0.01
../vowpalwabbit/vw -d train-sets/0001.dat --invert_hash remask.predict --feature_mask models/mask.model -f models/remask.model
../vowpalwabbit/vw -d train-sets/0001.dat --feature_mask models/mask.model -i models/remask.model

vs

../vowpalwabbit/vw -d train-sets/0001.dat -f models/mask.model --invert_hash mask.predict --l1 0.01
../vowpalwabbit/vw -d train-sets/0001.dat --invert_hash remask.predict --feature_mask models/mask.model -f models/remask.model
../vowpalwabbit/vw -d train-sets/0001.dat -i models/remask.model

The resulting average loss difference is significant. I don't understand why. Perhaps it's a result of the model headers getting mixed while the regressor mask weights are ignored. So I end up with this, and I'm not even sure whether it's a bug or I just misunderstand how it's supposed to work.

JohnLangford commented 7 years ago

It looks like you are trying to combine -i and --feature_mask, which wasn't really thought about. It's a reasonable thing to do, but I'm not sure the code has consistent or easily described semantics for that at present (feel free to send a patch for one).

The purpose of feature_mask is to make it so l1 regularization can be used more effectively in a 2-pass manner. For example, if I do:

vw rcv1.train.raw.txt.gz --binary --readable_model foo.text --l1 1e-6 && wc -l foo.text
... average loss = 0.058352 ... 9201 foo.text

And then do:

vw rcv1.train.raw.txt.gz --binary --readable_model foo.text --feature_mask save_model && wc -l foo.text
... average loss = 0.053408 ... 9201 foo.text

Here feature_mask controls the number of nonzero parameters after we turn off l1 regularization. We see that the error rate has declined notably while preserving sparseness (which, with the sparse model changes, can now be achieved even at runtime).

Compare with single-pass training using no regularization:

~/programs/vowpal_wabbit/vowpalwabbit/vw rcv1.train.raw.txt.gz --binary --readable_model foo.text && wc -l foo.text
... average loss = 0.057522 ... 35547 foo.text

A bigger model that performs worse.

arielf commented 7 years ago

@JohnLangford is that an apples-to-apples comparison? Note the added --binary, which I suspect is the main reason for the improvement.

In my experience, adding --binary alone (without a feature mask) on many binary-class data sets improves the average loss significantly, since predictions are forced to {-1, +1} before the loss is calculated.

A related shrinkage feature I would love to have is the ability to trim the last (lowest) N weights (or the last P percent of weights) of a model, plus the ability to multiply all weights by some constant (e.g. to shrink them all proportionally towards zero) before the final long-tail trim.

vw on-disk model files are already stored in 'dense' form in descending weight order, so this should be easy to do (I have a python script that does it after the fact). Note that --l1 by itself doesn't "hard trim" the smallest weights. I have models in production that are updated forever with new data. For models that are permanently updated, low weights (and zero weights for features never seen before) tend to monotonically increase (and accumulate) over time until they 'clog' the model and make it less effective, so real/hard trimming seems like a useful feature to have.
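
To illustrate, the kind of post-hoc trim-and-shrink I mean looks roughly like this (a rough Python sketch working on a --readable_model text dump rather than the binary model file; the file names, keep-top-N policy, and 0.9 shrink factor are just placeholders):

# Rough sketch of the post-hoc trim + shrink described above, operating on a
# --readable_model dump with index:weight lines. Header handling is simplified;
# a real script would skip the readable-model preamble explicitly.

def trim_and_shrink(lines, keep_top_n=10000, shrink=0.9):
    """Keep the keep_top_n largest-magnitude weights and scale them all by shrink."""
    parsed = []
    for line in lines:
        prefix, sep, weight = line.strip().rpartition(':')
        if not sep:
            continue                          # not an index:weight line
        try:
            parsed.append((prefix, float(weight)))
        except ValueError:
            continue                          # non-numeric header line
    parsed.sort(key=lambda item: abs(item[1]), reverse=True)
    return ['%s:%g' % (prefix, weight * shrink) for prefix, weight in parsed[:keep_top_n]]

with open('model.readable') as src, open('model.trimmed', 'w') as dst:
    dst.write('\n'.join(trim_and_shrink(src.readlines())) + '\n')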

Does this make sense to you?

JohnLangford commented 7 years ago

Re: --binary, that was a cut & paste fail. --binary was used.

My experience with long running models is that a gentle amount of --l1 addresses the issue. Why wouldn't that be the case?

I have contemplated adding a mode where you learn 2 regressors: one with --l1 and another without --l1 but with a feature mask imposed by the first. This should allow you to impose a somewhat strong --l1 without impairing prediction performance.

arielf commented 7 years ago

Thanks John.

Background: my use case is a bit unusual. I use vw (very successfully, BTW) to do anomaly detection on a large scale with a very large number of "one-hot-encoded" features. Since anomaly detection is by its nature an unsupervised learning problem, I learn with a single constant label and --noconstant. Big relative errors vs. the expected label are considered anomalies. I also use --l1.
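
As a toy illustration of the flagging step (not my production code; the constant label, threshold, and file name here are arbitrary):

# Toy sketch of the anomaly-flagging step: every example is trained against a
# constant label (1.0 here), and predictions written by vw -p that deviate from
# that label by a large relative error are flagged. Threshold and paths are made up.

EXPECTED_LABEL = 1.0
THRESHOLD = 0.5

with open('predictions.txt') as f:           # e.g. output of: vw -t -i some.model -p predictions.txt ...
    for line_number, line in enumerate(f, start=1):
        prediction = float(line.split()[0])
        relative_error = abs(prediction - EXPECTED_LABEL) / EXPECTED_LABEL
        if relative_error > THRESHOLD:
            print('example %d looks anomalous (prediction %.4f)' % (line_number, prediction))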

My models are extremely sparse when I start. Over time, though, rare features appear and their weights drift up to sum to my constant label. Worse: model sparsity declines until hash collisions become a problem. Many anomaly-indicating features (zero weight in the current model) appear rarely, but weights always go up (as expected). This forces me to hard-trim the long tail of low-weighted features, plus shrink all weights (including the large ones) on a regular basis to preserve sparsity. --l1 alone doesn't seem to work well enough. Now that I think of it, I may need to use a larger --l1 value.

I think the weakness of --l1 is that it only trims new features on the fly. Features that already have some weight in the dense, on-disk model, even when that weight is very low, stick in the model forever.

JohnLangford commented 7 years ago

--l1 only trims while learning is enabled, but as long as learning is enabled it should gradually reduce all weights towards 0. See here: https://github.com/JohnLangford/vowpal_wabbit/blob/master/vowpalwabbit/gd.h#L130 and here: https://github.com/JohnLangford/vowpal_wabbit/blob/master/vowpalwabbit/gd.cc#L579

--l2 operates as a multiplier on the weights, making them all smaller.
For efficiency reasons, this is imposed on the prediction rather than the weights, here: https://github.com/JohnLangford/vowpal_wabbit/blob/master/vowpalwabbit/gd.cc#L344
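
Conceptually the two effects look like this (a toy Python sketch of the idea, not the actual gd.cc code):

# --l1 gradually pulls every weight towards 0 and truncates it at zero
# (a soft threshold applied while learning); --l2 acts as a multiplicative
# shrink, which is folded into the prediction instead of rewriting the weights.

def l1_truncate(weight, gravity):
    """Shrink weight towards zero by gravity, clamping at zero."""
    if weight > 0.0:
        return max(0.0, weight - gravity)
    return min(0.0, weight + gravity)

def l2_shrunk_prediction(weights, features, multiplier):
    """Apply the l2 shrink as a single multiplier on the dot product."""
    return multiplier * sum(weights.get(i, 0.0) * v for i, v in features)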

arielf commented 7 years ago

John, thanks so much for the pointers to the source and the explanations! It definitely looks like my --l1 was too small.

When I get more time, I'm thinking of cleaning up my script for after-the-fact weight trimming and shrinking of finalized models and contributing it. It may be useful to others...