no_bias_regularization not working #1729

Closed: bscan closed this issue 5 years ago

bscan commented 5 years ago

Describe the bug

The option --no_bias_regularization for BFGS doesn't seem to have any effect. It was added in https://github.com/VowpalWabbit/vowpal_wabbit/pull/1151 as a fix for https://github.com/VowpalWabbit/vowpal_wabbit/issues/919

To Reproduce

Here is an example dataset that shows the option has no effect; it also illustrates why I want to use it. This simple toy example predicts height (in feet) from gender and name.

5.4 | Female Alice
5.8 | Male Bob
5.9 | Male Charlie

Running the VW command below with and without --no_bias_regularization produces exactly the same coefficients.

vw -k -c --passes 20 --bfgs --holdout_off -d train.vw --readable_model readable.txt --l2 10 --no_bias_regularization

Expected behavior

Given a high value of L2 regularization, I would like the intercept to converge toward the global average (somewhere around 5+ feet) and the other coefficients to shrink toward zero. This can improve generalization to new feature values and to missing features. In the example above, the current behavior (which regularizes the intercept too) instead predicts a Male Jamie to be about 3 feet tall, and a Jamie with unknown gender to be about 2 feet tall.
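
To spell out the distinction (my own notation, nothing VW-specific; lambda is the L2 strength and b is the constant/intercept weight):

\min_{w,b} \sum_i (w \cdot x_i + b - y_i)^2 + \lambda \lVert w \rVert^2
% intercept excluded: as \lambda \to \infty, w \to 0 while b is free to absorb \bar{y} = (5.4 + 5.8 + 5.9)/3 = 5.7

\min_{w,b} \sum_i (w \cdot x_i + b - y_i)^2 + \lambda (\lVert w \rVert^2 + b^2)
% intercept included: the extra \lambda b^2 term drags b toward 0 as well, so all predictions collapse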

Observed Behavior

Nothing in the log indicates that --no_bias_regularization is being utilized.

Environment

VW 8.6.1 on Linux via command line

Additional context

I have included a scikit-learn example as a reference for the expected behavior: regularization is not applied to the intercept term by default. From a brief survey, it appears that scikit-learn, Spark MLlib, and many other packages exclude the intercept from regularization by default. Setting the intercept equal to the global mean (e.g. using the 'base' input format) doesn't work either, since the intercept in a regression isn't always supposed to be the global mean.

import pandas as pd
from sklearn import linear_model

train_data = pd.DataFrame({'Male': [0, 1, 1], 'Female': [1, 0, 0], 'Alice': [1, 0, 0],
                           'Bob': [0, 1, 0], 'Charlie': [0, 0, 1], 'Height': [5.4, 5.8, 5.9]})
regr = linear_model.Ridge(alpha=100)
regr.fit(train_data.drop(columns=['Height']), train_data['Height'])
print(regr.intercept_)

5.699

pavithraks commented 5 years ago

Hey @JohnLangford,

I started looking at this, and I don't understand how "weights[constant]" in line 460 of bfgs.cc points to the intercept. It always seems to be 0. As far as I can see, the constant variable is defined in "constant.h", and "weights[constant]" is used nowhere else in the code. I don't see where weights[constant] is actually assigned the intercept. Maybe I'm missing something really fundamental here.

bscan commented 5 years ago

Hey @pavithraks, thanks for digging into this. I think that the constant is added to the feature space of each example in add_constant_feature (line 807 of parser.cc), which is also referencing the constant variable defined in "constant.h". Then whenever update_weight() is called, it should be updating the weight of the constant.
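
Roughly, my mental model of that step (a simplified sketch, not the actual parser.cc code; the struct and signature here are made up for illustration):

#include <cstdint>
#include <vector>

constexpr uint64_t constant = 11650396;  // from vowpalwabbit/constant.h

struct feature { float x; uint64_t weight_index; };  // hypothetical simplified type

// Every parsed example gets one extra feature with the fixed hash `constant`
// and value 1.0, so the intercept is learned like any other weight.
void add_constant_feature(std::vector<feature>& ec) { ec.push_back({1.f, constant}); }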

Then in add_regularization() (line 447 of bfgs.cc), the gradient gets a regularization term for every feature, including the constant:

(&(*w))[W_GT] += regularization * (*w);

and line 467 then tries to exactly undo that for the constant:

(&weights[constant])[W_GT] -= regularization * weights[constant];

Edit: when it indexes into weights, it looks like it eventually calls out to array_parameters.h (line 126, inline weight& operator[](size_t i)) to ensure the weight_mask is applied. So the constant is 11650396, but indexing bitwise-ANDs it with the weight_mask to project the constant into b bits (example with 18 bits: constant = 116060).
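
To make the projection concrete, a quick standalone sketch (my own illustration; VW's real mask also accounts for the stride, see below):

#include <cstdint>
#include <cstdio>

int main()
{
  const uint64_t constant = 11650396;  // raw hash from constant.h
  const uint32_t num_bits = 18;        // i.e. -b 18
  const uint64_t mask = (uint64_t(1) << num_bits) - 1;
  // 11650396 & 0x3FFFF == 116060, the index seen in the readable model
  printf("%llu\n", (unsigned long long)(constant & mask));
  return 0;
}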

I don't really know what the root cause is, but one thing that looks odd is that finalize_preconditioner() applies regularization when updating the preconditioner for each feature and doesn't subtract it back out for the constant the way add_regularization() does. Maybe it doesn't matter, though, since I think preconditioning is primarily about rescaling to improve convergence.

JohnLangford commented 5 years ago

@bscan seems correct here.

pavithraks commented 5 years ago

Hey @bscan @JohnLangford, I believe I found the bug and the solution.

Cause of the problem: weights[constant] always returns 0, so no subtraction was happening at all in these two lines:

(&weights[constant])[W_GT] -= regularization * weights[constant];
ret -= 0.5 * regularization * (weights[constant]) * (weights[constant]);

This is because we weren't accessing it the right way: plain operator[] doesn't apply the stride shift, so weights[constant] reads an unrelated (and in this case zero) slot in the strided weight array instead of the constant's actual weight.

weights.strided_index(constant) gives us what we want. Thus, the correct code should probably be:

(&weights.strided_index(constant))[W_GT] -= regularization * (weights.strided_index(constant));
ret -= 0.5 * regularization * (weights.strided_index(constant)) * (weights.strided_index(constant));
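
To illustrate why the two accessors disagree, here is my own sketch of the index arithmetic (assuming -b 18 and BFGS's stride of 4 slots per feature):

#include <cstdint>
#include <cstdio>

int main()
{
  const uint64_t constant = 11650396;
  const uint32_t num_bits = 18, stride_shift = 2;
  // the mask covers the feature space times the stride
  const uint64_t weight_mask = (uint64_t(1) << (num_bits + stride_shift)) - 1;

  // operator[] only masks the raw hash: lands mid-array at an unrelated slot
  printf("weights[constant]       -> slot %llu\n",
         (unsigned long long)(constant & weight_mask));
  // strided_index shifts first: lands at the start of the constant's block
  printf("strided_index(constant) -> slot %llu\n",
         (unsigned long long)((constant << stride_shift) & weight_mask));
  return 0;
}

The two indices differ (116060 vs. 464240 here), which is consistent with weights[constant] reading a slot that nothing ever writes to.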

I am going to look at all the places where this happens and make changes. I hope to send in a fix in a couple of days.

Let me know if you have any thoughts. Using strided_index was @kumpera's suggestion; I was thinking of a more tortuous way of accessing the weight.

bscan commented 5 years ago

Thanks @pavithraks! I think this method makes sense and appears consistent with other uses in the code. For example, --constant lets the user set the initial value of the constant. This triggers:

if (g.initial_constant != 0.0)
      VW::set_weight(all, constant, 0, g.initial_constant); 

which calls:

inline void set_weight(vw& all, uint32_t index, uint32_t offset, float value)
{
  (&all.weights[((uint64_t)index) << all.weights.stride_shift()])[offset] = value;
}

So I suppose it makes sense that the primary thing missing was the stride shift, even though I'm not sure exactly what stride_shift is for.
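
If I understand the layout correctly, each feature owns a small contiguous block of weight slots, and stride_shift is how you jump between blocks. A sketch of my mental model (the W_* offsets are from bfgs.cc; everything else is simplified and hypothetical):

#include <cstdint>
#include <vector>

// per-feature slot offsets used by BFGS (from bfgs.cc)
#define W_XT 0    // current weight value
#define W_GT 1    // gradient accumulator
#define W_DIR 2   // search direction
#define W_COND 3  // preconditioner

int main()
{
  // each feature owns a contiguous block of (1 << stride_shift) floats,
  // so BFGS needs stride_shift = 2 to fit its four slots
  const uint32_t num_bits = 18, stride_shift = 2;
  std::vector<float> weights(size_t(1) << (num_bits + stride_shift), 0.f);

  // to reach a feature's block, shift the hashed index by stride_shift --
  // this is what set_weight and strided_index do, and what weights[constant] skipped
  const uint64_t feature_hash = 116060;  // constant projected into 18 bits
  float* block = &weights[feature_hash << stride_shift];
  block[W_XT] = 5.7f;  // the weight itself
  block[W_GT] = 0.f;   // the slot accessed elsewhere as (&w)[W_GT]
  return 0;
}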

bscan commented 5 years ago

Thanks @pavithraks! This is great.

I just tested this locally and it seems to be working. I reran the example I cited in the original comment.

Without the fix, the --readable_model output has the following coefficients:

48758:1.06154
116060:1.60385
165961:0.5391
232213:0.542308
261698:0.522438 

Checking out the latest rev and rebuilding gives me:

48758:0.0409091
116060:5.68637
165961:0.0287878
232213:-0.0409087
261698:0.0121211

Where the key line is the constant's weight, 116060:5.68637, which is close to the scikit-learn value of 5.699 and the actual label average of 5.7.
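
For anyone else verifying locally, here's a sketch of spot-checking predictions on an unseen name (file names are hypothetical, and it assumes the training command above also passes -f model.vw to save the model; test.vw would contain a line like "| Male Jamie"):

vw -t -i model.vw -d test.vw -p predictions.txt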