Shark-ML / Shark

The Shark Machine Learning Library. See more:
http://shark-ml.github.io/Shark/
GNU Lesser General Public License v3.0

McSVMs do not have a test case for bias/maybe a bug? #103

Open · Ulfgard opened this issue 8 years ago

Ulfgard commented 8 years ago

Assigning Tobias to it to get his attention.

I have just worked on the McSVMs and tried to unify the bias solvers. On the way I figured out that there are no test cases for this. We need test cases for this non-trivial piece of code.

This becomes more serious, as I stumbled across the following computation of the bias gradient in BiasSolver in QpMcBoxDecomp.h that I could not explain:

for (std::size_t i=0; i<numExamples; i++){
    for (std::size_t p=0; p<cardP; p++){
        double g = dualGradient(i,p);
        if (g > 0.0){
            unsigned int y = m_problem->label(i);
            typename QpSparseArray<QpFloatType>::Row const& row = nu.row(y * cardP + p);
            for (std::size_t b=0; b<row.size; b++) 
                grad(row.entry[b].index) -= row.entry[b].value;
        }
    }
}

From my understanding, the gradient stems from adding the equality constraint to the objective, using the value of the bias as the Lagrange multiplier, i.e., we get the following term in the dual:

-\sum_c b_c \sum_{i,p} \nu_{y_i,p,c} \alpha_{i,p}

where b_c is the bias value for class c and everything to the right of it is the equality constraint. Thus, when solving for alpha, we get the following term added to the linear term of the dual:

-\sum_c b_c \nu_{y_i,p,c}

which is exactly what performBiasUpdate in the class does.
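
For reference, written out (my own consolidation, not quoted from the code; I assume the constraint has right-hand side zero, matching the A alpha = 0 form used further below): the equality constraint for class c is

\sum_{i,p} \nu_{y_i,p,c} \alpha_{i,p} = 0

and differentiating the added term w.r.t. a single \alpha_{i,p} recovers the correction to the linear term:

\frac{\partial}{\partial \alpha_{i,p}} \Bigl( -\sum_c b_c \sum_{j,q} \nu_{y_j,q,c} \alpha_{j,q} \Bigr) = -\sum_c b_c \nu_{y_i,p,c}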

But if that logic is correct, the gradient must instead be computed as:

for (std::size_t i=0; i<numExamples; i++){
    unsigned int y = m_problem->label(i);
    for (std::size_t p=0; p<cardP; p++){
        double a = m_problem->alpha(i,p);
        if(a == 0) continue;//optimization
        typename QpSparseArray<QpFloatType>::Row const& row = nu.row(y * cardP + p);
        for (std::size_t b=0; b<row.size; b++) 
            grad(row.entry[b].index) -= row.entry[b].value * a;
    }
}

So what am I missing? Note that we cannot use the current formulation in the linear solvers, as we cannot compute the gradients cheaply in that formulation. (We also do not get the KKT conditions easily, and since we change the linear term, we will lose accuracy on the order of max(deltaLinear) in performBiasUpdate.)

TGlas commented 8 years ago

The idea of the original code was to compute the primal gradient w.r.t. the bias, but it did a lousy job. The check g > 0 was intended to test whether the loss function is flat; however, the a == 0 test is more appropriate. Also, the multiplication with \alpha you introduced is certainly correct. Anyway, my simple test still fails, so that does not yet seem to be the end of the story. Still investigating...

Ulfgard commented 8 years ago

Ah, now I understand better and can see in the binary problem how it is derived. I am still not sure about the multi-class case, though...

Do you mean that even my proposed gradient update is wrong?

TGlas commented 8 years ago

Your gradient is exactly what Kienzle and Schölkopf proposed. That was my original approach. I changed it later into something "more efficient", which turned out to be something wrong. I have (locally) created a simple test case with four classes, one point each, linear kernel. The arrangement is nicely symmetric, which means that the optimal solution is obvious, and the same for all machines. It is currently found only by OVA :) The code is not checked in yet because it is full of cout and printf. This minimal example is sufficient to rule out the old "primal gradient" approach.

Your gradient is correct and a simple GD works, but only if the dual solver is trained to ridiculously high precision. With epsilon=1e-8 I can fit the bias to maybe three digits of precision, but the resulting weight vectors are 10% off (the norm deviates; the orientation is fine, but still only one digit of precision). In its current form this is not useful. I don't know yet how to deal with this. Maybe simply do a fixed number of outer-loop iterations (bias updates) and call it good, or maybe I can come up with a better stopping criterion. The original paper does not have a good solution; it only claims that it "just works". I can tell for sure that it does not work on my test case.

Aside from my synthetic test case I'd also like to test this on real data sets, in the vague hope that they may turn out to be easier. The tough part there is that we don't have any alternative solver for generating a "ground truth" solution. That's actually why I was pushing for SAG or SVRG interfaces - SGD is far too slow and inaccurate for reasonably sized data sets.

Ulfgard commented 8 years ago

Why not use Rprop? Or is this your "simple GD"?

Ulfgard commented 8 years ago

The primal gradient should be along the lines of "the sum of gradients of the hinge loss w.r.t. its inputs", right? I would like to help you here, but I am still unsure about how the coefficient matrices and the hinge loss coincide.

Ulfgard commented 8 years ago

What about a Lagrangian approach, e.g., AdMoM? Let A alpha = c be the equality constraint that is introduced by the bias. Then by AdMoM we can introduce a variable beta = alpha and have the constraint A beta = c. Given some value of alpha, we can solve this sub-problem in AdMoM efficiently.

Also, regarding the ground-truth solution: for the linear case we have a viable SAG trainer that AFAIK can also cope with biases. So at least for the linear case we can do this, provided we use the right hinge-loss generalization.

TGlas commented 8 years ago

My simple GD is really only GD. The only intelligence is in the learning rate adaptation. The initial value is "small". I keep track of the previous gradient. If the inner product of gradient and previous gradient is negative then I shrink the learning rate, otherwise I increase it. Rprop would also do, but it makes it harder to respect the sum-to-zero constraint.
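
For concreteness, a minimal sketch of that loop as I read it (not the actual code; biasGradient is a hypothetical stand-in for the dual-based bias gradient, and I assume the sum-to-zero constraint is kept by projecting the step onto the constraint subspace):

#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical stand-in: gradient of the objective w.r.t. the bias vector,
// obtained by re-solving the dual for the current bias.
std::vector<double> biasGradient(std::vector<double> const& b);

void biasDescent(std::vector<double>& b, std::size_t iterations){
    double eta = 1e-3; // "small" initial learning rate
    std::vector<double> prevGrad(b.size(), 0.0);
    for (std::size_t t = 0; t < iterations; t++){
        std::vector<double> grad = biasGradient(b);
        // shrink the learning rate if the gradient direction flipped
        // (negative inner product with the previous gradient), else grow it
        double ip = std::inner_product(grad.begin(), grad.end(), prevGrad.begin(), 0.0);
        if (ip < 0.0) eta *= 0.5; else eta *= 1.2;
        // subtract the mean so the step stays in the sum-to-zero subspace
        double mean = std::accumulate(grad.begin(), grad.end(), 0.0) / grad.size();
        for (std::size_t c = 0; c < b.size(); c++)
            b[c] -= eta * (grad[c] - mean);
        prevGrad = grad;
    }
}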

Forget about the primal gradient. In the "one point per class" data set the gradient w.r.t. the bias is exactly zero. Of course, the gradient w.r.t. bias and alpha at the same time would work. For this case my problem decomposition does not work at all.

I am unsure about the AdMoM details, but I think it is pretty much in line with the spirit of the Kienzle and Schölkopf approach. The problem of finding the optimal bias for a given alpha (or w) is a linear program. In the first version of the code I had extracted the simplex solver from the GNU scientific library. Removing that in favor of the primal approach was a great relief - fixing all compiler warnings in that horrible C code is a lifetime task. Also, GSL is under the GPL license, not LGPL. I did a search recently and I did not find a satisfactory linear program solver in C++. I really don't want to roll my own :)

Thanks for the hint, I'll use the linear SAG trainer.

Ulfgard commented 8 years ago

How can this be 0? If you solve perfectly for alpha and the gradient for b is 0, you are done.

TGlas commented 8 years ago

I see your point. This is how I arrived at the above conclusion:

Assume n points in n-dimensional space, in general position, one per class (n classes). We use a linear kernel. Let's fix the bias to zero and solve for the weight vectors. We have n weight vectors available, that's enough degrees of freedom to model arbitrary values of the n functions on the training points. Hence there is a hard-margin solution. Then all loss terms are zero, and therefore zero is a sub-gradient of the primal objective with respect to the bias. This can be achieved no matter how we shift the data (which would usually be compensated by changing the bias), at least as long as the points remain linearly independent.

Now let's fix the bias to an arbitrary value, then the same arguments hold. There should be one value of the bias for which the resulting sum of squared norms of the weight vectors is minimal (or maybe a one-dimensional sub-space for relative margins, which collapses to a point with the sum-to-zero constraint). Only that solution is globally optimal.

If there is no flaw in the above arguments then I have implicitly constructed a sub-optimal solution with vanishing primal gradient. That's of course a contradiction. What am I missing?

Ulfgard commented 8 years ago

Just to give you a hint: this situation is no different from using an RBF kernel in the binary case, with as many points as you like (a hard-margin solution always exists). As we have a unique bias there, we can conclude that there is a mistake in your reasoning.

I do not see why there should be a sub-optimal solution. The first paragraph states that, not taking the regularization into account, we can model every function of n points arbitrarily well, leading to a zero hinge loss, which in turn leads to the primal gradient of b being zero. But in reality we regularize, and thus our hinge loss will be non-zero in practice.

That said, a sum of hinge-loss terms is convex and the regularizer is strictly convex. The latter ensures that the optimum is unique.

Ulfgard commented 8 years ago

Ahh, I understand now. You choose so little regularization that the hard-margin solution can be achieved in the dual, as C is larger than the largest variable value. Now we can fix the bias and find a solution which leads to all hinge losses being 0.

Hm, are you sure that you can still find the optimum with a nonzero bias? I feel that it takes at least one degree of freedom to correct for it.

TGlas commented 8 years ago

Yes, I am pretty sure. SVMs without bias work well, also hard-margin. All we have to do is change the definition of linear separability from "there exists a function <w,x>+b that separates the data" to "there exists a function <w,x> that separates the data". To separate here means that all margins (relative or absolute, pick one) are positive. In addition we may or may not enforce the sum-to-zero constraint. With enough degrees of freedom (e.g., Gaussian kernel) this is always possible. So... if the data is linearly separable, by definition, there exists a separation. It is not unique. The hard-margin SVM is simply the one with the largest margin.

Anyway, your hint at the binary problem with Gaussian kernel is still a valid point. Later when I have time I'll try to start from there.

Ulfgard commented 8 years ago

So the problem seems to be that, unless b is optimal, a dual gradient of alpha = 0 does not imply a primal gradient of alpha = 0, right? Because otherwise a primal gradient of b = 0 would imply optimality.

//edit: In math terms: the dual solution is not primal feasible.

Ulfgard commented 8 years ago

I found the solution. It's obvious.

If we change b in the primal, we change the linear part in the dual. So for b != 0, a dual gradient of w = 0 does not imply a hard-margin solution, as we are solving a different problem. Therefore the test if(g > 0) in the primal gradient computation does not test for "hinge loss is flat", because we have to correct for the current effect of b.

TGlas commented 8 years ago

I am not sure I understand you correctly. My problem is more fundamental. Forget about the dual entirely. We don't even need kernels or the multi-class SVMs.

Let's talk only about the primal, in unconstrained form (with hinge loss, no slack variables). This is a convex problem. It is strictly convex in w, but not in b. Let P(w, b) denote the primal objective function.
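
For concreteness, in the binary case this is, e.g. (one standard form, just to fix notation):

P(w, b) = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \max\bigl(0, 1 - y_i (\langle w, x_i \rangle + b)\bigr)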

Fix b to an arbitrary value. Solve for w. For conceptual clarity we solve the primal problem directly (still no dual). So we know that 0 \in \nabla_w P(w, b), i.e., zero is a sub-gradient of P w.r.t. w. If C happens to be large enough for a hard-margin solution then we also find 0 \in \nabla_b P(w, b). At the same time we know that the solution is in general not optimal, since solutions depend on b, and b was arbitrary.

Still the same contradiction. I am really stuck with this. Can you explain your above argument in these terms?

Ulfgard commented 8 years ago

Could it be that sums of subgradients are more complicated than we thought?

TGlas commented 8 years ago

Yes. I just discussed this problem with Christian and Ürün, and I arrived at the same conclusion. If zero is a sub-gradient of the problem restricted to each sub-space, that is not enough for optimality. So it is better to work entirely in the dual, where the objective is smooth.
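
A textbook-style counterexample, sketched here for reference: take the convex function

f(x, y) = x + y + 2 |x - y|

Restricted to either axis, f(x, 0) = x + 2|x| has sub-differential [-1, 3] at the origin, which contains 0 (and the same holds for f(0, y)). Yet f(t, t) = 2t is unbounded below, so (0, 0) is not a minimizer; indeed \partial f(0, 0) = \{(1 + 2s, 1 - 2s) : s \in [-1, 1]\} does not contain 0.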

Ulfgard commented 8 years ago

Which solver are you now targeting? I would still suggest an augmented Lagrangian approach where we add the terms

inner_prod(beta, prod(A, alpha)) + 1/2 * norm_sqr(prod(A, alpha))

to the dual for the constraint A alpha = 0. This gives an easy way to update beta:

beta -= c*A alpha

and I guess there is some hope of incorporating it nicely into the dual...
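
A rough sketch of that outer loop (plain C++, not Shark's types; solveAugmentedDual is a hypothetical inner solver for the dual with the two terms above added, at fixed beta):

#include <cstddef>
#include <vector>

using Vector = std::vector<double>;
using Matrix = std::vector<Vector>; // rows of the constraint matrix A

// Hypothetical stand-in: solves the dual with the augmented terms
// inner_prod(beta, prod(A, alpha)) + 1/2 * norm_sqr(prod(A, alpha)) added.
Vector solveAugmentedDual(Matrix const& A, Vector const& beta, double c);

Vector multiply(Matrix const& A, Vector const& x){
    Vector y(A.size(), 0.0);
    for (std::size_t r = 0; r < A.size(); r++)
        for (std::size_t j = 0; j < x.size(); j++)
            y[r] += A[r][j] * x[j];
    return y;
}

Vector solveWithBiasConstraint(Matrix const& A, double c, std::size_t outerIters){
    Vector beta(A.size(), 0.0); // multipliers for the constraint A alpha = 0
    Vector alpha;
    for (std::size_t t = 0; t < outerIters; t++){
        alpha = solveAugmentedDual(A, beta, c); // inner solve at fixed beta
        Vector residual = multiply(A, alpha);   // A alpha, should shrink to 0
        for (std::size_t r = 0; r < beta.size(); r++)
            beta[r] -= c * residual[r];         // beta -= c * A alpha
    }
    return alpha;
}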

Ulfgard commented 7 years ago

Can I get your test code for this? I would like to discuss with Kenny why it may break.

TGlas commented 7 years ago

> Can I get your test code for this? I would like to discuss with Kenny why it may break.

What exactly do you need?