apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

Multi-label Softmax Layer #910

Closed jimxinbo closed 8 years ago

jimxinbo commented 8 years ago

Hey, I did not see any support for a multi-label softmax, i.e. where each input sample can have more than one label. Issue #270 seems to be about a different problem.

Mathematically, say there are C classes and each sample has K labels represented as a vector L, where L(i) belongs to {1,...,C} for i=1,...,K. Then for each sample the energy (or loss) is \sum_{i=1}^{K} -\ln(p_{L(i)}). When K=1 this is the regular softmax layer, but when K > 1 the backprop needs some derivation to get right.
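
For concreteness, here is a tiny numpy sketch of this loss for a single sample (the variable names and numbers are just illustrative, not from any mxnet code):

import numpy as np

C, K = 5, 2                    # number of classes, labels per sample
z = np.random.randn(C)         # logits for one sample
L = np.array([1, 3])           # the K ground-truth class indices
p = np.exp(z - z.max())
p /= p.sum()                   # softmax probabilities
loss = -np.log(p[L]).sum()     # \sum_{i=1}^{K} -ln(p_{L(i)})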

Anyway, I coded up such a layer for my own need. In case anyone cares and this is indeed missing, I will consider opening a pull request to share it.

piiswrong commented 8 years ago

SoftmaxOutput actually supports this feature. Check its parameters for usage.

jimxinbo commented 8 years ago

Hi piiswrong, thanks for the quick response.

Did you mean the multi_output parameter?

I noticed it before, but I do not think it does the same thing. For example, once multi_output is set to true, the input to SoftmaxOutput is assumed to be n x k x (x1,...,xn) dimensional instead of the regular n x (x1,...,xn). The case I am referring to is where you do not need to change the shape of the input data matrix, only that of the label matrix (from n x 1 to n x K).

There is a possibility that the multi_output parameter and my implementation are fundamentally doing the same thing, but I doubt it. At least for the problem I mentioned above, the multi_output way either does not do the same thing or does it in a less elegant way where you have to duplicate the input data K times.

I might be wrong because I did not go through it carefully; let me know your thoughts.

piiswrong commented 8 years ago

I see what you mean. For multi-target prediction, you can simply create n softmax symbols with the same input.
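
For concreteness, a minimal sketch of that idea for two labels, using the symbolic API (the symbol names are just illustrative): both SoftmaxOutput heads share the same C-way score, and each head expects its own label array (softmax1_label, softmax2_label by default).

import mxnet as mx

data = mx.sym.Variable('data')
fc = mx.sym.FullyConnected(data=data, num_hidden=10, name='fc')  # shared C-way scores
sm1 = mx.sym.SoftmaxOutput(data=fc, name='softmax1')             # first label slot
sm2 = mx.sym.SoftmaxOutput(data=fc, name='softmax2')             # second label slot
net = mx.sym.Group([sm1, sm2])                                   # bundle both outputs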

Kublai-Jing commented 8 years ago

@jimxinbo I believe the standard way to solve the multi-label problem is to use a logistic loss, i.e. each output is in {0,1}, so that for each instance multiple outputs can be 1. See e.g. http://arxiv.org/pdf/1312.5419v3.pdf

jimxinbo commented 8 years ago

Hi @Kublai-Jing, thanks for the suggestion. Yes, you are right. The standard understanding of softmax uses exactly this 1-of-K coding (the name comes from PRML), where for each class you have a {0,1} indicator. This generalizes easily to multiple labels in the way you mentioned, and I derived the backprop accordingly. From an implementation perspective, though, it may be more desirable to use a label matrix of n x K instead of n x C (e.g. C might be 1000 while K = 5). Of course, when K is not fixed, your way is necessary. (I actually implemented it first with 1-of-K coding, but was bothered by the big n x C matrix and switched to my current approach.) Anyway, good point.
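
To make the memory trade-off concrete, here is a small numpy sketch converting an n x K label matrix into the n x C multi-hot form that a logistic-loss output would expect (shapes and values are just illustrative):

import numpy as np

N, K, C = 4, 2, 10
labels_nk = np.array([[1, 4], [0, 3], [2, 7], [5, 9]])  # N x K label indices
labels_nc = np.zeros((N, C))                            # N x C multi-hot targets
rows = np.repeat(np.arange(N), K)
labels_nc[rows, labels_nk.ravel()] = 1.0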

I think @piiswrong 's suggestion is a nice workaround that lets us handle multi-label with the current mxnet layers. The only weak point is that we would need K forward passes where only one is necessary. This may or may not be a big deal; I am not sure.

Kublai-Jing commented 8 years ago

I see what you mean. But each forward pass still computes a C-way softmax, and you need to do it K times, which is more expensive than doing it once. The only advantage is the lower memory cost of the label matrix (as you mentioned, you only need an N x K label matrix, not N x C). I myself would probably not consider that a good strategy. Also, from an optimization standpoint, softmax pushes up the value of the target and pushes down the others. Yes, you do it K times so that each target value gets pushed up, but there are still gradients that try to push down the target values, even though they really shouldn't. Don't know if this is clear:

Say K = 2, C = 10. Take one data point (x, y) where y = (1, 4). You will get gradients that try to raise the values of classes 1 and 4. But when you do a forward pass with y = 1, there is a gradient that tries to push down class 4, and vice versa.

jimxinbo commented 8 years ago

Perhaps I did not make myself clear.

We can do only one forward pass using an N x K label matrix; this is what I did in my self-coded layer. All I am saying is that piiswrong's suggestion of using K (currently available) softmax layers would need K forward passes.

Regarding the gradient-push-down issue, I think you have good intuition here. I also had this concern when I saw @piiswrong's suggestion (of using K different softmax layers). But once you go through the math derivation, you will see it is actually equivalent. This is because the backprop gradient is a summation of the backprop from each labeled class, and there is no explicit push-down of the kind you might imagine. Please check the current softmax implementation and you will probably see what I mean.
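
Here is a quick sketch of that derivation (my own notation, continuing from my first post). With logits z and p_j = e^{z_j} / \sum_c e^{z_c}, a single softmax with label y back-propagates \partial(-\ln p_y)/\partial z_j = p_j - \mathbb{1}[j = y]. Summing K such softmax layers with labels L(1),...,L(K) gives \sum_{i=1}^{K} (p_j - \mathbb{1}[j = L(i)]) = K p_j - \sum_{i=1}^{K} \mathbb{1}[j = L(i)], which is exactly the gradient of the multi-label loss \sum_{i=1}^{K} -\ln(p_{L(i)}). So the apparent push-down from the K separate layers sums to the same gradient my single layer computes.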

Anyway, I am pretty sure my current implementation works correctly and in a memory-efficient manner (one forward, one backward, and only N x K label memory). I am trying to be objective by saying that @piiswrong 's suggestion will work and lets us use the currently available layers. But I am certainly not a big fan of it, and I think K forward passes is a bit wasteful.

pengpaiSH commented 8 years ago

@jimxinbo Could you please detail how to organize the current mxnet layers so that we can build a multi-label image classification framework?

jimxinbo commented 8 years ago

Hi @paipai880429, I ended up using a multi-label softmax layer I implemented myself.

For organizing the current mxnet layers, as suggested by @piiswrong , you can create k softmax symbols (k being the number of labels per sample) and assign one label to each.

pengpaiSH commented 8 years ago

@jimxinbo Do you mean that I should train K ConvNets, each being a binary classifier? If so, that would take K times the parameters and training time. Furthermore, we would have to generate 6 train.lst and train.bin files.

jimxinbo commented 8 years ago

What I have in mind is one convnet, but with k softmax symbols as the top layer. I believe this is mathematically correct: instead of back-propagating the gradient from one softmax, you are now back-propagating from k softmaxes.

But you have a point; I am not sure whether mxnet supports multiple output layers. Could you help, @piiswrong ?

Worst case, I am glad to make the layer I wrote public.

pengpaiSH commented 8 years ago

@jimxinbo I noticed that "LogisticRegressionOutput" may output a vector of multiple binary values? Also, if you made your own layer, what evaluation metric did you use for this multi-label training?

jimxinbo commented 8 years ago

In my problem, I assume I know the number of labels (say k). For the evaluation metric, I take the k largest probabilities from the softmax layer and check whether they match the ground-truth labels.
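
Roughly something like this numpy sketch (illustrative only, not the exact code I used):

import numpy as np

def topk_multilabel_match(probs, labels, k):
    # probs: N x C softmax outputs; labels: N x k ground-truth class indices
    topk = np.argsort(probs, axis=1)[:, -k:]   # indices of the k largest probabilities
    hits = sum(len(set(p) & set(t)) for p, t in zip(topk, labels))
    return hits / float(labels.size)           # fraction of true labels recovered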

pengpaiSH commented 8 years ago

@jimxinbo Would you please share your evaluation metric code?

jimxinbo commented 8 years ago

Sure, but I am up against a deadline at the moment. I will send the code next week.

sxjzwq commented 8 years ago

Hi @jimxinbo Would you please share your multi-label softmax layer and the evaluation metric? Thanks!

jimxinbo commented 8 years ago

@paipai880429 @sxjzwq Sorry for the late response; it was due to the Spring Festival. I put up a crappy version here: https://github.com/jimxinbo/multilabel-layer-mxnet

Please let me know if you have any questions.

pengpaiSH commented 8 years ago

@jimxinbo, thank you for sharing!