Closed XuHQ1997 closed 3 years ago
The leaky clamp operation looks like this. As you can see, the problem you described does not exist.
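To make this concrete, here is a minimal plain-Python sketch of the leaky clamp from the original post (identity on [0, 1], slope 0.01 outside) and the resulting squared error for a target of 1. The function and variable names are just illustrative; this mirrors the one-line PyTorch expression quoted below.

```python
def leaky_clamp(x, slope=0.01):
    # Leaky clamp: identity on [0, 1], slope 0.01 outside,
    # mirroring torch.max(torch.min(x, x*slope + 1 - slope), x*slope)
    return max(min(x, x * slope + 1 - slope), x * slope)

def mse_loss(x, target=1.0):
    # Squared error after the clamp, for a single scalar prediction
    return (leaky_clamp(x) - target) ** 2

# For target 1, the loss decreases monotonically as the raw output
# approaches the target region, so there is no spurious local
# minimum for x < 0 -- only a region with a much smaller slope:
for x in (-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0):
    print(f"x={x:5.1f}  clamp={leaky_clamp(x):6.3f}  loss={mse_loss(x):.4f}")
```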
Personally, I prefer piecewise linear activation functions, which is why I usually use leaky ReLU or leaky clamp. Since I use leaky clamp in the output layer (instead of sigmoid), there will be negative values in the outputs, so BCE cannot be applied.
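A quick sketch of why BCE breaks on such outputs: binary cross-entropy takes the log of the prediction, which is undefined for values outside (0, 1). This is a hypothetical standalone `bce` helper, not code from the repo.

```python
import math

def bce(p, target):
    # Binary cross-entropy for a single prediction p in (0, 1)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

print(bce(0.9, 1.0))  # fine: prediction strictly inside (0, 1)
try:
    bce(-0.01, 1.0)   # a leaky-clamp output can be negative
except ValueError as e:
    print("BCE undefined for negative predictions:", e)
```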
Oh, sorry~ I just confused loss and gradient. The gradient may be greater when the prediction is closer to 1. Emm... it doesn't seem like a big problem, right?
Thanks for your reply.
When computing the gradient, you need to consider the ground truth. The gradients you showed in the figure seem to assume the ground truth is 1.
It is actually better to think of the gradient as the product of the gradient from MSE and the constant gradient (1 or 0.01) of this piecewise linear function.
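This product view can be written out by hand with the chain rule: d/dx (f(x) - t)² = 2·(f(x) - t)·f′(x), where f′ is 1 inside [0, 1] and 0.01 outside. A small illustrative sketch (names are mine, not from the repo):

```python
def clamp_grad(x, slope=0.01):
    # Derivative of the leaky clamp: 1 inside [0, 1], `slope` outside
    return 1.0 if 0.0 <= x <= 1.0 else slope

def loss_grad(x, target=1.0, slope=0.01):
    # Chain rule: the MSE gradient 2*(f(x) - target) times the
    # constant piecewise slope of the leaky clamp
    f = max(min(x, x * slope + 1 - slope), x * slope)
    return 2.0 * (f - target) * clamp_grad(x, slope)

# Outside [0, 1] the gradient is scaled down by 0.01, so the
# negative region is a gentle slope, not a local minimum:
print(loss_grad(0.5))   # inside [0, 1]: full-strength gradient
print(loss_grad(-1.0))  # outside: gradient scaled by 0.01
```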
Yes, you're right. Because negative samples greatly outnumber positive samples, I guess the network struggles to learn the positive samples, which is why I focused on them. I also wonder whether the imbalance between positive and negative samples is the main reason the network needs to be trained for so long.
I would consider the network structure as the main reason, i.e., the continuity/smoothness of the functions represented by MLPs.
Ok. Thanks for your patience :)
Thanks for the codes!
But the loss function used here is confusing to me. I notice that a (leaky) clamp operation is applied to the output of the generator.
l7 = torch.max(torch.min(l7, l7*0.01+0.99), l7*0.01)  # leaky clamp: identity on [0, 1], slope 0.01 outside
Then, MSE loss is used. Here is the problem. Say we have a positive sample whose target is 1, one prediction for it in the range [0, 1), and another prediction that is negative. Naturally, we expect the loss of the former to be less than that of the latter, but with the loss function used here that is not guaranteed. There would be a local minimum in the negative region. Do you think this makes training difficult?
By the way, is there any reason why you don't use BCE or NLL loss here? Or do you have any advice about the loss function?
Looking forward to your reply.