I was going through the DANN implementation and there are a couple of things that seem off to me. My confusion mostly relates to lines 266-272.

The following things raise questions, and I don't see why it's implemented this way:
The `disc_loss` is the cross-entropy between the true and predicted domain labels. This cross-entropy will of course be high when the discriminator is doing poorly and low otherwise. It is then added to `disc_softmax[:, disc_labels].sum()`, which seems to follow the opposite pattern: if the softmax output for the correct class is high (i.e. the discriminator is doing well), this value is high, and low otherwise. By adding these two together, don't they cancel out, at least partially? (I try to illustrate this with a small snippet after these questions.)
Whether optimizing the discriminator or the predictor, the `disc_loss` on line 272 always includes the `grad_penalty` and `F.cross_entropy(disc_out, disc_labels)`. When optimizing the discriminator, this is returned directly; otherwise, you flip the sign and multiply by lambda. If I understand correctly, this effectively means that when optimizing the predictive loss, you try to maximize the penalty, which makes for a good domain classifier and thus features that are not domain invariant.
Why compute `disc_softmax[:, disc_labels].sum()`? Is there any work suggesting this is a better option than just `-F.cross_entropy(disc_out, disc_labels)` for a gradient reversal?
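To make the first question concrete, here is a small toy check (my own snippet, not code from the repo; I'm reading `disc_softmax[:, disc_labels]` as the per-sample probability of the true domain):

```python
import torch
import torch.nn.functional as F

# 3 samples, 2 domains; logits that strongly favor the true domain,
# i.e. a discriminator that is doing well.
disc_out = torch.tensor([[4.0, -4.0], [5.0, -5.0], [-4.0, 4.0]])
disc_labels = torch.tensor([0, 0, 1])

ce = F.cross_entropy(disc_out, disc_labels)  # ~0.0002: low, the discriminator is good
disc_softmax = F.softmax(disc_out, dim=1)
true_class_sum = disc_softmax[torch.arange(len(disc_labels)), disc_labels].sum()
# ~3.0: near its maximum, i.e. high exactly when the cross-entropy is low

print(ce.item(), true_class_sum.item())
```

The two terms move in opposite directions, which is why adding them together looks to me like they should (at least partially) cancel.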
Something like this makes more sense to me (pseudo-code just to illustrate the idea):
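This is only a sketch of the sign handling, not a drop-in replacement: names like `classifier_loss`, `grad_penalty`, and `lambda_weight` are placeholders, and how the penalty is computed is left out.

```python
import torch.nn.functional as F

def dann_loss(disc_out, disc_labels, classifier_loss, grad_penalty,
              lambda_weight, update_discriminator):
    # Plain domain-classification cross-entropy.
    disc_ce = F.cross_entropy(disc_out, disc_labels)

    if update_discriminator:
        # Discriminator step: learn to predict the domain well,
        # regularized by the gradient penalty.
        return disc_ce + grad_penalty
    else:
        # Featurizer/classifier step: reverse only the cross-entropy
        # (gradient reversal) and keep the penalty out of the sign flip.
        return classifier_loss - lambda_weight * disc_ce
```

This way the penalty only regularizes the discriminator, and the featurizer is trained against the plain domain cross-entropy rather than against the penalty as well.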