arthurdouillard / incremental_learning.pytorch

A collection of incremental learning paper implementations including PODNet (ECCV20) and Ghost (CVPR-W21).
MIT License

Question about implementing classification loss #47

Closed · pp1016 closed this issue 3 years ago

pp1016 commented 3 years ago

I found that previous methods such as LwF.MC, iCaRL, and EEIL all use a binary cross-entropy loss to calculate both the distillation term and the classification term, while in your code the binary cross-entropy loss is replaced by cross-entropy for the classification term. I am wondering whether there is a difference between binary cross-entropy and cross-entropy. In my own implementation, the performance is better if I only apply binary cross-entropy, but the problem is actually a multi-class classification task, so I am confused by the results. Thanks.

arthurdouillard commented 3 years ago

Actually, my iCaRL code does use BCE (https://github.com/arthurdouillard/incremental_learning.pytorch/blob/0d25c2e12bde4a4a25f81d5e316751c90e6f789b/inclearn/models/icarl.py#L361).

I think the original EEIL paper uses softmax+CE, are you sure?

Sigmoid+BCE seems more robust than Softmax+CE in continual learning. I do not yet have a clear explanation, but here are some intuitions:

* an old paper showed that BCE led to better representations in metric learning (thus a family less sensitive to forgetting)

* there is less interaction between tasks when using sigmoid instead of softmax

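To make the contrast concrete, here is a minimal sketch (not code from this repo) of the two classification losses applied to the same logits; the batch size, class count, and variable names are arbitrary placeholders:

```python
# Minimal sketch contrasting Softmax+CE and Sigmoid+BCE for classification.
import torch
import torch.nn.functional as F

num_classes, batch_size = 10, 4
logits = torch.randn(batch_size, num_classes)           # model outputs
targets = torch.randint(0, num_classes, (batch_size,))  # integer class labels

# Softmax + CE: classes compete through the softmax normalization.
ce_loss = F.cross_entropy(logits, targets)

# Sigmoid + BCE: each class is treated as an independent binary problem,
# so old and new classes interact less with each other.
one_hot = F.one_hot(targets, num_classes).float()
bce_loss = F.binary_cross_entropy_with_logits(logits, one_hot)

print(ce_loss.item(), bce_loss.item())
```

With BCE each class gets its own sigmoid, so the logits of old classes are not pushed down just because a new class receives a high score, which is one way to read the second bullet above.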

pp1016 commented 3 years ago

Thanks for the reply! Another question is about the implementation of the knowledge distillation loss. In some code bases, such as iCaRL, it is done with a BCE loss, while others, such as BiC, use a KL-divergence loss. I am wondering whether there is any difference between the two. If we plan to use Sigmoid+BCE for classification, can we still choose KL divergence for distillation? In my own implementation, Sigmoid+BCE + BCE(KD) is much more effective than Sigmoid+BCE + KLD(KD). Does this indicate that we should use either Sigmoid+BCE + BCE(KD) or Softmax+CE + KLD(KD)? Thanks!

arthurdouillard commented 3 years ago

Yes, Softmax+KL for KD is different from Sigmoid+BCE.

I've seen experiments where one was better and others where the other was better, so I don't really have a good intuition about it. It does seem, though, that with KL it's more important to have a well-tuned temperature.

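To illustrate the difference, here is a minimal, hypothetical sketch of the two distillation variants on the old-class logits; `old_logits`, `new_logits`, and the temperature value are placeholders, not the exact code of this repo, iCaRL, or BiC:

```python
# Minimal sketch of the two distillation losses discussed above.
import torch
import torch.nn.functional as F

num_old_classes, batch_size = 5, 4
old_logits = torch.randn(batch_size, num_old_classes)  # frozen previous model (teacher)
new_logits = torch.randn(batch_size, num_old_classes)  # current model (student), old classes only

# Variant 1: Softmax + KL divergence with a temperature T (as in BiC-style KD).
T = 2.0
kd_kl = F.kl_div(
    F.log_softmax(new_logits / T, dim=1),
    F.softmax(old_logits / T, dim=1),
    reduction="batchmean",
) * (T * T)

# Variant 2: Sigmoid + BCE against the teacher's sigmoid probabilities
# (iCaRL-style distillation); no temperature is strictly required here.
kd_bce = F.binary_cross_entropy_with_logits(new_logits, torch.sigmoid(old_logits))

print(kd_kl.item(), kd_bce.item())
```

The `T * T` factor is the usual rescaling that keeps gradient magnitudes comparable as the temperature changes; the BCE variant has no such knob, which fits the observation above that temperature tuning matters mostly for the KL version.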

pp1016 commented 3 years ago

Yes, I agree. When using Sigmoid+BCE to implement KD, it is hard to tune the temperature T. Thanks for the reply!