Lyttonkeepfoing opened this issue 1 year ago
Thanks for your interest in our paper. The baseline is MSP with standard training, and we did not use temperature scaling at training or inference time. The code for kdloss and cwd_weight belongs to other approaches we experimented with but did not end up using; we will remove it to avoid confusion.
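For reference, the MSP baseline simply takes the maximum softmax probability of the standard-trained classifier as the confidence score, with no temperature involved; a minimal sketch (placeholder interface, not the exact code in this repo):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def msp_confidence(model, x):
    # MSP baseline: confidence is the largest softmax probability,
    # with no temperature scaling. Placeholder interface for illustration,
    # not the exact code in this repo.
    logits = model(x)                    # (batch, num_classes)
    probs = F.softmax(logits, dim=1)
    confidence, prediction = probs.max(dim=1)
    return confidence, prediction
```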
Thanks for your response! I have one more question. In your paper you say: "We randomly sample 10% of training samples as a validation dataset for each task because it is a requirement for post-calibration methods like temperature scaling." So are the results reported in your paper obtained by training on 45000 samples and testing on the validation set? It seems there is no code for the test/evaluation step.
-------------------Make loader-------------------
Train Dataset : 45000 Valid Dataset : 5000 Test Dataset : 10000
If you evaluate directly on the test dataset, then your training set contains 45000 samples, not 50000.
I think this is really important.
Yes, the model is trained on 45000 samples and evaluated on the original test dataset (10000 samples). If you want to train on the full training set, just modify the code (line 115) in utils/data.py. The results of training on the full training set can also be found in our CVPR 2023 paper "OpenMix: Exploring Outlier Samples for Misclassification Detection".
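For clarity, the split amounts to something like the following (a simplified sketch with placeholder names, not the literal content of utils/data.py):

```python
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

# Simplified sketch of the CIFAR-10 split; names and the exact logic are
# placeholders, not the literal code in utils/data.py.
train_full = datasets.CIFAR10(root='./data', train=True, download=True,
                              transform=transforms.ToTensor())
test_set = datasets.CIFAR10(root='./data', train=False, download=True,
                            transform=transforms.ToTensor())

val_ratio = 0.1                       # set to 0.0 to train on all 50000 samples
num_val = int(len(train_full) * val_ratio)
train_set, val_set = random_split(
    train_full, [len(train_full) - num_val, num_val],
    generator=torch.Generator().manual_seed(0))   # fixed seed for a reproducible split

print(f"Train Dataset : {len(train_set)} Valid Dataset : {len(val_set)} "
      f"Test Dataset : {len(test_set)}")
```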
Yeah, I noticed your OpenMix paper. The methods in your Table 2 are Doctor [NeurIPS21] [19], ODIN [ICLR18] [38], Energy [NeurIPS20] [39], MaxLogit [ICML22] [23], LogitNorm, MC-dropout, Trust Score, and TCP. Did you reimplement these methods in your OpenMix repo? Although some of them are post-hoc methods, I think comparing them under the same training setting is important, since accuracy strongly affects the other metrics. I'm sorry to ask you so many questions; you're a good researcher in the failure prediction field and I learn a lot from your works~
Following your suggestion, we will upload our implementations of Doctor, ODIN, Energy, MaxLogit, and LogitNorm. As for MC-dropout, Trust Score, and TCP, we used the results reported in the TPAMI version of the TCP paper. In our papers we also emphasize that classification accuracy is important. For example, LogitNorm itself has lower accuracy than the baseline because it constrains the logit norm during training. For TCP, with commonly used standard training (e.g., SGD with a learning rate schedule), there are few misclassified samples in the training set for learning the ConfidNet. In practice it is very hard, and may be impossible, to keep the same accuracy for all compared methods, so we report accuracy along with the other confidence estimation metrics.
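To be concrete, most of the post-hoc methods above only differ in the confidence score computed from the logits of the same trained network; a rough sketch (for illustration only, omitting ODIN's input perturbation and Doctor's specific softmax statistic, and not our exact implementation) looks like:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def posthoc_scores(logits, T=1.0):
    """Common post-hoc confidence scores computed from the same logits.

    Illustration only: ODIN additionally perturbs the input and Doctor uses
    a different statistic of the softmax, both omitted here.
    """
    probs = F.softmax(logits / T, dim=1)
    return {
        'msp': probs.max(dim=1).values,                     # MSP baseline
        'max_logit': logits.max(dim=1).values,              # MaxLogit
        'energy': T * torch.logsumexp(logits / T, dim=1),   # Energy score
    }
```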
That's exactly what I think~ Looking forward to you updating your repo.
It's a really nice repo. I read your paper and was wondering whether the baseline you use is MSP + temperature scaling, but I could not find any temperature scaling operation in your code. I only found this option:

```python
parser.add_argument('--cwd_weight', default=0.1, type=float, help='Training time tempscaling')
```

and this loss:

```python
class KDLoss(nn.Module):
    def __init__(self, temp_factor):
        super(KDLoss, self).__init__()
        self.temp_factor = temp_factor
        self.kl_div = nn.KLDivLoss(reduction="sum")

kdloss = KDLoss(2.0)
```

So is temperature scaling used at training time rather than at inference time? You said it is a post-hoc method, so shouldn't it be applied at inference time? Could you help me with this confusion?
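To be clear, by post-hoc temperature scaling I mean the standard recipe of Guo et al. (2017): fit a single scalar T on held-out validation logits and divide the test logits by T before the softmax. Roughly like this (my own sketch, not code from your repo):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fit_temperature(logits_val, labels_val, max_iter=50):
    """Fit a single temperature T on validation logits by minimizing NLL.

    My own sketch of standard post-hoc temperature scaling, not code
    from this repository.
    """
    log_T = nn.Parameter(torch.zeros(1))   # optimize log(T) so that T stays positive
    optimizer = torch.optim.LBFGS([log_T], lr=0.1, max_iter=max_iter)

    def closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(logits_val / log_T.exp(), labels_val)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_T.exp().item()

# The network itself is untouched; calibration happens only at inference:
#   T = fit_temperature(val_logits, val_labels)
#   calibrated_probs = F.softmax(test_logits / T, dim=1)
```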