LayneH / SAT-selective-cls

Self-Adaptive Training for Selective Classification.
MIT License

Criteria for training and testing are mixed up #3

DingQiang2018 opened this issue 2 years ago

DingQiang2018 commented 2 years ago

https://github.com/LayneH/SAT-selective-cls/blob/dc5559358b6bf59e23ce988baa6b8430626c8822/train.py#L200-L201

It might be a mistake to use the same criterion in the train function and the test function, since this mixes up the history of the model's predictions on the training set with its predictions on the test set.

LayneH commented 2 years ago

Thank you for pointing this out. This is indeed a bug in our code: we should not pass the SAT criterion to the test() function.

I have rerun the experiments after fixing this bug and found that the performance is slightly improved.
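To be concrete, the fix is to keep a plain cross-entropy loss for evaluation. A minimal sketch, with hypothetical names (`train`, `test`, the loaders, and the criterion constructor stand in for the actual objects in train.py, not their exact signatures):

```python
import torch.nn as nn

# The SAT criterion keeps a per-sample history of the model's predictions on the
# training set, so it must only ever be used inside train().
sat_criterion = build_sat_criterion()    # the SAT loss used for training (hypothetical helper)
test_criterion = nn.CrossEntropyLoss()   # plain cross-entropy for evaluation

for epoch in range(num_epochs):
    train(train_loader, model, sat_criterion, optimizer, epoch)
    # Passing the SAT criterion here was the bug: test-set predictions would
    # be written into its training-set history.
    test(test_loader, model, test_criterion, epoch)
```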

DingQiang2018 commented 2 years ago

Could you push the updated code to this repository? I did not get better performance after fixing the bug and rerunning the experiments.

LayneH commented 2 years ago

Hi,

Please refer to the latest commit. The scripts should produce slightly (if noticeably) better results than the reported ones.

DingQiang2018 commented 2 years ago

Hi, I find that even with the updated code, I cannot reproduce the CIFAR10 results reported in your paper. My results are as follows:

coverage    mean    standard deviation
100         6.008   0.138
95          3.724   0.028
90          2.064   0.045
85          1.187   0.031
80          0.656   0.002
75          0.406   0.051
70          0.298   0.055

As the table shows, the selective error rate at 95% coverage is 3.72%, which is far from the reported (3.37±0.05)%. Could you help me solve this problem?

DingQiang2018 commented 2 years ago

Sorry for not explaining mean and standard deviation in my last comment. In that table, they refer to the mean and standard deviation of the selective error rate, computed over 3 trials.

LayneH commented 2 years ago

Hi,

It seems that most entries are pretty close to, or better than, the numbers reported in the paper, except for the case of 95% coverage.

I have checked the experiment logs and found that some of the CIFAR10 experiments (but none of the experiments on other datasets) are based on an earlier implementation of SAT, which slightly differs from the current implementation in this line:

```python
# current implementation
soft_label[torch.arange(y.shape[0]), y] = prob[torch.arange(y.shape[0]), y]
# earlier implementation
soft_label[torch.arange(y.shape[0]), y] = 1
```

You can try this to see the performance.
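For clarity, here is a toy, self-contained illustration of how the two update rules differ. The shapes and the uniform initialization of `soft_label` are for illustration only; the actual SAT loss maintains a moving average of past predictions across epochs:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy batch: 4 samples, 3 classes (illustrative only, not the repo's SAT loss).
num_samples, num_classes = 4, 3
y = torch.tensor([0, 2, 1, 1])                                   # ground-truth labels
prob = F.softmax(torch.randn(num_samples, num_classes), dim=1)   # current predictions

# Stored soft labels (in SAT, a moving average of past predictions);
# initialized uniformly here just to make the two update rules comparable.
soft_label = torch.full((num_samples, num_classes), 1.0 / num_classes)

# Current implementation: the true-class entry is set to the model's own
# probability for the true class.
soft_label_current = soft_label.clone()
soft_label_current[torch.arange(y.shape[0]), y] = prob[torch.arange(y.shape[0]), y]

# Earlier implementation: the true-class entry is forced to 1.
soft_label_earlier = soft_label.clone()
soft_label_earlier[torch.arange(y.shape[0]), y] = 1

print(soft_label_current)
print(soft_label_earlier)
```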

DingQiang2018 commented 2 years ago

Hi, I reran the experiments with the earlier implementation of SAT and got the following results. In this table, mean and std dev again refer to the mean and standard deviation of the selective error rate.

coverage    mean    std dev
100 5.854   0.216
95  3.603   0.133
90  1.978   0.117
85  1.109   0.046
80  0.683   0.070
75  0.433   0.044
70  0.303   0.031

The performance is better than that of the current implementation of SAT, but the selective error rate at 95% coverage, 3.603%, is still not as good as the reported (3.37±0.05)% in your paper. Perhaps there is a clerical mistake in the paper?

Jordy-VL commented 1 year ago

Interesting reproduction analysis, did this eventually get resolved? Should one use the main branch for reproductions?

DingQiang2018 commented 1 year ago

> Interesting reproduction analysis, did this eventually get resolved? Should one use the main branch for reproductions?

No, I gave up. This repository does not provide the random seed manualSeed, making it challenging to reproduce the results.
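If anyone still wants to try, here is a generic seeding helper for PyTorch runs. This is not the repo's actual manualSeed handling, and even with fixed seeds results may differ across GPU and cuDNN versions:

```python
import random
import numpy as np
import torch

def set_seed(seed: int) -> None:
    """Fix the seeds that typically matter for a PyTorch training run."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Trade speed for determinism in cuDNN kernels.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)  # example seed; the value used for the paper is unknown
```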

Jordy-VL commented 1 year ago

Might I ask you if you know of any other selective classification methods that 'actually work'? I was looking into self-adaptive training as well, which seems related.

DingQiang2018 commented 1 year ago

As far as I know, Deep Ensembles [1] really work and might be the most powerful method. However, given the heavy computational overhead of ensembles, recent work in selective classification focuses on single models. These methods (e.g., [2][3]) show only marginal improvement over Softmax Response [4]. The advances in this line of work seem neither significant nor exciting. That said, my survey might not be comprehensive; more comprehensive surveys can be found in [5][6].
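For reference, a minimal sketch of how the Softmax Response baseline [4] computes the selective error at a fixed coverage (one common evaluation protocol; details vary across papers):

```python
import numpy as np

def selective_error(probs: np.ndarray, labels: np.ndarray, coverage: float) -> float:
    """Selective error of Softmax Response: keep the `coverage` fraction of
    samples with the highest max-softmax confidence and report the error
    rate on the kept samples."""
    confidence = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    n_keep = int(np.ceil(coverage * len(labels)))
    keep = np.argsort(-confidence)[:n_keep]   # most confident samples
    return float((predictions[keep] != labels[keep]).mean())

# Toy usage with random "softmax outputs" (illustrative only).
rng = np.random.default_rng(0)
logits = rng.normal(size=(1000, 10))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
labels = rng.integers(0, 10, size=1000)
print(selective_error(probs, labels, coverage=0.95))
```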

[1] Lakshminarayanan et al. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. In NIPS, 2017.
[2] Liu et al. Deep Gamblers: Learning to Abstain with Portfolio Theory. In NeurIPS, 2019.
[3] Feng et al. Towards Better Selective Classification. In ICLR, 2023.
[4] Geifman and El-Yaniv. Selective Classification for Deep Neural Networks. In NIPS, 2017.
[5] Gawlikowski et al. A Survey of Uncertainty in Deep Neural Networks. arXiv:2107.03342.
[6] Galil et al. What Can We Learn from the Selective Prediction and Uncertainty Estimation Performance of 523 ImageNet Classifiers? In ICLR, 2023.