[Re] When Does Label Smoothing Help? #75

Open sdwagner opened 1 year ago

sdwagner commented 1 year ago

Original article: Rafael Müller, Simon Kornblith, and Geoffrey E. Hinton. "When does label smoothing help?." Advances in neural information processing systems 32 (2019). (https://arxiv.org/pdf/1906.02629.pdf)

PDF URL: https://github.com/sdwagner/re-labelsmoothing/blob/main/report/article.pdf
Metadata URL: https://github.com/sdwagner/re-labelsmoothing/blob/main/report/metadata.yaml
Code URL: https://github.com/sdwagner/re-labelsmoothing

Scientific domain: Machine Learning
Programming language: Python
Suggested editor: Georgios Detorakis or Koustuv Sinha

rougier commented 1 year ago

Thanks for your submission, we'll assign an editor soon.

rougier commented 1 year ago

@koustuvsinha @gdetor Can any of you two edit this submission in machine learning?

gdetor commented 1 year ago

@rougier I can handle this.

rougier commented 1 year ago

@gdetor Thanks!

tuelwer commented 8 months ago

@gdetor thank you for agreeing to handle this submission! Is there anything we can do to move this submission forward?

gdetor commented 8 months ago

@tuelwer Sorry for the delay.

Hi @ogrisel and @benureau Could you please review this submission?

rougier commented 7 months ago

Any update?

gdetor commented 6 months ago

Dear reviewers (@ReScience/reviewers), could anybody review this submission?

mo-arvan commented 6 months ago

I'd be interested in reviewing this submission, but I should mention that I doubt I can rerun all the experiments due to computational constraints.

rougier commented 6 months ago

@mo-arvan Thanks and I think not re-doing everything is fine. @gdetor What do you think?

gdetor commented 6 months ago

@rougier @mo-arvan I'm OK with it.

mo-arvan commented 6 months ago

Okay, I will review this work by the end of July.

mo-arvan commented 3 months ago

I apologize, but I have not been able to review this submission yet; I should be able to write the review within the next few weeks.

rougier commented 2 months ago

Thanks. Any progress?

gdetor commented 1 month ago

@mo-arvan gentle reminder

mo-arvan commented 1 month ago

In this paper, Wagner et al. provide a reproduction report of Müller et al.'s work on label smoothing. They begin with a concise introduction to the original study and the motivations behind it. The authors then present essential details regarding the models and datasets used, noting specific variations driven by limited computational resources.
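As a quick refresher for other readers: label smoothing replaces the one-hot training target with a mixture of the one-hot vector and the uniform distribution over the K classes. The snippet below is a minimal sketch of that standard formulation in PyTorch, written for this review; the helper name and the smoothing factor `alpha` are illustrative and not taken from the authors' code.

```python
import torch
import torch.nn.functional as F

def smoothed_targets(labels, num_classes, alpha=0.1):
    """Label-smoothed targets: (1 - alpha) * one_hot + alpha / K."""
    one_hot = F.one_hot(labels, num_classes).float()
    return (1.0 - alpha) * one_hot + alpha / num_classes

# Example: three samples, three classes
labels = torch.tensor([0, 2, 1])
print(smoothed_targets(labels, num_classes=3))

# Recent PyTorch versions (>= 1.10) expose the same behaviour directly:
# criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)
```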

The authors have done an excellent job of providing documentation and instructions for using their released code. Their repository includes multiple Jupyter notebooks detailing the conducted experiments, along with specified dependency requirements to facilitate the setup process. To further simplify future installations, I created a Docker container as part of the review process. The files and instructions are available in my forked repository.

In their initial results, the authors examine the effect of label smoothing on model accuracy. While Müller et al. claimed that label smoothing positively impacts the test accuracy of trained models, Wagner et al. suggest that it enhances accuracy by reducing overfitting—a claim not made by the original authors. However, their results indicate mixed effects; out of eight experiments, three showed higher accuracy without label smoothing. Upon reviewing their code (https://github.com/sdwagner/re-labelsmoothing/blob/fb6c3634d2049ef7f175e7a992f109c43680fae3/datasets/datasets.py), it appears that they do not load the test set, raising the possibility that the reported results are based on the validation set. Unlike the original study, this reproduction does not include confidence intervals, and the small differences in accuracy could be attributed to randomness in the training process. Adding uncertainty analysis would significantly strengthen this work.

In the next section, the authors reproduce the results of a visualization experiment from the original study that demonstrates the effect of label smoothing on the activations of the penultimate layer and the network output. Figure 2 in their work aligns with the findings of the original study, although there is a minor discrepancy in the order of the columns in the visualization.

The authors then investigate the impact of label smoothing on Expected Calibration Error (ECE). With the exception of the results from the MNIST dataset using a fully connected network, their findings generally align with those of the original study. The reported results for training a transformer model for translation are mixed, with not all findings matching the original study. Similar to the accuracy results, the authors report findings based on the validation set, which may account for some discrepancies.
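For reference, ECE bins the predictions by confidence and takes the bin-mass-weighted average of the gap between per-bin accuracy and per-bin confidence. The function below is a minimal NumPy sketch of that computation written for this review (the function name and the bin count of 15 are assumptions); it is not the evaluation code used by the authors.

```python
import numpy as np

def expected_calibration_error(confidences, predictions, labels, n_bins=15):
    """Binned ECE: sum over bins of |accuracy - confidence| weighted by bin mass.

    confidences: max softmax probability per sample, shape (N,)
    predictions: predicted class per sample, shape (N,)
    labels:      true class per sample, shape (N,)
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    correct = (predictions == labels)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece
```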

Finally, the results of the distillation experiments on fully connected networks for MNIST are consistent with the original study, though there is a slight increase in error. Ultimately, the authors confirm the observation made by Müller et al. regarding accuracy degradation in students when the teacher is trained with label smoothing. Figures 7 and 8 lack the confidence intervals present in the original study, which would have been beneficial for comparison.
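For context, the distillation experiments train a student against the teacher's temperature-softened outputs, usually mixed with the ordinary hard-label cross-entropy. The sketch below shows the standard loss from Hinton et al. (2015); the temperature and mixing weight are illustrative defaults, not the authors' exact configuration.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, beta=0.5):
    """Mix hard-label cross-entropy with KL divergence to the teacher's
    temperature-softened distribution (scaled by T^2, as in Hinton et al.)."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    return (1.0 - beta) * hard + beta * soft
```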

Minor editing suggestions:

  • "The authors state, that the logit dependents on the Euclidean distance" -> "The authors state that the logit depends on the Euclidean distance"
  • "The evaluation was performed using the ECE" -> ECE should be spelled out on first use.

gdetor commented 1 month ago

@mo-arvan Thank you for your report. @tuelwer @sdwagner Could you please respond to the reviewer's comments?

tuelwer commented 1 month ago

@mo-arvan Thank you for reviewing our submission and for your thoughtful and detailed comments! @gdetor We will update our submission in the next few days to incorporate the reviewer's comments.

tuelwer commented 1 month ago

@mo-arvan Thanks for creating a dockerfile! Feel free to open a PR to integrate it into our repository 😊

mo-arvan commented 1 month ago

Glad you found it useful. Sure, I'll submit a pull request. I'd be happy to engage in a discussion as well.

One last minor comment: your use of vector graphics in the figures is a step up from the original publication. I'd suggest changing the color palette and the patterns to further improve the presentation, e.g., Figure 3 (b).

tuelwer commented 1 month ago

@mo-arvan Thanks again for your detailed comments! In the following we want to address each of the points that you raised:

  1. Confusion about validation and test data: We carefully double-checked our datasets and can confirm that all experiments were performed on the test split of each dataset:

    • For the datasets implemented by PyTorch, we set train=False, which corresponds to the test split (please refer to, e.g., here; see also the sketch after this list).
    • For the CUB-200-2011 data we use the test split of the dataset, which is defined in the file train_test_split.txt. The CUB-200-2011 dataset does not have a validation set.
    • For Tiny ImageNet we use the split that is defined as the validation split. We assume that the authors of the paper did this as well, since the test split of the Tiny ImageNet dataset is not labeled. We apologize for the confusion, and we have refactored the code accordingly.
  2. Uncertainty quantification: We added confidence intervals for Figures 6 and 7.

  3. Color palette: We have chosen the colors that were used in the original work to allow easy comparison of the experimental results.

  4. Edits: We have incorporated the proposed changes into our report.
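For reference, the loading pattern described in point 1 looks roughly like the following sketch (MNIST as an example; the root path and transform are illustrative rather than the exact code in our repository):

```python
from torchvision import datasets, transforms

# train=False selects the official test split for the torchvision datasets
test_set = datasets.MNIST(
    root="./data",
    train=False,
    download=True,
    transform=transforms.ToTensor(),
)
print(len(test_set))  # 10,000 test images for MNIST
```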

gdetor commented 1 month ago

@mo-arvan Please let me know if you agree with the responses so I can publish the paper. Thank you.

mo-arvan commented 4 weeks ago

Yes, the response addresses my main concerns. I was wrong about the validation/test splits.