brain-research / realistic-ssl-evaluation

Open source release of the evaluation benchmark suite described in "Realistic Evaluation of Deep Semi-Supervised Learning Algorithms"
Apache License 2.0

Why not use the original pseudo-label implementation as the baseline? #26

Closed: CheukNgai closed this issue 5 years ago

CheukNgai commented 5 years ago

@avital The original Pseudo-Label paper says "we just pick up the class which has maximum predicted probability for each unlabeled sample", and it defines the loss as the weighted average of the labeled loss and the unlabeled loss.

Could you please explain why this code doesn't use what the original paper describes as the baseline, but instead uses another implementation with a "teacher-student-like" loss that picks the label only when the confidence exceeds a threshold?

craffel commented 5 years ago

"we just pick up the class which has maximum predicted probability for each unlabeled sample"

This is exactly what our code does, except for the thresholding, which is typical in implementations of pseudo-label and in other studies of "self-training" (which pseudo-label is a rebranding of). Note that our code is equivalent to "pick the class with the maximum probability" if the threshold is set to 0.

Regarding the student-teacher bit, I think you are just confused about how the algorithm is implemented given that we implement everything as "Guess a label (the "teacher") and use that for training the network (the "student")". Here the label guess is the "pseudo-label".
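A minimal sketch of that reading, in hypothetical NumPy rather than the repo's actual TensorFlow code: the "teacher" step guesses a one-hot pseudo-label by taking the arg-max of the predicted class probabilities and masks out guesses whose confidence falls below a threshold; with the threshold set to 0 this reduces to the paper's "pick the class with the maximum predicted probability". The "student" step then trains the same network against those guesses.

```python
import numpy as np

def guess_pseudo_labels(probs, threshold=0.95):
    """Teacher step: turn predicted probabilities into one-hot pseudo-labels.

    probs: (batch, num_classes) softmax outputs on unlabeled examples.
    Returns (pseudo_labels, mask); mask[i] is 1.0 only when the max
    probability exceeds `threshold`. With threshold=0.0 every example is
    kept, i.e. plain "pick the class with the maximum predicted probability".
    """
    guesses = probs.argmax(axis=-1)                   # max-probability class per example
    pseudo_labels = np.eye(probs.shape[-1])[guesses]  # one-hot targets
    mask = (probs.max(axis=-1) > threshold).astype(np.float32)
    return pseudo_labels, mask


def pseudo_label_loss(probs, pseudo_labels, mask, eps=1e-8):
    """Student step: cross-entropy against the guessed labels,
    ignoring low-confidence examples via the mask."""
    per_example = -np.sum(pseudo_labels * np.log(probs + eps), axis=-1)
    return np.mean(mask * per_example)
```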

CheukNgai commented 5 years ago

> Regarding the student-teacher bit, I think you are just confused about how the algorithm is implemented given that we implement everything as "Guess a label (the "teacher") and use that for training the network (the "student")". Here the label guess is the "pseudo-label".

Thank you for your reply, but I still have a question about the loss definition. In this repo's implementation, the consistency loss is defined as the loss over the "consistency of the teacher's label and the student's label". In the original pseudo-label paper, the loss is the weighted average of the labeled loss and the unlabeled loss, whereas in this repo's implementation you use the weighted average of the labeled-data loss and the pseudo-label loss (i.e. using reverse_kl). Why does this repo not apply the same loss function to the labeled and unlabeled data, but instead use reverse_kl separately for the pseudo-labels? Is it beneficial for improving accuracy?

craffel commented 5 years ago

Hi, there is no difference between the two loss functions. In our codebase, reverse KL is equivalent to cross-entropy for the unlabeled data. In both cases the loss is cross_entropy(prediction, label) + alpha(t)*cross_entropy(prediction, pseudo_label), where the first term is computed over labeled data points and the second over unlabeled data points.
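As a rough illustration (hypothetical NumPy, not the repo's code): because a pseudo-label is a one-hot distribution, its entropy is zero, so KL(pseudo_label || prediction) and cross_entropy(prediction, pseudo_label) coincide, and the combined objective below mirrors the weighted sum above, with alpha standing in for the ramp-up schedule alpha(t).

```python
import numpy as np

def cross_entropy(probs, targets, eps=1e-8):
    # H(targets, probs) = -sum_c targets_c * log(probs_c), averaged over the batch.
    return -np.mean(np.sum(targets * np.log(probs + eps), axis=-1))

def kl(targets, probs, eps=1e-8):
    # KL(targets || probs); for a one-hot `targets` the entropy term is zero,
    # so this equals cross_entropy(probs, targets) above.
    return np.mean(np.sum(targets * (np.log(targets + eps) - np.log(probs + eps)), axis=-1))

def total_loss(labeled_probs, labels, unlabeled_probs, pseudo_labels, alpha=1.0):
    # Weighted sum of the two cross-entropy terms; `alpha` stands in for alpha(t).
    return (cross_entropy(labeled_probs, labels)
            + alpha * cross_entropy(unlabeled_probs, pseudo_labels))
```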