Open linzhiqiu opened 2 years ago
I am also confused by the coef in training_hierarchy.py:
coef = args.consis_coef * math.exp(-5 * (1 - min(iteration/args.warmup, 1))**2)
This coefficient does not appear in the original paper.
One more question: for self-training, it seems both labeled and unlabeled data are used for the KL divergence between teacher and student? The original paper says only the unlabeled data is used to compute the KLD.
Hello, for the first question, the code for pseudo-label is from this PyTorch repo, which is a re-implementation of this official TensorFlow implementation from Google. From their comment: "Multiplying the one-hot pseudo_labels by 10 makes them look like logits."
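For what it's worth, here is a minimal illustration of that comment (my own sketch, not code from either repository): if the scaled one-hot vector is treated as logits and passed through a softmax, it recovers essentially the original one-hot distribution.

    import torch

    one_hot = torch.tensor([1.0, 0.0, 0.0])   # hard pseudo-label for class 0
    as_logits = 10 * one_hot                  # "makes them look like logits"
    print(as_logits.softmax(dim=0))           # ~[0.9999, 0.00005, 0.00005]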
As for the lt_mask, all of the papers from Google (Oliver et al., FixMatch, etc.) use the same codebase, but they did not specify the loss functions. You are right that the lt_mask is an extra term for pseudo-labeling that I should include in the paper. Since the output_teacher and output_student are the same when not using pseudo-labels, the extra term becomes the entropy of the predictions.
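As a quick numeric check of that claim (a standalone sketch with made-up logits, not the library's code): when the teacher and student outputs coincide, the below-threshold term of the loss equals the entropy of the model's own predictions.

    import torch
    import torch.nn.functional as F

    logits = torch.tensor([[1.0, 0.5, -0.2]])      # hypothetical shared teacher/student output
    y_probs = logits.softmax(dim=1)                # teacher predictions
    log_student = F.log_softmax(logits, dim=1)     # student log-probabilities (same network)

    lt_term = -(y_probs * log_student).sum(dim=1)  # the lt_mask part of the loss
    entropy = -(y_probs * y_probs.log()).sum(dim=1)
    print(torch.allclose(lt_term, entropy))        # True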
As for the coef, this is for warmup scheduling following Oliver et al.
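Concretely, the ramp-up behaves as in the sketch below (the 200k warmup length is just an illustrative value, not necessarily the repository's default):

    import math

    def warmup_coef(iteration, warmup, consis_coef=1.0):
        # Exponential ramp-up: starts near consis_coef * exp(-5) and
        # reaches consis_coef once iteration >= warmup, then stays there.
        return consis_coef * math.exp(-5 * (1 - min(iteration / warmup, 1)) ** 2)

    for it in (0, 50_000, 100_000, 150_000, 200_000, 300_000):
        print(it, round(warmup_coef(it, warmup=200_000), 4))
    # 0 -> 0.0067, 100000 -> 0.2865, 200000 and beyond -> 1.0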
About self-training: thanks for pointing this out. It is a typo in the paper; I did indeed use both labeled and unlabeled data for self-training.
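To make the corrected description concrete, a minimal sketch of a teacher-student KL objective applied to labeled and unlabeled batches together follows; the names here are hypothetical and the actual training code may differ.

    import torch
    import torch.nn.functional as F

    def self_training_kld(student_logits, teacher_logits):
        # KL(teacher || student), with the teacher's predictions treated as fixed targets.
        teacher_probs = teacher_logits.detach().softmax(dim=1)
        return F.kl_div(F.log_softmax(student_logits, dim=1), teacher_probs,
                        reduction="batchmean")

    # Hypothetical usage: labeled and unlabeled batches are concatenated before the loss.
    # student_logits = student(torch.cat([x_labeled, x_unlabeled]))
    # teacher_logits = teacher(torch.cat([x_labeled, x_unlabeled]))
    # loss = self_training_kld(student_logits, teacher_logits)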
Thanks for the helpful response! Could you point me to any paper that uses this specific variant of the pseudo-labelling loss?
I am confused by the implementation of pseudo-labelling in this library (lib/algs/pseudo_label.py). Specifically, the forward() has:
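Roughly, the target construction looks like the sketch below. I am paraphrasing from the identifiers involved (gt_mask, lt_mask, p_target), so the signature is simplified and this may not match lib/algs/pseudo_label.py exactly:

    import torch
    import torch.nn.functional as F

    def pseudo_label_loss(y, output, threshold):
        # y: teacher logits, output: student logits for the same batch (simplified signature).
        y_probs = y.softmax(dim=1)
        onehot = F.one_hot(y_probs.argmax(dim=1), num_classes=y.shape[1]).float()
        gt_mask = (y_probs > threshold).any(dim=1).float()   # max probability above threshold
        lt_mask = 1.0 - gt_mask                              # everything below the threshold
        p_target = gt_mask[:, None] * 10 * onehot + lt_mask[:, None] * y_probs
        return -(p_target.detach() * F.log_softmax(output, dim=1)).sum(dim=1).mean()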
I am confused why, when computing p_target, the gt_mask is multiplied by 10. What is the meaning of the 10 here?

Also, I believe the lt_mask marks the examples whose max probability is smaller than the threshold, which should therefore be ignored when computing the loss. However, p_target contains the term + lt_mask[:,None] * y_probs. This seems to be different from what is described in the paper. If you are implementing a variant of the pseudo-labelling loss function, could you point me to that paper?