Open linzhiqiu opened 2 years ago
I am also confused by the coef in training_hierarchy.py:
coef = args.consis_coef * math.exp(-5 * (1 - min(iteration/args.warmup, 1))**2)
This coefficient does not appear in the original paper.
One more question: for self-training, it seems both labeled and unlabeled data are used for the KL divergence between teacher and student? The original paper says only the unlabeled data is used to compute the KLD.
Hello, for the first question, the code for pseudo-label is from this PyTorch repo, which is a re-implementation of this official TensorFlow implementation from Google. From their comment: "Multiplying the one-hot pseudo_labels by 10 makes them look like logits."
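For what it's worth, here is a minimal illustration of that comment (my own sketch, not code from either repository): if the scaled one-hot vector is treated as logits and passed through a softmax, it recovers essentially the original one-hot distribution.

    import torch

    one_hot = torch.tensor([1.0, 0.0, 0.0])   # hard pseudo-label for class 0
    as_logits = 10 * one_hot                  # "makes them look like logits"
    print(as_logits.softmax(dim=0))           # ~[0.9999, 0.00005, 0.00005]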
As for the lt_mask, all of the papers from Google (Oliver et al., FixMatch, etc.) use the same codebase, but they did not specify the loss functions. You are right that the lt_mask is an extra term for pseudo-labeling that I should include in the paper. Since the output_teacher and output_student are the same when not using pseudo-labels, the extra term becomes the entropy of the predictions.
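As a quick numeric check of that claim (a standalone sketch with made-up logits, not the library's code): when the teacher and student outputs coincide, the below-threshold term of the loss equals the entropy of the model's own predictions.

    import torch
    import torch.nn.functional as F

    logits = torch.tensor([[1.0, 0.5, -0.2]])      # hypothetical shared teacher/student output
    y_probs = logits.softmax(dim=1)                # teacher predictions
    log_student = F.log_softmax(logits, dim=1)     # student log-probabilities (same network)

    lt_term = -(y_probs * log_student).sum(dim=1)  # the lt_mask part of the loss
    entropy = -(y_probs * y_probs.log()).sum(dim=1)
    print(torch.allclose(lt_term, entropy))        # True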
As for the coef, this is for warmup scheduling following Oliver et al.
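Concretely, the ramp-up behaves as in the sketch below (the 200k warmup length is just an illustrative value, not necessarily the repository's default):

    import math

    def warmup_coef(iteration, warmup, consis_coef=1.0):
        # Exponential ramp-up: starts near consis_coef * exp(-5) and
        # reaches consis_coef once iteration >= warmup, then stays there.
        return consis_coef * math.exp(-5 * (1 - min(iteration / warmup, 1)) ** 2)

    for it in (0, 50_000, 100_000, 150_000, 200_000, 300_000):
        print(it, round(warmup_coef(it, warmup=200_000), 4))
    # 0 -> 0.0067, 100000 -> 0.2865, 200000 and beyond -> 1.0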
About self-training: thanks for pointing this out. It is a typo in the paper; I did indeed use both labeled and unlabeled data for self-training.
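To make the corrected description concrete, a minimal sketch of a teacher-student KL objective applied to labeled and unlabeled batches together follows; the names here are hypothetical and the actual training code may differ.

    import torch
    import torch.nn.functional as F

    def self_training_kld(student_logits, teacher_logits):
        # KL(teacher || student), with the teacher's predictions treated as fixed targets.
        teacher_probs = teacher_logits.detach().softmax(dim=1)
        return F.kl_div(F.log_softmax(student_logits, dim=1), teacher_probs,
                        reduction="batchmean")

    # Hypothetical usage: labeled and unlabeled batches are concatenated before the loss.
    # student_logits = student(torch.cat([x_labeled, x_unlabeled]))
    # teacher_logits = teacher(torch.cat([x_labeled, x_unlabeled]))
    # loss = self_training_kld(student_logits, teacher_logits)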
Thanks for the helpful response! Could you point me to any paper that uses this specific variant of the pseudo-labelling loss?
I am confused by the implementation of pseudo-labelling in this library (lib/algs/pseudo_label.py). Specifically, the forward() has:
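Roughly, the target construction looks like the sketch below. I am paraphrasing from the identifiers involved (gt_mask, lt_mask, p_target), so the signature is simplified and this may not match lib/algs/pseudo_label.py exactly:

    import torch
    import torch.nn.functional as F

    def pseudo_label_loss(y, output, threshold):
        # y: teacher logits, output: student logits for the same batch (simplified signature).
        y_probs = y.softmax(dim=1)
        onehot = F.one_hot(y_probs.argmax(dim=1), num_classes=y.shape[1]).float()
        gt_mask = (y_probs > threshold).any(dim=1).float()   # max probability above threshold
        lt_mask = 1.0 - gt_mask                              # everything below the threshold
        p_target = gt_mask[:, None] * 10 * onehot + lt_mask[:, None] * y_probs
        return -(p_target.detach() * F.log_softmax(output, dim=1)).sum(dim=1).mean()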
I am confused why, when computing p_target, the gt_mask is multiplied by 10. What is the meaning of the 10 here?

Also, I believe the lt_mask marks the examples whose max probability is smaller than the threshold, which should therefore be ignored when computing the loss. However, p_target contains the term + lt_mask[:,None] * y_probs. This seems to be different from what is described in the paper. If you are implementing a variant of the pseudo-labelling loss function, could you point me to that paper?