OpenGVLab / unmasked_teacher

[ICCV2023 Oral] Unmasked Teacher: Towards Training-Efficient Video Foundation Models
https://arxiv.org/abs/2303.16058
MIT License

What loss is used during stage-1 pre-training - "l2" or MSE loss #25

Closed: atawari closed this issue 10 months ago

atawari commented 10 months ago

In the GitHub scripts, the "l2" loss is used, which is a cosine-alignment loss, but the paper states that MSE loss is used. Which one is correct?

Excerpt from the paper:

"... We select the corresponding unmasked token from the student and teacher, and compute the mean squared error (MSE) between the normalized pairs..."

The scripts set clip_loss_type = "l2" and clip_norm_type = "l2":

import torch.nn as nn

# The 'mse' and 'smooth_l1' options build a loss module up front ...
if clip_loss_type == 'mse':
    loss_func_clip = nn.MSELoss()
elif clip_loss_type == 'smooth_l1':
    loss_func_clip = nn.SmoothL1Loss()

# ...

# ... while the 'l2' branch computes 2 - 2 * (cosine similarity) inline
if clip_loss_type == 'l2':
    loss_clip = (2 - 2 * (outputs_clip * targets_clip).sum(dim=-1)).mean()
elif clip_loss_type in ['mse', 'smooth_l1']:
    loss_clip = loss_func_clip(input=outputs_clip, target=targets_clip)
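For context, clip_norm_type = "l2" means both the student outputs and teacher targets are unit-normalized along the feature dimension before the loss is computed; a minimal sketch of that step (tensor names taken from the excerpt, shapes assumed):

import torch
import torch.nn.functional as F

# Hypothetical illustration of clip_norm_type == 'l2': after unit
# normalization, (outputs_clip * targets_clip).sum(dim=-1) is exactly
# the cosine similarity of each student/teacher token pair.
outputs_clip = F.normalize(torch.randn(8, 512), dim=-1)  # student tokens
targets_clip = F.normalize(torch.randn(8, 512), dim=-1)  # teacher tokens
cos_sim = (outputs_clip * targets_clip).sum(dim=-1)      # values in [-1, 1]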

The 'l2' branch above computes a cosine distance, not MSE. Which loss was used for the best stage-1 pre-trained checkpoint?

Andy1621 commented 10 months ago

Thanks for your good question. Since the outputs have been L2-normalized, the MSE loss is equivalent to the cosine distance (up to a constant scale); this follows MILAN. I did not rename it carefully in the code, so it may be a little misleading.
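A quick numerical check of this equivalence (a sketch, not repo code): for unit vectors u and v, ||u - v||^2 = 2 - 2 u·v, so nn.MSELoss (which also averages over the feature dimension d) is exactly the 'l2' loss scaled by 1/d:

import torch
import torch.nn as nn
import torch.nn.functional as F

# With L2-normalized features, ||u - v||^2 = 2 - 2 * (u . v), so the
# 'mse' and 'l2' branches differ only by the constant factor 1/d.
d = 512
u = F.normalize(torch.randn(8, d), dim=-1)
v = F.normalize(torch.randn(8, d), dim=-1)

l2_loss = (2 - 2 * (u * v).sum(dim=-1)).mean()           # the 'l2' branch
mse_loss = nn.MSELoss()(u, v)                            # the 'mse' branch
print(torch.allclose(mse_loss, l2_loss / d, atol=1e-6))  # True

Since the scale factor is constant, both losses have the same gradient direction and the same minimizers.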

atawari commented 10 months ago

Makes sense! Thank you.

PS: I ran into NaN loss when using the MSE loss function with fp16 training, so "l2" is also more numerically stable.

Andy1621 commented 10 months ago

Great! For the NaN issue, maybe you can try bf16, which is more stable.
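To illustrate why bf16 tends to be more stable (a minimal sketch, not repo code): bf16 keeps fp32's 8-bit exponent, so intermediate values that would overflow fp16's ~6.5e4 maximum and then propagate as NaN into the loss stay finite:

import torch

# fp16 overflows where bf16 does not: 300^2 = 9e4 exceeds fp16's max
# finite value (65504) but is easily representable in bf16.
x = torch.tensor([300.0])
print((x * x).to(torch.float16))   # tensor([inf], dtype=torch.float16)
print((x * x).to(torch.bfloat16))  # ~9e4, finite in bf16

In practice this usually means wrapping the forward pass in torch.autocast(device_type="cuda", dtype=torch.bfloat16) rather than fp16 autocast.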