LAION-AI / CLAP

Contrastive Language-Audio Pretraining
https://arxiv.org/abs/2211.06687

Loss Implementation #109


MBAnslow commented 1 year ago

I'm a little confused by the exact loss used in the experiments. From #100 it is clear that you only use the projection layers, which are basically Linear > ReLU > Linear > Normalise, so the MLP loss is not used. Then, in the loss function, assuming world_size == 1, we end up with the calculation:

logits_per_audio = logit_scale_a * audio_features @ text_features.T
logits_per_text = logit_scale_a * text_features @ audio_features.T
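
For reference, my understanding is that these logits feed a symmetric cross-entropy with diagonal targets, roughly like the sketch below (world_size == 1, both feature matrices already L2-normalised; a sketch of the standard CLIP-style loss, not the exact repo code):

```python
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(audio_features, text_features, logit_scale_a):
    # Features assumed L2-normalised, shape (N, d); positives lie on the diagonal.
    logits_per_audio = logit_scale_a * audio_features @ text_features.T
    logits_per_text = logit_scale_a * text_features @ audio_features.T
    labels = torch.arange(audio_features.shape[0], device=audio_features.device)
    # Average of the audio->text and text->audio cross-entropy terms.
    return (F.cross_entropy(logits_per_audio, labels)
            + F.cross_entropy(logits_per_text, labels)) / 2
```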

However, in the paper, after equation 3 you say:

Where τ is a learnable temperature parameter for scaling the loss. Two logarithmic terms consider either audio-to-text logits or text-to-audio logits.
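
As I read it, that passage describes the standard symmetric form below (my reconstruction of the usual notation, not a verbatim copy of equation 3; E^a_i and E^t_i denote the projected audio and text embeddings):

```latex
L = -\frac{1}{2N}\sum_{i=1}^{N}\left(
      \log\frac{\exp(E^a_i \cdot E^t_i / \tau)}{\sum_{j=1}^{N}\exp(E^a_i \cdot E^t_j / \tau)}
    + \log\frac{\exp(E^t_i \cdot E^a_i / \tau)}{\sum_{j=1}^{N}\exp(E^t_i \cdot E^a_j / \tau)}
    \right)
```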

Only the audio-to-text scale parameter, logit_scale_a, is used in the code if you don't use the MLP-layer loss. Should logits_per_text actually use logit_scale_t instead of logit_scale_a? Currently only mlp_loss seems to use logit_scale_t, but per #100 that path isn't used. The paper says both scale parameters are used, which doesn't seem to line up with the code. There is also the weighted_loss, and I'm not sure whether that is used either.
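
For concreteness, the two-temperature variant I'm describing would look roughly like this (hypothetical; using logit_scale_t on the text-to-audio side is exactly the change in question, not what the code currently does):

```python
# Hypothetical two-temperature variant suggested by the paper's wording.
def two_scale_logits(audio_features, text_features, logit_scale_a, logit_scale_t):
    logits_per_audio = logit_scale_a * audio_features @ text_features.T
    logits_per_text = logit_scale_t * text_features @ audio_features.T
    return logits_per_audio, logits_per_text
```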

Could you please clarify?

Thank you.

lukewys commented 1 year ago

Hi, indeed, as you found out, the two temperature parameters are identical in our implementation: both directions are scaled by the same logit_scale. We follow the implementation of the CLIP model; please see https://github.com/mlfoundations/open_clip/blob/main/src/training/train.py#L274. In our experiments, we tried using two temperature parameters in the MLP case, but adding the MLP did not work very well.

Best,
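
For reference, the shared temperature in CLIP-style implementations is a single learnable scalar stored in log space; a minimal sketch (the 1/0.07 initialisation follows CLIP/open_clip, and this is illustrative rather than CLAP's exact code):

```python
import numpy as np
import torch
import torch.nn as nn

# One learnable log-temperature shared by both directions, as in CLIP.
logit_scale = nn.Parameter(torch.ones([]) * np.log(1 / 0.07))

# At loss time the scale is exponentiated (and commonly capped for stability,
# e.g. open_clip clamps it so that exp(logit_scale) <= 100).
scale = logit_scale.exp().clamp(max=100.0)
```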