LAION-AI / CLAP

Contrastive Language-Audio Pretraining
https://arxiv.org/abs/2211.06687

Loss Implementation #109


MBAnslow commented 1 year ago

I'm a little confused by the exact loss used in the experiments. From #100 it is clear that you only use the projection layers, which are basically Linear > ReLU > Linear > Normalise, so the MLP loss is not used. Then, in the loss function, assuming world_size == 1, we end up with the calculation:

logits_per_audio = logit_scale_a * audio_features @ text_features.T
logits_per_text = logit_scale_a * text_features @ audio_features.T
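
For reference, my understanding is that these logits feed a symmetric cross-entropy with diagonal targets, roughly like the sketch below (world_size == 1, both feature matrices already L2-normalised; a sketch of the standard CLIP-style loss, not the exact repo code):

```python
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(audio_features, text_features, logit_scale_a):
    # Features assumed L2-normalised, shape (N, d); positives lie on the diagonal.
    logits_per_audio = logit_scale_a * audio_features @ text_features.T
    logits_per_text = logit_scale_a * text_features @ audio_features.T
    labels = torch.arange(audio_features.shape[0], device=audio_features.device)
    # Average of the audio->text and text->audio cross-entropy terms.
    return (F.cross_entropy(logits_per_audio, labels)
            + F.cross_entropy(logits_per_text, labels)) / 2
```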

However, in the paper, after equation 3 you say:

Where τ is a learnable temperature parameter for scaling the loss. Two logarithmic terms consider either audio-to-text logits or text-to-audio logits.
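
As I read it, that passage describes the standard symmetric form below (my reconstruction of the usual notation, not a verbatim copy of equation 3; E^a_i and E^t_i denote the projected audio and text embeddings):

```latex
L = -\frac{1}{2N}\sum_{i=1}^{N}\left(
      \log\frac{\exp(E^a_i \cdot E^t_i / \tau)}{\sum_{j=1}^{N}\exp(E^a_i \cdot E^t_j / \tau)}
    + \log\frac{\exp(E^t_i \cdot E^a_i / \tau)}{\sum_{j=1}^{N}\exp(E^t_i \cdot E^a_j / \tau)}
    \right)
```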

Only the audio-to-text scale parameter, logit_scale_a, is used in the code if you don't use the MLP-layer loss. Should logits_per_text actually use logit_scale_t instead of logit_scale_a? Currently only mlp_loss seems to use logit_scale_t, but per #100 that path isn't used. The paper says both scale parameters are used, which doesn't seem to line up with the code. There is also the weighted_loss, and I'm not sure whether that is used either.
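
For concreteness, the two-temperature variant I'm describing would look roughly like this (hypothetical; using logit_scale_t on the text-to-audio side is exactly the change in question, not what the code currently does):

```python
# Hypothetical two-temperature variant suggested by the paper's wording.
def two_scale_logits(audio_features, text_features, logit_scale_a, logit_scale_t):
    logits_per_audio = logit_scale_a * audio_features @ text_features.T
    logits_per_text = logit_scale_t * text_features @ audio_features.T
    return logits_per_audio, logits_per_text
```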

Could you please clarify?

Thank you.

lukewys commented 1 year ago

Hi, indeed, as you found out, the two temperature parameters are identical in our implementation: both directions are scaled by the same logit_scale. We follow the implementation of the CLIP model; please see https://github.com/mlfoundations/open_clip/blob/main/src/training/train.py#L274. In our experiments, we tried using two temperature parameters in the MLP case, but adding the MLP did not work very well.

Best,
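
For reference, the shared temperature in CLIP-style implementations is a single learnable scalar stored in log space; a minimal sketch (the 1/0.07 initialisation follows CLIP/open_clip, and this is illustrative rather than CLAP's exact code):

```python
import numpy as np
import torch
import torch.nn as nn

# One learnable log-temperature shared by both directions, as in CLIP.
logit_scale = nn.Parameter(torch.ones([]) * np.log(1 / 0.07))

# At loss time the scale is exponentiated (and commonly capped for stability,
# e.g. open_clip clamps it so that exp(logit_scale) <= 100).
scale = logit_scale.exp().clamp(max=100.0)
```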