Closed Adamdad closed 11 months ago

The zeroshot_classification.py script contains code (https://github.com/LAION-AI/CLIP_benchmark/blob/main/clip_benchmark/metrics/zeroshot_classification.py#L50) that normalizes the text embeddings twice. Specifically, PyTorch's F.normalize is called to normalize the text embeddings along the last dimension, and the resulting tensor is then averaged along the first dimension to obtain a single embedding vector. The code then normalizes this single vector again, dividing it by class_embedding.norm(). This second normalization appears to be redundant and can be safely removed.

I believe the second norm is necessary: note that class_embedding is the mean of multiple class embeddings and isn't guaranteed to be normalized.

Indeed, after averaging, unit norm is no longer guaranteed, so the second normalization is still needed.
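The point can be checked with a small stand-alone sketch in plain Python (stdlib only; the helper names l2_normalize and l2_norm are hypothetical stand-ins for F.normalize(..., dim=-1) and Tensor.norm()): the mean of two unit vectors generally has norm below 1, so the second normalization is not redundant.

```python
import math

def l2_normalize(v):
    """Scale v to unit L2 norm (per-row analogue of F.normalize(..., dim=-1))."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def l2_norm(v):
    """L2 norm of a vector (analogue of Tensor.norm())."""
    return math.sqrt(sum(x * x for x in v))

# Two hypothetical prompt-template embeddings, already unit-normalized.
e1 = l2_normalize([1.0, 0.0])
e2 = l2_normalize([0.0, 1.0])

# Average them, as the script does to build one class embedding.
class_embedding = [(a + b) / 2 for a, b in zip(e1, e2)]
print(l2_norm(class_embedding))  # about 0.7071 -- averaging broke unit norm

# The second normalization (class_embedding / class_embedding.norm())
# restores unit length.
class_embedding = l2_normalize(class_embedding)
print(l2_norm(class_embedding))  # 1.0
```

The mean vector here is [0.5, 0.5] with norm sqrt(0.5) ≈ 0.7071, so without the second normalization the class embedding would silently down-weight classes whose prompt embeddings disagree.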