VicenteVivan / geo-clip

This is an official PyTorch implementation of our NeurIPS 2023 paper "GeoCLIP: Clip-Inspired Alignment between Locations and Images for Effective Worldwide Geo-localization"
https://arxiv.org/abs/2309.16020
MIT License

Questions regarding the Loss function #6

Closed Oshkr closed 6 months ago

Oshkr commented 6 months ago

Dear authors,

I am currently working on reproducing the results from your paper. It seems you haven't included any code for the implementation of your loss function, so I have some questions on the matter.

[image: screenshot of the loss function from the paper]

From my understanding of the loss, you have modified it to account for the dynamic queue (additional GPS embeddings), where:

- $P$ corresponds to the different views of an image in a given batch; let's take it to be 1 view for simplicity.
- $V$ is the embedded image.
- $L$ is the embedded GPS coordinate.

This simplifies the Loss for a single view of a single image in a batch to the following:

$$L_i = - \log \frac{\exp(V_i \cdot L_i / \tau)}{\sum_{i = 0} \exp(V_i \cdot L_i / \tau) + \sum_{i = 0} \exp(V_i \cdot \tilde{L}_i / \tau)}$$

where, in the denominator, the first sum runs over the batch of length $B$ and the second sum runs over the dynamic queue of length $S$.

My questions are the following:

i. It seems like you are using the same index $i$ for the $i^{\text{th}}$ sample of the batch, the sum over the batch, and the sum over the dynamic queue. Should the sums instead use a separate index, say $k$, as follows?

$$L_i = - \log \frac{\exp(V_i \cdot L_i / \tau)}{\sum_{k = 0} \exp(V_i \cdot L_k / \tau) + \sum_{k = 0} \exp(V_i \cdot \tilde{L}_k / \tau)}$$

By doing so, you do contrastive learning of each image over all other coordinates while keeping the same image $V_i$ in the denominator.

I look forward to hearing from you! Thanks.

VicenteVivan commented 6 months ago

Hi Oskar,

Thank you for your interest in our work. Regarding your questions:

> i. It seems like you are using the same index $i$ for the $i^{\text{th}}$ sample of a batch, the sum over the batch, and the sum over the dynamic queue.

Yes, you are correct. As you mentioned, we do contrastive learning by comparing the encodings of each image to those of all other coordinates, while keeping the image being compared constant (i.e., we apply the Cross Entropy Loss independently to each row of the similarity matrix). More concretely, for a given batch, once the features of the images and of the GPS coordinates of the batch and the queue have been obtained, the loss would be calculated as follows:

```python
import torch
from torch import nn
import torch.nn.functional as F

BATCH_SIZE = 256
QUEUE_SIZE = 4096
FEATURE_DIM = 512
TEMPERATURE = torch.rand([])  # random stand-in for the learnable temperature parameter

# Random stand-ins for the image features and the GPS features of the batch and the queue
image_embeddings = torch.rand(BATCH_SIZE, FEATURE_DIM)
gps_embeddings = torch.rand(BATCH_SIZE, FEATURE_DIM)
gps_queue_embeddings = torch.rand(QUEUE_SIZE, FEATURE_DIM)

# Criterion & Targets (the positive GPS embedding for image i is at column i)
criterion = nn.CrossEntropyLoss()
targets_img_gps = torch.arange(BATCH_SIZE, dtype=torch.long)

# for (img_batch, gps_batch) in epoch:

#  ... forward pass & queue update ...

# Normalize the Embeddings
image_embeddings = F.normalize(image_embeddings, dim=1)
gps_embeddings = F.normalize(gps_embeddings, dim=1)
gps_queue_embeddings = F.normalize(gps_queue_embeddings, dim=1)

# Append GPS Queue
gps_embeddings_all = torch.cat([gps_embeddings, gps_queue_embeddings], dim=0)

# Exponentiate the temperature parameter to get the logit scale (as in CLIP)
temp = TEMPERATURE.exp()

# Compute the logits (scaled cosine similarities)
logits_img_gps = temp * (image_embeddings @ gps_embeddings_all.T)

# Compute the loss (cross entropy applied independently to each row)
img_gps_loss = criterion(logits_img_gps, targets_img_gps)

print(logits_img_gps.shape)  # (BATCH_SIZE, BATCH_SIZE + QUEUE_SIZE)
print(img_gps_loss)
```

> ii. If it is true that you do contrastive learning of each image over all other coordinates, why did you decide not to do contrastive learning of each GPS coordinate over all other images? In fact, in the original CLIP paper, the Cross Entropy Loss is applied both horizontally and vertically, yet you have chosen to use it only horizontally. Is there a specific reason for this decision?

That's a good observation. In fact, we originally considered and implemented this idea during the early stages of the project. From what we observed, this minor modification did not provide any improvement over its horizontal-only counterpart. On top of that, given that adding a queue of GPS coordinates significantly improved GeoCLIP's performance, applying the vertical loss would have complicated the loss function, since there are no positive image samples for the GPS coordinates in the queue. Thus, we decided not to include it in the final method.
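
For reference, here is a minimal sketch of what such a symmetric (horizontal + vertical) variant could look like, reusing the variables from the snippet above; the vertical term is restricted to the in-batch GPS embeddings, since the queued coordinates have no paired images. This is only an illustration of the idea, not part of GeoCLIP's loss:

```python
# Hypothetical symmetric variant (NOT used in GeoCLIP), continuing the snippet above.

# Horizontal term: each image vs. all GPS embeddings (batch + queue), as before.
loss_img_to_gps = criterion(logits_img_gps, targets_img_gps)

# Vertical term: each in-batch GPS embedding vs. the images of the batch only,
# since the queued GPS coordinates have no corresponding images in the batch.
logits_in_batch = temp * (image_embeddings @ gps_embeddings.T)  # (BATCH_SIZE, BATCH_SIZE)
loss_gps_to_img = criterion(logits_in_batch.T, targets_img_gps)

symmetric_loss = 0.5 * (loss_img_to_gps + loss_gps_to_img)
```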

> iii. Going back to the $P$ augmented views: you mention in your paper that a benefit of using a frozen CLIP backbone is that one can pre-encode all images, making the training process faster. Yet if you perform $P$ augmentations for each image in each batch, didn't you have to re-encode the augmented images, and thus lose this benefit?

For each image in our training set, we didn't pre-encode just a single embedding; we pre-encoded $n$ augmentations of it (with $n = 10$ in our particular case). Then, during training, we sample a subset of these augmentations for each corresponding image.
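
For illustration, here is a rough sketch of this idea (the names and the number of sampled views below are made up, not the repository's actual data pipeline):

```python
import torch

NUM_IMAGES = 1000        # illustrative dataset size
NUM_AUGMENTATIONS = 10   # n = 10 augmentations pre-encoded per image
VIEWS_PER_STEP = 2       # views sampled per image at each step (illustrative)
FEATURE_DIM = 512

# Offline: encode n augmented views of every image once with the frozen CLIP backbone.
# Random tensors stand in for those pre-computed features here.
precomputed_views = torch.rand(NUM_IMAGES, NUM_AUGMENTATIONS, FEATURE_DIM)

# Online: at each training step, sample a subset of the stored augmentations
# instead of re-running the (frozen) image encoder.
idx = torch.randint(0, NUM_AUGMENTATIONS, (NUM_IMAGES, VIEWS_PER_STEP))
sampled_views = torch.gather(
    precomputed_views, 1, idx.unsqueeze(-1).expand(-1, -1, FEATURE_DIM)
)  # (NUM_IMAGES, VIEWS_PER_STEP, FEATURE_DIM)
```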

Please let us know if you have any more questions.

Oshkr commented 5 months ago

Hi Vicente,

Thank you for your thorough response!