Hello,
it might be a silly question, but after a while I still could not figure out what is wrong with my reading of the code.
(QUESTION 1)
In model.model.py, a comment states that the batch goes from (b, C, H, W) ---> (2b, C, H, W) after concatenating images and sketches.
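For concreteness, this is how I read that concatenation step (my own minimal sketch, not the repo's code):

```python
import torch

# Illustrative shapes only: a small batch of photos and their paired sketches.
b, C, H, W = 4, 3, 224, 224
im = torch.randn(b, C, H, W)    # photo batch
sk = torch.randn(b, C, H, W)    # sketch batch

# Stacking along the batch dimension doubles it: (b, C, H, W) ---> (2b, C, H, W).
x = torch.cat([im, sk], dim=0)
print(x.shape)                  # torch.Size([8, 3, 224, 224])
```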
Later on, the batch increases to 4b after self-attention (see the attached image).
However, a quick unit test reveals that the self-attention module does not modify the batch:
Outputs:
torch.Size([3, 197, 768]) [196, ..., 196] [None, ..., None]
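For reference, the check was roughly the following (a minimal sketch that uses torch.nn.MultiheadAttention as a stand-in for the repo's self-attention module, with ViT-B/16-like shapes: batch 3, 196 patch tokens + 1 CLS token, dim 768):

```python
import torch
import torch.nn as nn

# Stand-in for the repo's self-attention module (an assumption, not the actual class).
attn = nn.MultiheadAttention(embed_dim=768, num_heads=12, batch_first=True)

x = torch.randn(3, 197, 768)    # (batch, tokens, dim)
out, _ = attn(x, x, x)          # plain self-attention: query = key = value = x
assert out.shape == x.shape     # the batch dimension is untouched (still 3, not 4b)
print(out.shape)                # torch.Size([3, 197, 768])
```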
I suspect that I do not fully understand how the positive / negative pairs are being passed to the model, and the sparse comments in the code can be a bit cryptic.
(QUESTION 2)
Therefore, my second question is: given a pair (sk, im), how are positives and negatives defined? It is not entirely clear to me even after inspecting the triplet loss function.
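My current reading is something like the following (a hypothetical sketch of how I assume triplets are built, not the repo's actual code):

```python
import torch
import torch.nn.functional as F

# Assumption: anchor = sketch embedding, positive = its paired photo embedding,
# negative = a photo embedding from a different pair in the same batch.
sk_emb = torch.randn(8, 512)             # sketch embeddings (anchor)
im_emb = torch.randn(8, 512)             # paired photo embeddings (positive)
neg_emb = im_emb.roll(shifts=1, dims=0)  # shift by one pair to get in-batch negatives

loss = F.triplet_margin_loss(sk_emb, im_emb, neg_emb, margin=0.2)
```

Is that roughly how the positives and negatives are formed, or are negatives sampled differently?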
(QUESTION 3)
I assume the line in question is aggregating local information from adjacent tokens.
Is this discussed in the paper? I cannot find it in the Relational Network section, which only mentions the MLP-ReLU concatenation.
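To make the question concrete, by "aggregating local information from adjacent tokens" I mean something along these lines (purely illustrative, not the repo's code):

```python
import torch
import torch.nn as nn

# A 1D convolution over the token axis mixes each token with its immediate
# neighbours, i.e. the kind of local aggregation I am asking about.
tokens = torch.randn(3, 197, 768)               # (batch, tokens, dim)
local_mix = nn.Conv1d(768, 768, kernel_size=3, padding=1)

# Conv1d expects (batch, channels, length), hence the transposes.
out = local_mix(tokens.transpose(1, 2)).transpose(1, 2)
print(out.shape)                                # torch.Size([3, 197, 768])
```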
Thanks for your attention, and keep up the good work!