Closed: amaralibey closed this issue 4 years ago
You're right. We don't calculate the gradient of the features from old batches (anchors.transpose()); we only back-propagate the gradient to the current features (M.feats) in:

sim = torch.matmul(anchors.transpose(), M.feats)
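To make the gradient flow concrete, here is a minimal sketch (not the repository's code): the shapes, the names in_dim / feat_dim / queue_size, and the toy encoder are assumptions for illustration only. It shows that the detached queue of anchors receives no gradient, while the current features do.

```python
import torch
import torch.nn as nn

in_dim, feat_dim, queue_size, batch_size = 512, 128, 1024, 8
encoder = nn.Linear(in_dim, feat_dim)   # stand-in for the real backbone

# Embeddings from old batches are detached before entering the queue,
# so they carry no gradient history and behave as constants.
old_x = torch.randn(queue_size, in_dim)
anchors = encoder(old_x).detach().t()   # (feat_dim, queue_size)

# Current batch: these features stay attached to the graph.
feats = encoder(torch.randn(batch_size, in_dim)).t()   # (feat_dim, batch_size)

# Analogue of: sim = torch.matmul(anchors.transpose(), M.feats)
sim = torch.matmul(anchors.t(), feats)  # (queue_size, batch_size)

sim.mean().backward()

print(anchors.requires_grad)            # False: no gradient w.r.t. old batches
print(encoder.weight.grad is not None)  # True: gradient reaches the current features
```

In other words, the queued anchors only scale the loss terms; all the gradient that the backward pass produces goes through the current batch's features and into the encoder.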
I have a question about Algorithm 1 in your paper. You mention that the embeddings (called anchors) are detached from the graph before being put into the queue. Detached nodes don't retain gradient history, so these embeddings are treated as constants in the next iteration, making them irrelevant to the loss function as far as gradients are concerned. I don't understand how the gradient can still be calculated on old batches without retaining the gradient history from the different stages of the forward pass?
Thank you.