BIT-MJY / SeqOT

[TIE 2022] SeqOT: A Spatial-Temporal Transformer Network for Place Recognition Using Sequential LiDAR Data.
MIT License

why out_l = torch.cat((out_l_1, out_l), dim=1) ? #11

Open wpumain opened 1 year ago

wpumain commented 1 year ago

https://github.com/BIT-MJY/SeqOT/blob/d940882de7ff49dfaec5c1d684c4d4e02da82ed1/modules/seqTransformerCat.py#L73

out_l = torch.cat((out_l_1, out_l), dim=1)

This operation, out_l = torch.cat((out_l_1, out_l), dim=1), is actually using a concept similar to ResNet's skip connections, where features from earlier layers are combined with features from later layers. Right?

Why not directly perform element-wise addition, which would also maintain the data dimensionality?

Why isn't there a similar skip connection for the output of transformer_encoder2 at the same time?

BIT-MJY commented 1 year ago

Why not directly perform element-wise addition, which would also maintain the data dimensionality?

Actually, we explicitly concatenate the features along the channel dimension to increase the embedding dimension of the sentence-like input. We think this operation improves the distinguishability of the spatial features, and the experimental results also support this.
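For intuition, here is a minimal sketch contrasting the two options (not the repository's exact code; the shapes are illustrative):

```python
import torch

# Illustrative shapes: (batch, channels, sequence length).
out_l_1 = torch.randn(1, 256, 900)  # features from the earlier stage
out_l   = torch.randn(1, 256, 900)  # features from the later stage

# Concatenation along the channel dimension (dim=1): the embedding
# dimension doubles (256 -> 512), so downstream layers see both feature
# sets side by side and can weight them separately.
cat_skip = torch.cat((out_l_1, out_l), dim=1)  # shape (1, 512, 900)

# Element-wise addition: the dimensionality is preserved (256), but the
# two feature sets are blended and can no longer be distinguished.
add_skip = out_l_1 + out_l                     # shape (1, 256, 900)
```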

Why isn't there a similar skip connection for the output of transformer_encoder2 at the same time?

Nice question. We chose not to use a concat-based skip connection there in order to keep the running efficiency. Moreover, our experiments have shown that an addition-based skip connection actually results in worse place recognition performance, which warrants further analysis. It's possible that the triplet loss provides a relatively "soft" constraint for training the place recognition network: although the addition-based skip connection accelerates the reduction of the triplet loss during training, it does not lead to a corresponding increase in recall on the test set.
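For reference, the addition-based variant discussed here would look roughly like the following sketch (transformer_encoder2 is a stand-in PyTorch module with illustrative hyperparameters, not the repository's exact layer):

```python
import torch
import torch.nn as nn

# Stand-in for the second transformer encoder; hyperparameters are illustrative.
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=4)
transformer_encoder2 = nn.TransformerEncoder(encoder_layer, num_layers=1)

x = torch.randn(900, 1, 512)        # (sequence, batch, embedding)
out = transformer_encoder2(x) + x   # addition-based skip: shape is unchanged
```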

Btw, thanks for your interest in our work. You can also follow our latest place recognition work CVTNet, which provides even better recognition results.

wpumain commented 1 year ago

Thank you for your help with the features. I have learned a lot from your excellent SeqOT. Can I ask you three more questions?

1. The NetVLAD you used is not the original NetVLAD method, right? Because you do not perform clustering when computing NetVLAD. What is the basis for your computation?

2. In general VPR tasks, feature aggregation after the backbone usually increases or maintains the channel dimension C. In SeqOT, the tensor before NetVLAD aggregation has shape (7, 512, 2700, 1), and inside NetVLAD the descriptor is reduced directly from (7, 32768 = 512*64) to (7, 256). That is, after feature fusion the channel dimension does not increase but decreases (from 512 to 256), which differs from the usual VPR practice. Intuitively, the higher the dimension, the more information the descriptor can represent, and the easier it should be to reach SOTA; likewise, after fusing multiple feature vectors into fewer vectors, one would expect to increase their dimension to better preserve the pre-fusion information. Yet you achieved SOTA this way. How should this be understood?

3. How was [Fig. 5: The t-SNE visualization of place clustering] generated? Where did the data for OT, Output of MSM, and SeqOT come from?
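On question 3, t-SNE figures of this kind are typically produced by embedding the global descriptors of many scans into 2-D and coloring the points by place, running the same procedure once per descriptor source (OT, the MSM output, SeqOT). A generic sketch, assuming hypothetical descriptor and label arrays (these file names are not part of SeqOT):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Hypothetical inputs: N global descriptors and an integer place id per scan.
descriptors = np.load("descriptors.npy")   # shape (N, 256)
place_ids = np.load("place_ids.npy")       # shape (N,)

# Embed the high-dimensional descriptors into 2-D for visualization.
embedded = TSNE(n_components=2, perplexity=30).fit_transform(descriptors)

plt.scatter(embedded[:, 0], embedded[:, 1], c=place_ids, s=5, cmap="tab10")
plt.title("t-SNE of place descriptors")
plt.show()
```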

wpumain commented 1 year ago

The NetVLAD aggregation method you are using is more convenient than the original NetVLAD algorithm because it does not require caching the features extracted from the model backbone, and therefore does not require clustering over those features. But what is the mathematical basis for your approach?
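For reference, a clustering-free NetVLAD-style layer can be written with both the soft-assignment weights and the cluster centroids as ordinary learnable parameters trained end to end, so no k-means initialization over cached backbone features is needed. A minimal sketch, assuming C=512 channels and K=64 clusters as in this thread (an illustration of the general formulation, not the repository's exact implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAssignVLAD(nn.Module):
    """NetVLAD-style aggregation with learnable centroids (no k-means init)."""

    def __init__(self, dim=512, num_clusters=64, out_dim=256):
        super().__init__()
        # 1x1 conv gives per-feature soft-assignment logits over K clusters:
        # a_k(x_i) = softmax_k(w_k^T x_i + b_k)
        self.conv = nn.Conv2d(dim, num_clusters, kernel_size=1)
        # Centroids c_k are free parameters, learned by backpropagation.
        self.centroids = nn.Parameter(torch.randn(num_clusters, dim))
        # Final projection, e.g. 64*512 = 32768 -> 256 as discussed above.
        self.proj = nn.Linear(num_clusters * dim, out_dim)

    def forward(self, x):                                   # x: (B, C, N, 1)
        B, C, N, _ = x.shape
        soft = F.softmax(self.conv(x), dim=1).view(B, -1, N)           # (B, K, N)
        x = x.view(B, C, N)
        # V(k) = sum_i a_k(x_i) * (x_i - c_k)
        residual = x.unsqueeze(1) - self.centroids.view(1, -1, C, 1)   # (B, K, C, N)
        vlad = (residual * soft.unsqueeze(2)).sum(dim=-1)              # (B, K, C)
        vlad = F.normalize(vlad, dim=2)                  # intra-normalization
        vlad = F.normalize(vlad.reshape(B, -1), dim=1)   # (B, K*C), L2-normed
        return self.proj(vlad)                           # (B, out_dim)

# Example with small illustrative shapes (the thread mentions (7, 512, 2700, 1)):
feats = torch.randn(2, 512, 300, 1)
desc = SoftAssignVLAD()(feats)   # (2, 256)
```

Because the centroids are optimized jointly with the rest of the network, the layer keeps the original NetVLAD residual-aggregation math while dropping the offline clustering step; the trailing linear layer accounts for the 32768-to-256 reduction noted above.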