I noticed that during self-supervised pretraining, MoCo's output dimension is set to 128, and the InfoNCE loss is computed on these 128-dimensional features. However, when training the linear head, the fully connected classification layer is attached to the 2048-dimensional backbone features instead. In my opinion, if the 128-dimensional output represents the latent features, it would make more sense to attach the classification head to that 128-dimensional output.
So may I ask why the code is implemented this way?
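For reference, here is a minimal sketch of the two setups I mean, assuming a torchvision ResNet-50 backbone as in the MoCo paper; the identifiers (`backbone`, `proj_head`, `linear_probe`) are illustrative, not the repository's actual names:

```python
import torch
import torch.nn as nn
import torchvision.models as models

# --- Self-supervised pretraining ---
# ResNet-50 trunk; strip its classifier so it outputs 2048-d pooled features.
backbone = models.resnet50()
backbone.fc = nn.Identity()

# 128-d projection head: only the InfoNCE loss sees these embeddings.
# (MoCo v2 uses a 2-layer MLP here instead of a single linear layer.)
proj_head = nn.Linear(2048, 128)

x = torch.randn(8, 3, 224, 224)
feat_2048 = backbone(x)                                       # (8, 2048)
z_128 = nn.functional.normalize(proj_head(feat_2048), dim=1)  # (8, 128), used by InfoNCE

# --- Linear evaluation ---
# The projection head is discarded; the frozen 2048-d backbone features
# feed the linear classification head.
for p in backbone.parameters():
    p.requires_grad = False
linear_probe = nn.Linear(2048, 1000)  # e.g. 1000 ImageNet classes
logits = linear_probe(backbone(x))    # (8, 1000)
```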