RichardChen20 opened this issue 3 years ago

Hi, your work is very interesting! However, I'm not entirely clear about the inverse transformations in your paper and code. I noticed that in Eq. 3 of your ICCV paper, the dimension of z_i is m x 2; may I ask why the last dimension is 2? Or is the encoding z_i not a feature vector but the x-y coordinates of some important key points? Also, what 'm' means is not entirely clear from the paper. I hope you can reply!
Hi @SeanChen0220, thanks for your interest in the paper and the work. Yes, the dimension of the projection z is m x 2. We assume the projections generated by PeCLR during pretraining are m key points in an arbitrary 2D space. The PeCLR loss function ensures that a 2D affine transform on the image corresponds to a similar affine transform on these m key points in the projection space. I hope this clarifies why the last dimension is 2. 'm' is the number of key points in the projection space; it is 128/2 = 64 in our pretrained models.
I hope the answer helps :)
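To make the m x 2 interpretation concrete, here is a minimal sketch (not the released PeCLR code; all tensor names and the exact transform handling are illustrative) of reshaping a 128-d projection into 64 pseudo-key-points and undoing the image-space affine transform so the two views become comparable:

```python
import torch

# Minimal sketch (not the released PeCLR code): a 128-d projection is viewed
# as m = 64 pseudo-key-points in an arbitrary 2D space, and the inverse of the
# image-space affine transform is applied so both views can be compared with
# a standard contrastive loss.

def to_keypoints(z):
    # z: (batch, 128) -> (batch, 64, 2)
    return z.view(z.shape[0], -1, 2)

def apply_inverse_affine(z_kp, inv_theta):
    # z_kp: (batch, m, 2); inv_theta: (batch, 2, 3), the inverse of the
    # 2x3 affine matrix that produced the augmented image.
    linear = inv_theta[:, :, :2]   # rotation / scale / shear part
    trans = inv_theta[:, :, 2]     # translation part
    return z_kp @ linear.transpose(1, 2) + trans.unsqueeze(1)

# z1, z2 = projector(encoder(view1)), projector(encoder(view2))
# aligned_i = apply_inverse_affine(to_keypoints(z_i), inv_theta_i)
# A standard NT-Xent loss can then be computed on the flattened aligned_i.
```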
Thank you so much for your reply! Now I understand why the last dim is 2. Furthermore, I wonder whether you have considered this setting: the output feature map of ResNet50's 'layer4' is C x H x W (maybe 2048 x 8 x 6) and is usually followed by an avgpool layer. Have you tried transforming the feature map directly rather than transforming the projection z? That also seems to be a reasonable equivariant representation learning setup, and it might provide a more detailed feature comparison than comparing the coordinates of some key points.
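One way to prototype this feature-map variant is to warp the layer4 activations with the inverse image transform using PyTorch's `affine_grid`/`grid_sample`; a rough sketch, with all shapes assumed for illustration:

```python
import torch
import torch.nn.functional as F

# Rough sketch of the feature-map idea: warp the layer4 feature map with the
# inverse image-space affine transform instead of transforming the projection.
feat = torch.randn(4, 2048, 8, 6)             # (batch, C, H, W), shapes illustrative
inv_theta = torch.eye(2, 3).repeat(4, 1, 1)   # per-sample inverse 2x3 affine (identity here)

grid = F.affine_grid(inv_theta, list(feat.shape), align_corners=False)
feat_aligned = F.grid_sample(feat, grid, align_corners=False)
# feat_aligned from the two views could then be compared location-wise
# (a dense contrastive loss) or after pooling.
```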
This is an interesting idea. We have not tried it yet, and we encourage you to go ahead and let us know what results you achieve. Note that there is evidence in the self-supervised literature (see SimCLRv2) that a multi-layer projection head is beneficial for downstream performance, so transforming the raw feature map may not work as well.
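For context, the multi-layer projection head referenced above (SimCLRv2-style) is typically just a small MLP on top of the pooled encoder features; a generic sketch, with all dimensions assumed:

```python
import torch.nn as nn

# Generic SimCLRv2-style multi-layer projection head (dimensions assumed):
# pooled 2048-d ResNet50 features -> 128-d projection used by the loss.
projector = nn.Sequential(
    nn.Linear(2048, 2048), nn.BatchNorm1d(2048), nn.ReLU(inplace=True),
    nn.Linear(2048, 2048), nn.BatchNorm1d(2048), nn.ReLU(inplace=True),
    nn.Linear(2048, 128),
)
```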
Thank you for your reply! I'm trying some related experiments to evaluate whether this setting (feature-map contrastive learning) can work. May I ask for some details about your experiments for reference? I'm training a ResNet on person images from the COCO dataset, and the mean positive similarity is close to 1 while the mean negative similarity is close to 0.5. I remember that in MoCo the negative similarity is close to 0, so I wonder whether the mean negative similarity in your equivariant hand contrastive-learning experiments is also close to 0, or some other value?
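For anyone reproducing this, a simple way to monitor the statistics in question is to track the mean cosine similarity of positive vs. negative pairs in each batch; a generic sketch, not tied to any particular codebase:

```python
import torch
import torch.nn.functional as F

# Monitor mean cosine similarity of positive vs. negative pairs
# in a SimCLR-style batch of paired views.
def pos_neg_similarity(z1, z2):
    # z1, z2: (batch, d) projections of the two views of each sample.
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    sim = z1 @ z2.T                         # (batch, batch) cosine similarities
    pos = sim.diag().mean()                 # matched views (positives)
    off = ~torch.eye(len(sim), dtype=torch.bool, device=sim.device)
    neg = sim[off].mean()                   # all mismatched pairs (negatives)
    return pos.item(), neg.item()
```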