dahiyaaneesh / peclr

This is the pretraining code for PeCLR, an equivariant contrastive learning framework for 3D hand pose estimation. The paper was presented at ICCV 2021.
https://ait.ethz.ch/projects/2021/PeCLR/
MIT License

Questions about the dim of encodings #1

Open RichardChen20 opened 2 years ago

RichardChen20 commented 2 years ago

Hi, your work is very interesting! However, I'm not entirely clear about the inverse transformations in your paper and code. I noticed that in Eq. 3 of your ICCV paper, the dimension of 'z_i' is m x 2. May I ask why the last dimension is 2? Is the encoding z_i not a feature vector, but rather the x-y coordinates of some important keypoints? Also, what 'm' means is not very clear in the paper. I hope you can reply!

dahiyaaneesh commented 2 years ago

Hi @SeanChen0220, thanks for your interest in the paper and the work. Yes, the dimension of the projection z is m x 2. We treat the projections generated by PeCLR during pretraining as m key points in an arbitrary 2D space. The PeCLR loss function ensures that a 2D affine transform applied to the image corresponds to the same affine transform applied to these m key points in the projection space. I hope this clarifies why the last dimension is 2. 'm' is the number of key points in the projection space; it is 128/2 = 64 in our pretrained models.

I hope the answer helps :)
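For concreteness, here is a minimal PyTorch sketch of this interpretation (the shapes, variable names, and the affine transform are illustrative assumptions, not code from the repository): the 128-dimensional projection is reshaped into 64 points in a 2D space, and an affine transform of the image corresponds to the same transform acting on those points.

```python
import math
import torch

# Illustrative shapes (assumptions, not the repository's actual variables):
# batch of B images, projection head output of size 128.
B, proj_dim = 8, 128
m = proj_dim // 2  # 64 "key points" in the 2D projection space

z = torch.randn(B, proj_dim)   # projection head output
z_kp = z.view(B, m, 2)         # interpreted as m points in an arbitrary 2D space

# A 2D affine transform (rotation + scale here) applied to the input image
# should correspond to the same transform acting on the m projected points.
angle, scale = 0.3, 1.2
A = scale * torch.tensor([[math.cos(angle), -math.sin(angle)],
                          [math.sin(angle),  math.cos(angle)]])

z_kp_equivariant = z_kp @ A.T  # equivariance target for the transformed image's projection
```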

RichardChen20 commented 2 years ago

Thank you so much for your reply! Now I understand why the last dimension is 2. Furthermore, I wonder whether you have considered this setting: the output feature map of ResNet-50's 'layer4' is C x H x W (e.g., 2048 x 8 x 6) and is usually followed by an avgpool layer. Have you tried transforming the feature map directly, rather than transforming the projection z? That also seems to be a reasonable equivariant representation learning scheme, and it might allow a more detailed feature comparison than comparing the coordinates of a few keypoints. (A sketch of this idea follows below.)
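As a minimal sketch of this idea (illustrative only; the shapes and the transform are assumptions, and this is not code from the repository), the layer4 feature map could be warped with the same affine transform using affine_grid/grid_sample:

```python
import torch
import torch.nn.functional as F

# Hypothetical ResNet-50 'layer4' feature map: (B, C, H, W), e.g. 2048 x 8 x 6.
feat = torch.randn(8, 2048, 8, 6)

# One 2x3 affine matrix per sample, matching the transform applied to the images
# (identity plus a small translation here, purely for illustration).
theta = torch.tensor([[1.0, 0.0, 0.1],
                      [0.0, 1.0, -0.1]]).repeat(8, 1, 1)

grid = F.affine_grid(theta, feat.size(), align_corners=False)
feat_t = F.grid_sample(feat, grid, align_corners=False)  # spatially transformed features
```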

spurra commented 2 years ago

This is an interesting idea. We have not tried it yet, and we encourage you to go ahead and let us know what results you achieve. Note that there is evidence in the self-supervised literature (see SimCLRv2) that a multi-layer projection head is beneficial for downstream performance, so transforming the feature map directly may not work as well.

RichardChen20 commented 2 years ago

Thank you for your reply! I'm running some related experiments to evaluate whether this setting (feature-map contrastive learning) works. May I ask for some details about your experiments, for reference? I'm training a ResNet on person images from the COCO dataset, and the mean positive similarity is close to 1, but the mean negative similarity is close to 0.5. I remember that in MoCo the negative value ends up close to 0. Is the mean negative value in your equivariant hand contrastive-learning experiments also close to 0, or is it some other value?
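For reference, here is a minimal diagnostic sketch of how such mean positive/negative cosine similarities can be computed from two augmented views (an assumed setup for illustration, not code from either project):

```python
import torch
import torch.nn.functional as F

def mean_pos_neg_similarity(z1: torch.Tensor, z2: torch.Tensor):
    """Mean cosine similarity of positive (matched) and negative (mismatched)
    pairs, given projections z1, z2 of shape (B, D) from two augmented views."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    sim = z1 @ z2.T                                    # (B, B) pairwise similarities
    pos = sim.diag().mean()                            # matched pairs (positives)
    neg_mask = ~torch.eye(sim.size(0), dtype=torch.bool)
    neg = sim[neg_mask].mean()                         # mismatched pairs (negatives)
    return pos.item(), neg.item()
```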