HowieMa / PPT

[ECCV 2022] "PPT: token-Pruned Pose Transformer for monocular and multi-view human pose estimation"

about #Params #8

Open wangdong0556 opened 1 year ago

wangdong0556 commented 1 year ago

Why are the model parameters of PPT-S different on the COCO and MPII datasets?

wangdong0556 commented 1 year ago

Look at Table 1 (6.6M) and Table 2 (7.7M).

HowieMa commented 1 year ago

There is a patch embedding layer in the middle. The input size for COCO is 256 × 192, while that for MPII is 256 × 256. The CNN backbone downsamples them to 64 × 48 and 64 × 64, respectively. Thus, we need to use different patch sizes (4 × 3 for COCO and 4 × 4 for MPII) to get the same number of visual tokens (i.e., 16 × 16).
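
A minimal sketch of why this changes the parameter count, assuming a plain linear patch embedding over the CNN feature map (the channel and embedding widths below are illustrative, not the exact PPT-S configuration):

```python
import torch.nn as nn

def patch_embed_params(in_channels, patch_h, patch_w, embed_dim):
    # Linear projection of flattened patches: (C * ph * pw) -> embed_dim
    proj = nn.Linear(in_channels * patch_h * patch_w, embed_dim)
    return sum(p.numel() for p in proj.parameters())

C, D = 32, 192  # illustrative widths

# COCO: 64 x 48 feature map, 4 x 3 patches -> 16 x 16 tokens
print(patch_embed_params(C, 4, 3, D))  # D * (C * 12) + D

# MPII: 64 x 64 feature map, 4 x 4 patches -> 16 x 16 tokens
print(patch_embed_params(C, 4, 4, D))  # D * (C * 16) + D
```

The 4 × 4 patch flattens more feature-map values per token than the 4 × 3 patch, so its projection weight is larger, which is one source of the gap between the two tables.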

wangdong0556 commented 1 year ago

Thank you. The parameters of a network should only be related to the size of the network model, not to the input size. Do you use CNN backbones of different sizes in your network?

HowieMa commented 1 year ago

For transformer models, the number of parameters can actually be related to the input size if we use different patch sizes. Besides, if you design a fully-connected network, the number of parameters is also related to the input size. Thus, your comment "The parameters of the network should only be related to the size of the network model, not to the input size." may not be accurate.
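
As a toy illustration of the second point (unrelated to PPT itself), the first fully-connected layer of a network over a flattened image grows with the input resolution:

```python
import torch.nn as nn

def first_fc_params(h, w, c=3, hidden=256):
    # First fully-connected layer over a flattened c x h x w input.
    fc = nn.Linear(c * h * w, hidden)
    return sum(p.numel() for p in fc.parameters())

print(first_fc_params(256, 192))  # smaller input -> fewer parameters
print(first_fc_params(256, 256))  # larger input -> more parameters
```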

Moreover, I follow all hyperparameters in TokenPose rather than designing the CNN myself. The CNN backbone is different for TokenPose-Small, TokenPose-Base, and TokenPose-Large. Please refer to https://github.com/leeyegy/TokenPose

wangdong0556 commented 1 year ago

Thanks. I see that the parameter count of TransPose is consistent across different input sizes ( https://github.com/yangsenius/TransPose ). Is there a difference between the two? In addition, is the parameter count of SimpleBaseline-Res152 in Table 2 a clerical error?

HowieMa commented 10 months ago

TokenPose has a linear projection layer that makes different inputs produce the same number of tokens (i.e., 16 × 16). In contrast, TransPose does not have it, so a different input size leads to a different number of tokens.
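
To make the contrast concrete (simple arithmetic based on the numbers above, not code from either repo):

```python
# TokenPose-style: the patch size is chosen per dataset, so the token count is fixed.
coco_tokens = (64 // 4) * (48 // 3)  # 16 * 16 = 256
mpii_tokens = (64 // 4) * (64 // 4)  # 16 * 16 = 256

# TransPose-style: one token per feature-map position, so the count varies with input size.
coco_positions = 64 * 48  # 3072
mpii_positions = 64 * 64  # 4096
```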

As for the parameter count of SimpleBaseline-Res152, it is a typo; it should be 68.3M in Table 2. Sorry for the inconvenience.