Open wangdong0556 opened 1 year ago
See Table 1 (6.6M) and Table 2 (7.7M).
There is a patch embedding layer in the middle. The input size for COCO is 256×192, while that for MPII is 256×256. The CNN backbone downsamples them to 64×48 and 64×64, respectively. Thus, we need to use different patch sizes (4×3 for COCO and 4×4 for MPII) to get the same number of visual tokens (i.e., 16×16).
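A small sketch (not the repository's code) of the arithmetic above: the same 16×16 token grid falls out of both feature-map sizes once the per-dataset patch sizes are applied.

```python
def num_tokens(feat_h, feat_w, patch_h, patch_w):
    """Number of visual tokens when a feature map is split into patches."""
    assert feat_h % patch_h == 0 and feat_w % patch_w == 0
    return (feat_h // patch_h) * (feat_w // patch_w)

# COCO: 256x192 input -> 64x48 feature map, patch size 4x3
print(num_tokens(64, 48, 4, 3))  # 16 * 16 = 256 tokens

# MPII: 256x256 input -> 64x64 feature map, patch size 4x4
print(num_tokens(64, 64, 4, 4))  # 16 * 16 = 256 tokens
```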
Thank you. The number of parameters should depend only on the size of the network model, not on the input size. Do you use CNNs of different sizes in your network?
For transformer models, the number of parameters can actually depend on the input size if different patch sizes are used. Besides, if you design a fully connected network, the number of parameters also depends on the input size. Thus, your comment that "the parameters of the network should only be related to the size of the network model, not to the input size" may not be accurate.
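To make the dependence concrete, here is a sketch of the parameter count of a patch-embedding linear projection. The channel count and embedding dimension below are hypothetical illustration values, not the paper's actual hyperparameters; the point is only that a different patch size changes the layer's input features and hence its parameter count.

```python
def patch_embed_params(patch_h, patch_w, in_channels, embed_dim):
    """Weights + bias of a linear layer mapping a flattened patch to embed_dim."""
    in_features = patch_h * patch_w * in_channels
    return in_features * embed_dim + embed_dim

C, D = 32, 192  # hypothetical backbone channels / token dimension

coco = patch_embed_params(4, 3, C, D)  # 4*3*32*192 + 192 = 73920
mpii = patch_embed_params(4, 4, C, D)  # 4*4*32*192 + 192 = 98496
print(coco, mpii)  # the two models differ only in this layer's size
```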
Moreover, I follow all hyperparameters from TokenPose rather than designing the CNN myself. The CNN backbone differs for TokenPose-Small, TokenPose-Base, and TokenPose-Large. Please refer to https://github.com/leeyegy/TokenPose
Thanks. I see that TransPose ( https://github.com/yangsenius/TransPose ) has a consistent parameter count across different input sizes. Is there a difference between the two? In addition, is the parameter count of SimpleBaseline-Res152 in Table 2 a typo?
TokenPose has a linear projection layer that makes different inputs produce the same number of tokens (i.e., 16×16). In contrast, TransPose does not have one, so different input sizes lead to different numbers of tokens.
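A sketch of the contrast: without a projection to a fixed grid, the token count simply follows the feature-map size (in TransPose, each spatial position of the feature map becomes a token), so COCO and MPII inputs yield different token counts.

```python
def transpose_style_tokens(feat_h, feat_w):
    """One token per feature-map position (no fixed-grid projection)."""
    return feat_h * feat_w

print(transpose_style_tokens(64, 48))  # COCO feature map: 3072 tokens
print(transpose_style_tokens(64, 64))  # MPII feature map: 4096 tokens
```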
As for the parameter count of SimpleBaseline-Res152, it is a typo; it should be 68.3M in Table 2. Sorry for the inconvenience.
Why are the model parameters of PPT-S different on the COCO and MPII datasets?