Curiosity about Model Choice: Swin-based vs. ViTPose with PCT

Hello @Gengzigang and team,

The idea of representing human pose as compositional tokens (PCT) is both unique and compelling. Modeling the relationships between keypoints in such a structured manner is pretty inspiring.

However, I have a question regarding your model choice. I noticed that you opted for a Swin-based backbone in your implementation. Given the current success and traction of ViTPose, I'm curious why you didn't integrate PCT directly with ViTPose. Was there a specific reason or advantage for preferring the Swin-based model over ViTPose when incorporating PCT?

Thank you for taking the time to answer. I'm eager to delve deeper into your work and truly appreciate the effort you've put into this research. Looking forward to your insights!

Warm regards, Jia-Yau