Thank you for your great work! I'm curious about the in-between task and the motion prediction task, and I have a question about how they are set up.
I found in your code that both tasks are performed at the token level. For the in-between task, you fix the first and last 25% of the tokens, and for the prediction task, you condition on the first 20% of the tokens. However, the encoder that generates the tokens uses convolutions, so each token has a receptive field that spans several frames. This means that in both tasks the retained tokens may encode information from a wider range of frames in the original sequence than the nominal 25% or 20%, which could make the comparison with methods that condition on exactly that fraction of frames unfair. Am I misunderstanding something, or is this a real issue?
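To make the concern concrete, here is a rough sketch of how I would estimate the number of frames that the retained tokens can depend on. The layer list below is a hypothetical encoder that I made up for illustration, not the configuration in your repository, and the estimate ignores padding, so it is only an upper bound.

```python
def receptive_field(layers):
    """Return (receptive_field_in_frames, frames_per_token) for stacked 1D convs.

    layers: list of (kernel_size, stride, dilation) tuples, ordered input to output.
    """
    rf, jump = 1, 1
    for kernel, stride, dilation in layers:
        # Each layer widens the field by (kernel - 1) * dilation steps,
        # measured in units of the current spacing between its inputs.
        rf += (kernel - 1) * dilation * jump
        jump *= stride
    return rf, jump


def frames_covered(num_tokens, layers):
    """Upper bound on the frames the first `num_tokens` tokens depend on (padding ignored)."""
    rf, jump = receptive_field(layers)
    return rf + (num_tokens - 1) * jump


# Hypothetical encoder (an assumption, NOT the repository's real settings):
# two stride-2 downsampling convs interleaved with kernel-3 convs.
HYPOTHETICAL_LAYERS = [(3, 1, 1), (4, 2, 1), (3, 1, 1), (4, 2, 1), (3, 1, 1)]

rf, jump = receptive_field(HYPOTHETICAL_LAYERS)
num_tokens = 10  # e.g. the first 20% of a 50-token sequence
nominal = num_tokens * jump
covered = frames_covered(num_tokens, HYPOTHETICAL_LAYERS)
print(f"receptive field per token: {rf} frames, stride: {jump} frames/token")
print(f"nominal frames: {nominal}, frames possibly seen: up to {covered}")
```

With this made-up configuration, 10 tokens nominally correspond to 40 frames but could depend on up to 60. The real numbers depend on your actual kernel sizes, strides, dilations, and padding.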
Thanks again!