Hello, and thank you for your great work!
I have a question about the full-resolution Swin-T baseline reported in the FastVQA paper. The paper mentions that fixed recognition features were regressed to obtain this baseline. Does this mean that all frames of the video were used (no temporal sampling), with no fragmentation or resizing? Or was the temporally sampled video used as the input to the Swin-T model when generating the fixed features?