Hi Hsin-Ying,
Thank you for sharing your code. I would like to use your network in our work, but I am not quite clear on how the network is fine-tuned for the action recognition task in your paper. It would be very helpful if you could clarify this for me.
To predict the action of a clip, how did you combine the features of the clip's frames? For example, did you take the last conv layer's features of all frames, concatenate them, and feed them to FC layers? Or did you sample only one frame per clip and classify the action based on that single frame? The sketch below shows the two alternatives I have in mind.
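To make the question concrete, here is a minimal PyTorch sketch of the two options I am asking about. The shapes, module names, and the random single-frame sampling are my own illustrative assumptions, not taken from your code or paper:

```python
import torch
import torch.nn as nn

# Hypothetical shapes: B clips, T frames per clip, and a backbone that maps
# each frame to a D-dimensional feature from its last conv layer.
B, T, D, NUM_CLASSES = 4, 16, 512, 101

frame_features = torch.randn(B, T, D)  # per-frame conv features for one batch of clips

# Option 1: concatenate the features of all T frames and feed them to FC layers.
concat_head = nn.Sequential(
    nn.Linear(T * D, 1024),
    nn.ReLU(inplace=True),
    nn.Linear(1024, NUM_CLASSES),
)
logits_concat = concat_head(frame_features.reshape(B, T * D))  # (B, NUM_CLASSES)

# Option 2: sample a single frame per clip and classify from that frame alone.
single_frame_head = nn.Linear(D, NUM_CLASSES)
sampled = frame_features[:, torch.randint(T, (1,)).item(), :]  # one sampled frame index
logits_single = single_frame_head(sampled)  # (B, NUM_CLASSES)

print(logits_concat.shape, logits_single.shape)
```

Is your fine-tuning setup closer to one of these, or something else entirely (e.g., averaging or pooling over frames)?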
Thanks! I saw that you have answered similar questions, but I am still confused.