Closed weiyengs closed 4 years ago
Depending on the length of the video, I guess temporally averaging the descriptors is a good choice for short video clips (i.e. fewer than 100). Otherwise, you will need to aggregate the features with a neural network approach such as NetVLAD.
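For the short-clip case, temporal averaging could look roughly like the sketch below. It assumes the features are already loaded as a NumPy array of shape (T, 2048), where T varies per video. The helper name `average_pool` and the random dummy arrays are made up for illustration and are not part of this repo.

```python
import numpy as np


def average_pool(features: np.ndarray) -> np.ndarray:
    """Collapse a (T, D) array of descriptors into a single (1, D) vector."""
    if features.ndim != 2 or features.shape[0] == 0:
        raise ValueError("expected a non-empty (T, D) array")
    # Mean over the temporal axis; keepdims keeps the leading dimension of 1.
    return features.mean(axis=0, keepdims=True)


# Dummy arrays standing in for extracted features (hypothetical data).
rng = np.random.default_rng(0)
clip_a = rng.standard_normal((33, 2048)).astype(np.float32)
clip_b = rng.standard_normal((12, 2048)).astype(np.float32)

pooled_a = average_pool(clip_a)
pooled_b = average_pool(clip_b)
print(pooled_a.shape, pooled_b.shape)
```

Both clips end up with the same shape regardless of T, so they can be stacked into one matrix for a downstream classifier. Averaging discards temporal order, which is one reason a learned aggregation such as NetVLAD may work better on longer videos.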
Hey thanks for the idea!
Hi, great work there! I noticed that the dimension of the output is not fixed and depends on the video length (e.g. some are 33x2048, some are 12x2048, etc.).
What's the best way to get them to a single dimension (e.g. 1x2048)?
Thanks!