Thanks for your excellent work. I have one question about extracting the 2D appearance features. When using ResNet-152 as the backbone, the output of layer4 (before average pooling) is [len, 2048, 7, 7], where len is the length of the clip. After stacking the clips, I get [T, len, 2048, 7, 7].
Could you share how you process the ResNet-152 output to obtain the appearance feature that the paper describes as having dimension T*d?
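To make the shapes concrete, here is a minimal NumPy sketch of one guess on my side: average over the 7x7 spatial grid and then over the frames of each clip, which gives [T, 2048]. The helper name `pool_appearance` and the choice of mean pooling are my own assumptions; I do not know whether this matches what the paper does.

```python
import numpy as np


def pool_appearance(feats: np.ndarray) -> np.ndarray:
    """Collapse stacked layer4 maps [T, len, C, H, W] to [T, C].

    Assumption: plain mean pooling over space and frames.
    This is a guess, not the method confirmed by the paper.
    """
    if feats.ndim != 5:
        raise ValueError(f"expected 5 dims [T, len, C, H, W], got {feats.ndim}")
    per_frame = feats.mean(axis=(3, 4))  # [T, len, C]: average the 7x7 grid
    per_clip = per_frame.mean(axis=1)    # [T, C]: average frames within a clip
    return per_clip


# Small random stand-in for stacked layer4 outputs (T=2 clips, len=4 frames).
feats = np.random.rand(2, 4, 2048, 7, 7).astype(np.float32)
print(pool_appearance(feats).shape)
```

If the intended d is not 2048, then the reduction must be something else (for example a projection layer), which is what I would like to confirm.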
Thanks very much.