SJTU-LuHe / TransVOD

This repository contains the code for the paper "End-to-End Video Object Detection with Spatial-Temporal Transformers".
Apache License 2.0

Some questions about the lite version and ++ version #14

Open · wullia opened this issue 2 years ago

Dear authors, thank you for your great work. I have some questions about the lite version and the ++ version described in your paper.

1. With the ResNet-101 backbone, the ++ version outperforms the lite version by about 1.x AP@50. Why does this change when Swin-B is used as the backbone?

2. Could you share the training settings for the Swin-B version and the FPS of the single-frame baseline?

3. Why is the lite version so fast, and why does its accuracy drop significantly compared to the single-frame baseline when the window size is 1?

I would appreciate your response.