Some questions about the lite version and ++ version

Dear authors, thank you for your great work. I have some questions about the lite version and the ++ version in your paper.

1. With the ResNet-101 backbone, the ++ version outperforms the lite version by about 1.x AP@50. Why does this change when Swin-B is used as the backbone?

2. What are the training settings for the Swin-B version, and what is the FPS of the single-frame baseline?
3. Why is the lite version so fast, and why does its accuracy drop significantly compared to the single-frame baseline when the window size is 1?

I would appreciate your response.

SJTU-LuHe / TransVOD