SJTU-LuHe / TransVOD

This repository contains the code for the paper "End-to-End Video Object Detection with Spatial-Temporal Transformers".
Apache License 2.0

Some questions about the experiment in this paper #1

Open flying-hou opened 3 years ago

flying-hou commented 3 years ago

After reading your paper, I was deeply inspired. Your work has led to the successful application of Transformers to VOD. However, I have three questions:

  1. What type and how many GPUs were used for the experiments in the paper?
  2. How long does it take to train 10 (or 12) epochs?
  3. What is the inference speed (FPS) of TransVOD? Thanks!
SJTU-LuHe commented 3 years ago

Thanks for your attention to our work. We train our model with 8 NVIDIA Tesla V100 SXM2 32 GB GPUs. In practice, we prefer to use our pre-trained still-image detector (which takes 4.8 hours to train with ResNet-50) as the pre-trained model. The training/inference times for different numbers of reference frames are shown below.

| Number of reference frames | 2 | 4 | 8 | 14 |
|---|---|---|---|---|
| Training time (hours) | 4.9 | 6.7 | 9.8 | 12.7 |
| Inference time (s/image) | 0.2320 | 0.2527 | 0.3447 | 0.6241 |
| mAP (%) | 77.7 | 78.3 | 79.0 | 79.9 |

flying-hou commented 3 years ago

Thank you for your prompt reply.

SJTU-LuHe commented 3 years ago

We would like to update our inference time, as the inference time in the previous response included loss computation, mAP computation, results writing, and so on. With 2, 4, 8, and 14 reference images, the inference time is 88 ms, 123 ms, 213 ms, and 341 ms, respectively.
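
Since the original question asked for FPS, these per-image latencies can be converted directly; below is a minimal sketch of that conversion (assuming batch size 1 and that the quoted times cover model inference only):

```python
# Convert the reported per-image latencies into approximate FPS.
# Assumption: batch size 1, and the quoted times cover model inference only.
latency_ms = {2: 88, 4: 123, 8: 213, 14: 341}  # reference frames -> ms per image

for num_refs, ms in latency_ms.items():
    fps = 1000.0 / ms
    print(f"{num_refs:2d} reference frames: {ms} ms/image ~ {fps:.1f} FPS")
```

This works out to roughly 11.4, 8.1, 4.7, and 2.9 FPS for 2, 4, 8, and 14 reference images, respectively.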

Lotus-95 commented 2 years ago

> Thanks for your attention to our work. We train our model with 8 NVIDIA Tesla V100 SXM2 32 GB GPUs. In practice, we prefer to use our pre-trained still-image detector (which takes 4.8 hours to train with ResNet-50) as the pre-trained model. The training/inference times for different numbers of reference frames are shown below.
>
> | Number of reference frames | 2 | 4 | 8 | 14 |
> |---|---|---|---|---|
> | Training time (hours) | 4.9 | 6.7 | 9.8 | 12.7 |
> | Inference time (s/image) | 0.2320 | 0.2527 | 0.3447 | 0.6241 |
> | mAP (%) | 77.7 | 78.3 | 79.0 | 79.9 |

Is the training time for one epoch or for 10 epochs? How many epochs were these models trained for?