HaozheQi / P2B

P2B: Point-to-Box Network for 3D Object Tracking in Point Clouds

Timing details for SC3D #5

Closed SilvioGiancola closed 4 years ago

SilvioGiancola commented 4 years ago

Dear @HaozheQi,

I tried to contact you via email but could not reach you (the email was rejected by your server). Congrats on your paper being accepted as an oral at CVPR'20; merging the 3D Siamese network with VoteNet is a nice idea.

I went through the details of your paper and wondered how you estimated the "Running speed" in section 4.5. You claim that "SC3D in default setting ran with 1.8 FPS on the same platform." However, the original SC3D paper states the following: "Our model takes on average 1.8ms to evaluate 147 candidates." (Section 5 on Timing). Can you clarify where the 1.8 FPS figure originates? I am afraid it might be a typo.

Thank you,

Silvio Giancola

HaozheQi commented 4 years ago

Dear @SilvioGiancola,

Thanks for your attention to our work. Your pioneering effort on this topic inspired us a lot.

We re-checked your paper and found that your 1.8ms refers to the network forward propagation only. In contrast, to simulate real scenarios, we evaluated SC3D's speed and ours over the entire tracking process, including point cloud processing, network forward propagation and post-processing. We ran the two models on all test frames and reported the averages in our paper: 45.5fps for P2B and 1.8fps for SC3D on a single 1080Ti GPU.
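The end-to-end measurement described above can be sketched as follows. This is a minimal illustration with a placeholder per-frame step, not the actual P2B/SC3D pipeline: the FPS figure is the reciprocal of the average wall-clock time per frame over all test frames.

```python
import time

def track_one_frame(frame):
    """Placeholder for one full tracking step: point cloud processing
    + network forward propagation + post-processing."""
    time.sleep(0.002)  # stand-in workload
    return frame

def end_to_end_fps(frames):
    """Average wall-clock time per frame over all test frames,
    reported as frames per second."""
    start = time.time()
    for f in frames:
        track_one_frame(f)
    mean_seconds = (time.time() - start) / len(frames)
    return 1.0 / mean_seconds

fps = end_to_end_fps(range(10))
```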

Here we re-tested SC3D and show some intermediate results for your possible reference:

1) The initial setting of SC3D (screenshot of the configuration omitted).

2) Results (platform: a single NVIDIA 1080Ti GPU):

| point cloud processing | network forward propagation | post-processing | Total |
| --- | --- | --- | --- |
| 520.9ms | 2.77ms | 36.0ms | 559.7ms (≈1.8fps) |
SilvioGiancola commented 4 years ago

Dear @HaozheQi, thank you for the timing details, I was able to reproduce the timing you reported. I see my misunderstanding: you are including the pre-processing as well.

In the SC3D paper, we only reported the time for the forward pass of the model. We did not consider the timing of the pre-processing in the current implementation, in particular because the BB generation and PC cropping/centering were not optimized: they are performed naively on the CPU rather than the GPU, for the sake of simplicity (you are using similar functions here as well).

Are you reporting the same pre-processing to reach 7.0ms? I guess the speed-up originates from the fact you don't have to generate as many BB proposals thanks to your voting approach. How about post-processing? The post-processing in SC3D takes 36.0ms because of memory transfer between GPU and CPU. How large was K for P2B to reach only 0.9ms? Any insight would be welcome :)

Cheers, Silvio

HaozheQi commented 4 years ago

Dear @SilvioGiancola, sorry for the late reply. Thanks for your inspiring feedback; we reconsidered the timing of the different steps and found some new problems.

As can be seen in test_tracking.py, P2B applies the same pre-processing as SC3D, with no GPU optimization. P2B runs faster here because it only operates on two point clouds (the template and the search area).

As for the post-processing, I was originally confused as well. It seemed that the GPU-to-CPU memory transfer took most of the time, with SC3D reporting 35.0ms and P2B 0.2ms ($K=64$). Here is the timing code:

```python
start_model_time = time.time()
# RUNNING CODE
end_model_time = time.time()
model_time.update((end_model_time-start_model_time))
```

But I just found that CUDA operations are asynchronous (ref). To measure the time exactly, we need to write:

```python
torch.cuda.synchronize()
start_model_time = time.time()
# RUNNING CODE
torch.cuda.synchronize()
end_model_time = time.time()
model_time.update((end_model_time-start_model_time))
```

In other words, the "35.0ms" mainly came from model propagation rather than the GPU-to-CPU memory transfer. I accordingly re-tested the timing and found that: 1) in SC3D, the model propagation took 37.2ms and the post-processing 1.2ms; 2) the P2B timing changed little with the new code (we suppose this is because P2B does not need to process batches during testing, as SC3D does). You may also check this part.
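The pitfall can be illustrated without a GPU by timing work submitted to a background thread: timing only the (asynchronous) launch wildly underestimates the true cost, just as timing a CUDA kernel launch without torch.cuda.synchronize() does. This is only an analogy in plain Python, not actual CUDA code:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def kernel_like_work():
    # Stands in for an asynchronously executed CUDA kernel
    time.sleep(0.05)

with ThreadPoolExecutor(max_workers=1) as pool:
    # Naive timing: measures only the near-instant submission,
    # like timing a CUDA launch without synchronization
    start = time.time()
    future = pool.submit(kernel_like_work)
    naive_ms = (time.time() - start) * 1000

    # Correct timing: wait for completion first
    # (the analogue of calling torch.cuda.synchronize())
    future.result()
    true_ms = (time.time() - start) * 1000
```

Without the wait, the "elapsed time" attributes the work to whichever later operation happens to block on the result, which is exactly how the 35.0ms was mis-attributed to post-processing.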

SilvioGiancola commented 4 years ago

Interesting observation; it is indeed important to synchronize the CUDA operations for a fair comparison. I will run it on my side as well. What timings does P2B reach for pre/post-processing and model inference? I will report the same with torch.cuda.synchronize(), in particular:

What are the equivalent portions of code for P2B?

HaozheQi commented 4 years ago

Dear @SilvioGiancola, we provide a timing schedule consistent with our paper, where we exclude the first frame and report the average tracking time for each step.

Accordingly, we synchronized the CUDA operations and re-measured the timing of P2B and SC3D under the same setting (car dataset; a single NVIDIA 1080Ti). Here are the results for your reference.

| | pre-processing | model inference | post-processing | total |
| --- | --- | --- | --- | --- |
| P2B | 5.6ms | 13.2ms | 0.76ms | 19.56ms |
| SC3D | 387.3ms | 37.1ms | 1.22ms | 425.62ms |
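As a quick sanity check, the per-stage averages in the table sum to the reported totals, and converting to frames per second gives roughly 51 fps for P2B and 2.3 fps for SC3D. This is just a back-of-the-envelope computation over the numbers above, not an additional measurement:

```python
# Per-stage averages (ms) taken from the table above:
# pre-processing, model inference, post-processing
timings_ms = {
    "P2B":  [5.6, 13.2, 0.76],
    "SC3D": [387.3, 37.1, 1.22],
}

results = {}
for tracker, stages in timings_ms.items():
    total_ms = sum(stages)        # end-to-end time per frame
    fps = 1000.0 / total_ms       # frames per second
    results[tracker] = (total_ms, fps)
    print(f"{tracker}: {total_ms:.2f}ms total, {fps:.2f} fps")
```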