HaozheQi / P2B

P2B: Point-to-Box Network for 3D Object Tracking in Point Clouds

Timing details for SC3D #5

Closed SilvioGiancola closed 4 years ago

SilvioGiancola commented 4 years ago

Dear @HaozheQi,

I tried to contact you via email but could not reach you (the email was rejected by your server). Congrats on your paper being accepted as an oral at CVPR'20; merging the 3D Siamese network with VoteNet is a nice idea.

I went through the details of your paper and wondered how you estimated the "Running speed" in section 4.5. You claim that "SC3D in default setting ran with 1.8 FPS on the same platform." However, the original SC3D paper states the following: "Our model takes on average 1.8ms to evaluate 147 candidates." (Section 5 on Timing). Can you clarify where the 1.8 FPS figure originates? I am afraid it might be a typo.

Thank you,

Silvio Giancola

HaozheQi commented 4 years ago

Dear @SilvioGiancola,

Thanks for your attention to our work. Your pioneering effort on this topic inspired us a lot.

We re-checked your paper and found that your 1.8ms refers to the network forward propagation only. In contrast, to simulate real scenarios, we evaluated SC3D's speed and ours over the entire tracking process, including point cloud processing, network forward propagation and post-processing. We ran the two models on all test frames and reported the averages in our paper: 45.5fps for P2B and 1.8fps for SC3D on a single 1080Ti GPU.
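The end-to-end measurement described above can be sketched as follows. This is a minimal illustration with a placeholder per-frame step, not the actual P2B/SC3D pipeline: the FPS figure is the reciprocal of the average wall-clock time per frame over all test frames.

```python
import time

def track_one_frame(frame):
    """Placeholder for one full tracking step: point cloud processing
    + network forward propagation + post-processing."""
    time.sleep(0.002)  # stand-in workload
    return frame

def end_to_end_fps(frames):
    """Average wall-clock time per frame over all test frames,
    reported as frames per second."""
    start = time.time()
    for f in frames:
        track_one_frame(f)
    mean_seconds = (time.time() - start) / len(frames)
    return 1.0 / mean_seconds

fps = end_to_end_fps(range(10))
```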

Here we re-tested SC3D and show some intermediate results for your possible reference:

1) The initial setting of SC3D (screenshot of the configuration omitted).

2) Results (platform: a single NVIDIA 1080Ti GPU):

| point cloud processing | network forward propagation | post-processing | Total |
| --- | --- | --- | --- |
| 520.9ms | 2.77ms | 36.0ms | 559.7ms (≈1.8fps) |
SilvioGiancola commented 4 years ago

Dear @HaozheQi, thank you for the timing details, I was able to reproduce the timing you reported. I see my misunderstanding: you are including the pre-processing as well.

In the SC3D paper, we only reported the time for the forward pass of the model. We did not consider the timing of the pre-processing in the current implementation, in particular because the BB generation and PC cropping/centering were not optimized: they are performed naively on the CPU rather than the GPU, for the sake of simplicity (you are using similar functions here as well).

Are you reporting the same pre-processing to reach 7.0ms? I guess the speed-up originates from the fact you don't have to generate as many BB proposals thanks to your voting approach. How about post-processing? The post-processing in SC3D takes 36.0ms because of memory transfer between GPU and CPU. How large was K for P2B to reach only 0.9ms? Any insight would be welcome :)

Cheers, Silvio

HaozheQi commented 4 years ago

Dear @SilvioGiancola, sorry for the late reply. Thanks for your inspiring feedback; we reconsidered the timing of the different steps and found some new problems.

As can be seen in test_tracking.py, P2B applies the same pre-processing as SC3D, with no GPU optimization. P2B runs faster here because it only operates on two point clouds (the template and the search area).

As for the post-processing, I was originally confused as well. It seemed that the GPU-to-CPU memory transfer took most of the time, with SC3D reporting 35.0ms and P2B 0.2ms ($K=64$). Here is the timing code:

```python
start_model_time = time.time()
# RUNNING CODE
end_model_time = time.time()
model_time.update((end_model_time-start_model_time))
```

But I just found that CUDA operations are asynchronous (ref). To measure the time exactly, we need to write:

```python
torch.cuda.synchronize()
start_model_time = time.time()
# RUNNING CODE
torch.cuda.synchronize()
end_model_time = time.time()
model_time.update((end_model_time-start_model_time))
```

In other words, the "35.0ms" mainly came from model propagation rather than the GPU-to-CPU memory transfer. I accordingly re-tested the timing and found that: 1) in SC3D, the model propagation took 37.2ms and the post-processing 1.2ms; 2) the P2B timing changed little with the new code (we suppose this is because P2B does not need to process batches during testing, as SC3D does). You may also check this part.
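The pitfall can be illustrated without a GPU by timing work submitted to a background thread: timing only the (asynchronous) launch wildly underestimates the true cost, just as timing a CUDA kernel launch without torch.cuda.synchronize() does. This is only an analogy in plain Python, not actual CUDA code:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def kernel_like_work():
    # Stands in for an asynchronously executed CUDA kernel
    time.sleep(0.05)

with ThreadPoolExecutor(max_workers=1) as pool:
    # Naive timing: measures only the near-instant submission,
    # like timing a CUDA launch without synchronization
    start = time.time()
    future = pool.submit(kernel_like_work)
    naive_ms = (time.time() - start) * 1000

    # Correct timing: wait for completion first
    # (the analogue of calling torch.cuda.synchronize())
    future.result()
    true_ms = (time.time() - start) * 1000
```

Without the wait, the "elapsed time" attributes the work to whichever later operation happens to block on the result, which is exactly how the 35.0ms was mis-attributed to post-processing.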

SilvioGiancola commented 4 years ago

Interesting observation; it is indeed important to synchronize the CUDA operations for a fair comparison. I will run it on my side as well. What timings does P2B reach for pre/post-processing and model inference? I will report the same with torch.cuda.synchronize(), in particular:

What are the equivalent portions of code for P2B?

HaozheQi commented 4 years ago

Dear @SilvioGiancola, we provide a timing schedule consistent with our paper, where we exclude the first frame and report the average tracking time for each step.

Accordingly, we synchronized the CUDA operations and re-measured the timing of P2B and SC3D under the same setting (car dataset; a single NVIDIA 1080Ti). Here are the results for your reference.

| | pre-processing | model inference | post-processing | total |
| --- | --- | --- | --- | --- |
| P2B | 5.6ms | 13.2ms | 0.76ms | 19.56ms |
| SC3D | 387.3ms | 37.1ms | 1.22ms | 425.62ms |
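As a quick sanity check, the per-stage averages in the table sum to the reported totals, and converting to frames per second gives roughly 51 fps for P2B and 2.3 fps for SC3D. This is just a back-of-the-envelope computation over the numbers above, not an additional measurement:

```python
# Per-stage averages (ms) taken from the table above:
# pre-processing, model inference, post-processing
timings_ms = {
    "P2B":  [5.6, 13.2, 0.76],
    "SC3D": [387.3, 37.1, 1.22],
}

results = {}
for tracker, stages in timings_ms.items():
    total_ms = sum(stages)        # end-to-end time per frame
    fps = 1000.0 / total_ms       # frames per second
    results[tracker] = (total_ms, fps)
    print(f"{tracker}: {total_ms:.2f}ms total, {fps:.2f} fps")
```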