D-X-Y / landmark-detection

Four landmark detection algorithms, implemented in PyTorch.
https://xuanyidong.com/assets/projects/TPAMI-2020-SRT.html
MIT License

SAN & SBR are slower on GPU than on CPU #47

Closed ModarD closed 5 years ago

ModarD commented 5 years ago

Which project are you using?

SAN or SBR

I ran SAN and SBR on Google Colab with a Tesla K80 GPU and CUDA V10.0.130, but the execution time is always longer on GPU than on CPU.

SAN: GPU = 2.49453s, CPU = 1.21520s
SBR: GPU = 6.39389s, CPU = 1.90000s

Any idea what could cause this issue?

Thanks!

D-X-Y commented 5 years ago

Would you mind providing more details? How did you measure the speed? Could you provide a small script to reproduce this?

ModarD commented 5 years ago

I wrap the network forward pass with time.time(); for example, in SAN I wrap line 60 https://github.com/D-X-Y/landmark-detection/blob/4cd4531d1088044a80a22e9d7e5c9f91d21df988/SAN/san_eval.py#L60 like this:

    # time a single forward pass (requires `import time` at the top of san_eval.py)
    t1 = time.time()
    batch_heatmaps, batch_locs, batch_scos, _ = net(inputs)
    t2 = time.time()
    t = t2 - t1
    print('{:0.5f}s'.format(t))

D-X-Y commented 5 years ago

It could be caused by GPU warm-up. Try running the forward pass 100 times as a warm-up, then run it another 50 times and average the time.
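
A minimal sketch of that warm-up-then-average pattern could look like the following, assuming the same `net` and `inputs` as in san_eval.py; `torch.cuda.synchronize()` is added here (not part of the original snippet) so that queued GPU work finishes before the clock is read:

    import time
    import torch

    with torch.no_grad():
        # warm-up: run the forward pass 100 times and discard the timings;
        # the first iterations absorb one-time CUDA initialization costs
        for _ in range(100):
            net(inputs)
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # wait for all queued GPU work to finish

        # timed runs: 50 forward passes, averaged
        start = time.time()
        for _ in range(50):
            net(inputs)
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # make sure the GPU is done before reading the clock
        avg = (time.time() - start) / 50
        print('average forward time: {:0.5f}s'.format(avg))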

ModarD commented 5 years ago

Thank you! That was indeed the issue.

    # time 100 consecutive forward passes; the first iteration absorbs the GPU warm-up cost
    for i in range(100):
        t1 = time.time()
        batch_heatmaps, batch_locs, batch_scos, _ = net(inputs)
        t2 = time.time()
        t = t2 - t1
        print('{:0.5f}s'.format(t))

The output is:

2.48314s
0.05905s
0.05862s
0.05740s
0.05713s
...

Thanks!

D-X-Y commented 5 years ago

No worries.