1adrianb / face-alignment

:fire: 2D and 3D Face alignment library build using pytorch
https://www.adrianbulat.com
BSD 3-Clause "New" or "Revised" License
6.99k stars 1.34k forks

[CPU Performance is Better than GPU] #150

Closed vinayak618 closed 5 years ago

vinayak618 commented 5 years ago

Hi @1adrianb .

I was benchmarking your latest PyTorch source code for both 2D and 3D landmark detection with the SFD face detector, and I'm observing about 10x faster speed on the CPU than on the GPU, which is strange. Any help here would be appreciated.

CPU - Intel i9, 9th Generation Machine. GPU - GTX GeForce 1070 8GiB.

Thanks and Regards, Vinayak

1adrianb commented 5 years ago

Hi @vinayak618, This is strange indeed. How are you measuring this? Please note that the first image passed will be significantly slower, since the network will be copied to the device and PyTorch will initialize buffers internally.

vinayak618 commented 5 years ago

Hi @1adrianb,

Once the models were downloaded and setup was complete, I used the script and images from your examples folder to get the predictions for both 2D and 3D with the SFD face detector. Is there anything I'm doing wrong?

1adrianb commented 5 years ago

I was referring to the fact that the initial call to get_landmarks will be slower. I am afraid I am unable to tell without having a code sample. Can you also check your GPU usage during the detection/training?

vinayak618 commented 5 years ago

Hi @1adrianb,

I ran the code again and observed up to 6GiB of GPU memory usage on my machine with an 8GiB GeForce GTX 1070. I'm still observing faster prediction on the CPU, and I have no idea why.

Below is the code snippet I used, taken as-is from the test script in your examples folder, to get the predictions only.

    import time

    import face_alignment
    from skimage import io

    start_time = time.time()
    fa = face_alignment.FaceAlignment(face_alignment.LandmarksType._2D, face_detector='sfd', device='cpu')
    input = io.imread('../test/assets/aflw-test.jpg')
    preds = fa.get_landmarks(input)
    print("--- %s seconds ---" % (time.time() - start_time))

reddytocode commented 5 years ago

I've tested this in real time: resizing my input image (1024, 1024) and changing the face detector makes a big difference in processing time.

1adrianb commented 5 years ago
  1. You should run this multiple times; as I was saying, the first run will be noticeably slower. This is particularly visible on a GPU, where in addition to everything else the data will be copied to the GPU and CUDA will be initialized. For example, try running fa.get_landmarks(input) 100 times, excluding the first run, which is "warming up" the network.
  2. You shouldn't count the model creation time, i.e. fa = face_alignment.FaceAlignment(face_alignment.LandmarksType._2D, face_detector='sfd', device='cpu'). This is supposed to be run only once in your application anyway.
  3. Generally, since you are dealing with CUDA, make sure to add a synchronize call before reading the timer, just in case some kernels haven't finished.

If you do all of this, I am sure the GPU will be significantly faster.
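The three points above can be combined into one small timing helper. This is only a minimal sketch, not code from the repository: the name `benchmark` and its parameters `n_runs`, `n_warmup` and `sync` are hypothetical, and the commented usage at the bottom assumes the same `fa` and `input` objects as in the snippet posted earlier in this thread.

```python
import time


def benchmark(fn, n_runs=100, n_warmup=1, sync=None):
    """Return the mean wall-clock seconds per call of fn().

    fn       -- zero-argument callable to time
    n_runs   -- number of timed calls
    n_warmup -- number of untimed calls made first (initialization, data copies)
    sync     -- optional zero-argument callable run before each clock reading,
                e.g. torch.cuda.synchronize when timing on a GPU
    """
    for _ in range(n_warmup):
        fn()  # warm-up calls are deliberately excluded from the timing
    if sync is not None:
        sync()  # make sure warm-up work has finished before starting the clock
    start = time.perf_counter()
    for _ in range(n_runs):
        fn()
    if sync is not None:
        sync()  # make sure all timed work has finished before stopping the clock
    return (time.perf_counter() - start) / n_runs


# Hypothetical usage (assumes torch, face_alignment and a CUDA device):
#   fa = face_alignment.FaceAlignment(face_alignment.LandmarksType._2D,
#                                     face_detector='sfd', device='cuda')  # created once, not timed
#   mean_s = benchmark(lambda: fa.get_landmarks(input), sync=torch.cuda.synchronize)
```

Building the `FaceAlignment` object outside the helper keeps model creation out of the measurement, and averaging over many runs smooths out per-call noise.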

@Reddyforcode, yes, the speed of the face detector will depend on the size of the face. The face alignment part, however, is independent of that.

vinayak618 commented 5 years ago

Hi @1adrianb,

Yeah, I understand it now. Thanks for that. So the first call on the GPU takes longer than on the CPU because of the data copying and initialization. I ran the detector in a loop of 100 and observed that the GPU is considerably faster. Any idea how I can add a synchronize call? (I haven't worked much with CUDA kernels.)

1adrianb commented 5 years ago

@vinayak618 please see https://pytorch.org/docs/stable/cuda.html#torch.cuda.synchronize
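For readers who land here with the same question, a rough sketch of where the synchronize call might go is shown below. CUDA kernels are launched asynchronously, so the clock should only be read after `torch.cuda.synchronize()` has returned. The helper name `timed_call` and its `device` parameter are hypothetical; the guarded import is only there so the sketch also runs on machines without PyTorch or without a GPU.

```python
import time

try:
    import torch
except ImportError:  # allow the sketch to run where PyTorch is not installed
    torch = None


def timed_call(fn, device="cpu"):
    """Call fn() once and return (result, elapsed_seconds).

    When device names a CUDA device and CUDA is available, the current
    device is synchronized before starting and before stopping the clock,
    so pending kernels are included in the measurement.
    """
    use_cuda = (
        torch is not None
        and device.startswith("cuda")
        and torch.cuda.is_available()
    )
    if use_cuda:
        torch.cuda.synchronize()  # wait for previously queued GPU work
    start = time.perf_counter()
    result = fn()
    if use_cuda:
        torch.cuda.synchronize()  # wait for the work launched by fn()
    return result, time.perf_counter() - start


# Hypothetical usage, assuming fa was created with device='cuda':
#   preds, seconds = timed_call(lambda: fa.get_landmarks(input), device="cuda")
```

If the timed function already copies its results back to the CPU, that copy forces the GPU work to finish anyway, and the explicit call is then just a cheap safeguard.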