Adjective-Object / first-order-motion-tk

MIT License

CUDA output should not be copied to the CPU and back for display #16

Open Adjective-Object opened 3 years ago

Adjective-Object commented 3 years ago

The bulk of time in the worker process is spent blocking on GPU synchronization when reading the output of the neural network back to the CPU so it can be copied into shared memory.

Rather than copying the image data onto the CPU, then copying it into shared memory, then copying it again into an image, then copying it back to the GPU for display, we should keep the image data on the GPU the entire time.
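As a sketch of the direction (the `run_and_display` helper and the tiny stand-in network below are hypothetical, not code from this repo), the idea is that the output tensor simply stays on whatever device the model ran on, and the display path consumes it there instead of round-tripping through host memory:

```python
import torch

def run_and_display(model, inp):
    """Run the network and keep the result on-device, instead of
    bouncing it through the CPU (hypothetical helper for illustration)."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    with torch.no_grad():
        out = model(inp.to(device))
    # `out` stays on `device`; a CUDA-aware display path (e.g. CUDA/OpenGL
    # interop) would consume it here rather than calling out.cpu().numpy().
    return out

# Tiny stand-in for the real network, just to make the sketch runnable.
net = torch.nn.Conv2d(3, 3, 1)
frame = run_and_display(net, torch.rand(1, 3, 8, 8))
print(frame.shape)  # torch.Size([1, 3, 8, 8])
```

For getting the frame to a separate display process without a host copy, `torch.multiprocessing` can share CUDA tensors between processes via CUDA IPC handles, so only a handle crosses the process boundary, not the pixels.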

Adjective-Object commented 3 years ago

Since the bulk of the time is currently spent in synchronize(), it's unclear how much of that goes to copying to the CPU versus simply waiting for the network to finish, and therefore how much real savings this change can net.
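One way to split that synchronize() time apart is to record CUDA events around the forward pass and the `.cpu()` copy separately (`torch.cuda.Event` is the real PyTorch API; the helper name and the stand-in model are hypothetical):

```python
import torch

def profile_forward_vs_copy(model, inp):
    """Separate 'network compute' from 'device-to-host copy' time
    using CUDA events (sketch; assumes `inp` is already on the GPU)."""
    start = torch.cuda.Event(enable_timing=True)
    done_fwd = torch.cuda.Event(enable_timing=True)
    done_copy = torch.cuda.Event(enable_timing=True)

    start.record()
    with torch.no_grad():
        out = model(inp)
    done_fwd.record()
    host = out.cpu()          # the copy we suspect is expensive
    done_copy.record()
    torch.cuda.synchronize()  # one sync at the end, after all events

    print(f"forward: {start.elapsed_time(done_fwd):.2f} ms, "
          f"copy: {done_fwd.elapsed_time(done_copy):.2f} ms")
    return host

if torch.cuda.is_available():
    net = torch.nn.Conv2d(3, 3, 1).cuda()
    profile_forward_vs_copy(net, torch.rand(1, 3, 64, 64, device="cuda"))
```

If the "copy" bucket turns out to be small, most of the synchronize() time is just waiting on the network, and keeping the output on the GPU won't buy much.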

I think part of looking more closely at how we're using the GPU might mean getting rid of the multiple worker processes and managing multiple jobs on a single worker. That should also reduce the CUDA memory required to keep the model loaded in torch, since we'd load one copy of the model and run multiple jobs against it in parallel.

Adjective-Object commented 3 years ago

https://developer.nvidia.com/blog/gpu-pro-tip-cuda-7-streams-simplify-concurrency/
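Applied to the single-worker idea above, a sketch of running several independent jobs against one model copy, each on its own CUDA stream so their kernels and transfers can overlap (the helper name is hypothetical; `torch.cuda.Stream` and `torch.cuda.stream` are the real PyTorch APIs):

```python
import torch

def run_jobs_overlapped(model, batches):
    """Run independent inputs on separate CUDA streams against a single
    shared model copy (sketch; assumes a CUDA device is present)."""
    streams = [torch.cuda.Stream() for _ in batches]
    outputs = [None] * len(batches)
    with torch.no_grad():
        for i, (s, b) in enumerate(zip(streams, batches)):
            with torch.cuda.stream(s):   # kernels below enqueue on stream s
                outputs[i] = model(b)
    torch.cuda.synchronize()  # wait for all streams once, at the end
    return outputs

if torch.cuda.is_available():
    net = torch.nn.Conv2d(3, 3, 1).cuda()
    jobs = [torch.rand(1, 3, 8, 8, device="cuda") for _ in range(4)]
    outs = run_jobs_overlapped(net, jobs)
```

Whether the streams actually overlap depends on kernel sizes and GPU occupancy; small kernels from one stream can already saturate the device.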

Adjective-Object commented 3 years ago

see: https://github.com/pytorch/pytorch/issues/48279

It seems there are some catches to getting streamed execution working in PyTorch.