Adjective-Object opened 3 years ago
Since the bulk of the time is spent in synchronize() right now, it's unclear how much of that is copying to the CPU versus just waiting for the network to finish, so it's hard to say how much real savings this will net.
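One way to find out is to bracket the forward pass and the device-to-host copy with CUDA events. A minimal sketch (`model` and `batch` are stand-ins for the worker's real network and input):

```python
import torch

# Stand-ins for the worker's actual network and input batch.
model = torch.nn.Conv2d(3, 8, 3).cuda().eval()
batch = torch.randn(4, 3, 64, 64, device="cuda")

start = torch.cuda.Event(enable_timing=True)
after_net = torch.cuda.Event(enable_timing=True)
after_copy = torch.cuda.Event(enable_timing=True)

start.record()
with torch.no_grad():
    out = model(batch)      # kernels are queued asynchronously here
after_net.record()
host_out = out.cpu()        # device-to-host copy
after_copy.record()

torch.cuda.synchronize()    # wait so elapsed_time() is valid
print(f"network: {start.elapsed_time(after_net):.2f} ms, "
      f"copy to CPU: {after_net.elapsed_time(after_copy):.2f} ms")
```

The events are recorded on the GPU timeline, so the two intervals separate time spent in the network itself from time spent moving the result to the host.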
I think looking more closely at how we're using the GPU might mean getting rid of the multiple worker processes and managing multiple jobs on the same worker. That should also reduce the CUDA memory needed to keep the model loaded in torch, since we'd be able to load one copy of the model and run multiple jobs against it in parallel (sketched below)?
see: https://github.com/pytorch/pytorch/issues/48279
It seems there are some catches to getting streamed execution in PyTorch, though.
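For reference, a rough sketch of the single-worker idea: one process, one loaded model, two jobs queued concurrently on separate CUDA streams (names are illustrative, and per the issue linked above the stream semantics have caveats):

```python
import torch

# Illustrative stand-ins: one model copy shared by two concurrent jobs.
model = torch.nn.Conv2d(3, 8, 3).cuda().eval()
inputs = [torch.randn(4, 3, 64, 64, device="cuda") for _ in range(2)]
streams = [torch.cuda.Stream() for _ in range(2)]
outputs = [None, None]

with torch.no_grad():
    for i, (s, x) in enumerate(zip(streams, inputs)):
        s.wait_stream(torch.cuda.current_stream())  # inputs were allocated on the default stream
        with torch.cuda.stream(s):
            outputs[i] = model(x)   # queued on this job's own stream

# Join the side streams before touching the results.
for s in streams:
    torch.cuda.current_stream().wait_stream(s)
torch.cuda.synchronize()
```

Whether this actually overlaps work depends on how much of the GPU a single job already saturates, which is part of what the profiling above should tell us.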
The bulk of time in the worker process is spent blocking on GPU synchronization when reading the output of the neural network back to the CPU so it can be copied into shared memory.
Rather than copying the image data onto the CPU, then copying it into shared memory, then copying it again into an image, then copying it back to the GPU for display, we should keep the image data on the GPU the entire time.
One hook for that is `tensor.__cuda_array_interface__['data']`, which exposes the tensor's raw device pointer so other CUDA libraries can consume the data in place.
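For example, a sketch of a zero-copy handoff (this assumes CuPy as the consuming library, but anything that understands `__cuda_array_interface__` or DLPack would work):

```python
import torch
import cupy

out = torch.randn(3, 224, 224, device="cuda")   # stand-in for the network output

ptr, read_only = out.__cuda_array_interface__['data']  # raw device pointer

# Zero-copy view of the same GPU memory from another library.
view = cupy.asarray(out)
assert view.data.ptr == ptr  # same allocation, nothing was copied
```

The display side would then read from that shared device memory instead of round-tripping the frame through the CPU and shared memory.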