Adjective-Object opened 3 years ago
Since the bulk of the time is spent in synchronize() right now, it's unclear how much of that is copying to the CPU versus just waiting for the network to finish, so it's hard to say how much real savings this will net.
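One way to find out is to bracket the forward pass and the device-to-host copy with CUDA events. A minimal sketch (`model` and `batch` are stand-ins for the worker's real network and input):

```python
import torch

# Stand-ins for the worker's actual network and input batch.
model = torch.nn.Conv2d(3, 8, 3).cuda().eval()
batch = torch.randn(4, 3, 64, 64, device="cuda")

start = torch.cuda.Event(enable_timing=True)
after_net = torch.cuda.Event(enable_timing=True)
after_copy = torch.cuda.Event(enable_timing=True)

start.record()
with torch.no_grad():
    out = model(batch)      # kernels are queued asynchronously here
after_net.record()
host_out = out.cpu()        # device-to-host copy
after_copy.record()

torch.cuda.synchronize()    # wait so elapsed_time() is valid
print(f"network: {start.elapsed_time(after_net):.2f} ms, "
      f"copy to CPU: {after_net.elapsed_time(after_copy):.2f} ms")
```

The events are recorded on the GPU timeline, so the two intervals separate time spent in the network itself from time spent moving the result to the host.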
I think looking more closely at how we're using the GPU might mean getting rid of the multiple worker processes and managing multiple jobs on the same worker. That should also reduce the CUDA memory needed to keep the model loaded in torch, since we'd be able to load one copy of the model and run multiple jobs against it in parallel (sketched below)?
see: https://github.com/pytorch/pytorch/issues/48279
It seems there are some catches to getting streamed execution in PyTorch, though.
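For reference, a rough sketch of the single-worker idea: one process, one loaded model, two jobs queued concurrently on separate CUDA streams (names are illustrative, and per the issue linked above the stream semantics have caveats):

```python
import torch

# Illustrative stand-ins: one model copy shared by two concurrent jobs.
model = torch.nn.Conv2d(3, 8, 3).cuda().eval()
inputs = [torch.randn(4, 3, 64, 64, device="cuda") for _ in range(2)]
streams = [torch.cuda.Stream() for _ in range(2)]
outputs = [None, None]

with torch.no_grad():
    for i, (s, x) in enumerate(zip(streams, inputs)):
        s.wait_stream(torch.cuda.current_stream())  # inputs were allocated on the default stream
        with torch.cuda.stream(s):
            outputs[i] = model(x)   # queued on this job's own stream

# Join the side streams before touching the results.
for s in streams:
    torch.cuda.current_stream().wait_stream(s)
torch.cuda.synchronize()
```

Whether this actually overlaps work depends on how much of the GPU a single job already saturates, which is part of what the profiling above should tell us.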
The bulk of time in the worker process is spent blocking on GPU synchronization when reading the output of the neural network back to the CPU so it can be copied into shared memory.
Rather than copying the image data onto the CPU, then copying it into shared memory, then copying it again into an image, then copying it back to the GPU for display, we should keep the image data on the GPU the entire time.
One hook for that is `tensor.__cuda_array_interface__['data']`, which exposes the tensor's raw device pointer so other CUDA libraries can consume the data in place.
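For example, a sketch of a zero-copy handoff (this assumes CuPy as the consuming library, but anything that understands `__cuda_array_interface__` or DLPack would work):

```python
import torch
import cupy

out = torch.randn(3, 224, 224, device="cuda")   # stand-in for the network output

ptr, read_only = out.__cuda_array_interface__['data']  # raw device pointer

# Zero-copy view of the same GPU memory from another library.
view = cupy.asarray(out)
assert view.data.ptr == ptr  # same allocation, nothing was copied
```

The display side would then read from that shared device memory instead of round-tripping the frame through the CPU and shared memory.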