Closed bentsherman closed 5 years ago
So you can fix this issue pretty easily by having the chunk select the OpenCL device provided in the settings, as in Single::setupOpenCL(). This scheme assumes that you're only running one chunk per node, because otherwise the chunks on a node would grab the same GPU. For now, though, I think we should just move forward with this assumption. We can use CUDA_VISIBLE_DEVICES to control the visibility of GPUs, even for OpenCL (Kubernetes does this automatically). But chunks, unlike MPI workers, don't have a concept of local rank, so I don't know if the dynamic selection scheme will ever work for chunk run.
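As a quick illustration of the masking idea (the commands below only echo text; the actual chunk launch command is not shown in this thread), each pod gets a different value of CUDA_VISIBLE_DEVICES, so inside every process "device 0" refers to the one GPU assigned to it:

```shell
# Each chunk process is launched with a different GPU mask; the process
# itself can then statically select device 0 without collisions.
CUDA_VISIBLE_DEVICES=0 sh -c 'echo "chunk 0 sees GPU ${CUDA_VISIBLE_DEVICES}"'
CUDA_VISIBLE_DEVICES=1 sh -c 'echo "chunk 1 sees GPU ${CUDA_VISIBLE_DEVICES}"'
```

This is why the one-chunk-per-node assumption is enough for now: with the mask in place, the settings-based device index never has to differ between chunks.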
This was easy to fix since you did the debugging for me, thanks. Let me know if it works now. :) Fixed in commit 924178ad60ae9c8687cc0eed0ed3c67d31633940.
It works! Now I just need to fix this GPU bug and then we'll really be cooking.
Now that I can test chunkrun on the NRP, I found that whenever I launch a job, only the first pod (chunk) uses its GPU. I think the cause is in Chunk::setupOpenCL(). I don't know the code intimately, but it looks like it was copied mostly verbatim from one of the other runners, where the logic should be slightly different. I'll parse through the code and see if I can suggest the change that fixes it for me.