Closed bentsherman closed 5 years ago
So you can fix this issue pretty easily by having the chunk select the OpenCL device provided in the settings, as in Single::setupOpenCL(). This scheme assumes that you're only running one chunk per node, because otherwise the chunks on a node would grab the same GPU. For now, though, I think we should just move forward with this assumption. We can use CUDA_VISIBLE_DEVICES to control the visibility of GPUs, even for OpenCL (Kubernetes does this automatically). But chunks, unlike MPI workers, don't have a concept of local rank, so I don't know if the dynamic selection scheme will ever work for chunk run.
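As a quick illustration of the masking idea (the commands below only echo text; the actual chunk launch command is not shown in this thread), each pod gets a different value of CUDA_VISIBLE_DEVICES, so inside every process "device 0" refers to the one GPU assigned to it:

```shell
# Each chunk process is launched with a different GPU mask; the process
# itself can then statically select device 0 without collisions.
CUDA_VISIBLE_DEVICES=0 sh -c 'echo "chunk 0 sees GPU ${CUDA_VISIBLE_DEVICES}"'
CUDA_VISIBLE_DEVICES=1 sh -c 'echo "chunk 1 sees GPU ${CUDA_VISIBLE_DEVICES}"'
```

This is why the one-chunk-per-node assumption is enough for now: with the mask in place, the settings-based device index never has to differ between chunks.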
This was easy to fix since you did the debugging for me, thanks. Let me know if it works now. :) Fixed in commit 924178ad60ae9c8687cc0eed0ed3c67d31633940.
It works! Now I just need to fix this GPU bug and then we'll really be cooking.
Now that I can test chunkrun on the NRP, I found that whenever I launch a job, only the first pod (chunk) uses its GPU. I think the cause is in Chunk::setupOpenCL(). I don't know the code intimately, but it looks like it was copied mostly verbatim from one of the other runners, where the logic should be slightly different. I'll parse through the code and see if I can suggest the change that fixes it for me.