flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

set ROCR_VISIBLE_DEVICES and HIP_VISIBLE_DEVICES in gpu-affinity shell plugin #4534

Open ryanday36 opened 2 years ago

ryanday36 commented 2 years ago

The gpu-affinity plugin sets CUDA_VISIBLE_DEVICES, which governs which NVIDIA GPUs are seen, but not the AMD equivalent ROCR_VISIBLE_DEVICES or the HIP runtime equivalent HIP_VISIBLE_DEVICES.

I'm not quite clear on whether HIP_VISIBLE_DEVICES is necessary if ROCR_VISIBLE_DEVICES is set, but it does sound like there can be problems if both ROCR_VISIBLE_DEVICES and CUDA_VISIBLE_DEVICES are set, so this might not be quite as clean a request as I first thought. See: https://rocmdocs.amd.com/en/latest/ROCm_System_Managment/ROCm-System-Managment.html#rocr-visible-devices and https://bugs.schedmd.com/show_bug.cgi?id=11097

grondo commented 2 years ago

We could perhaps implement something site-specific with a lua shell plugin. It could set ROCR_VISIBLE_DEVICES to CUDA_VISIBLE_DEVICES and unset the latter if necessary for the specific system. You could experiment with -o userrc=test.lua and see what works before deploying anything.

trws commented 2 years ago

That approach should work great, but I would say we should probably do it based on whether there's an AMD or NVIDIA GPU in the node, if that's an option. There are a few ways we could check if the information isn't sent to the node. The only real question is what to do on a node that has both, but as far as I know I'm the only person crazy enough to run a node like that regularly, so that's an issue we can deal with later.

In the end, what we likely want is a property on the GPU for what kind of GPU it is (AMD, NVIDIA, Intel (heh), or whatever), ideally including the GPU's stable UUID so we don't have to worry about the unstable GPU numbering issue. That makes pre-configuration of nodes and localities a lot easier, since the numbers change on things like a reboot, or looking at them funny, while the UUIDs are permanent.

grondo commented 2 years ago

In the end, what we likely want is a property on the GPU for what kind of GPU it is (AMD, NVIDIA, Intel (heh), or whatever), ideally including the GPU's stable UUID so we don't have to worry about the unstable GPU numbering issue. That makes pre-configuration of nodes and localities a lot easier, since the numbers change on things like a reboot, or looking at them funny, while the UUIDs are permanent.

Want to get a feature issue open on this? While properties can currently only be tied to an execution target (i.e. a node), RFC 20 specifies the use of the @ character to target properties to resources that are children of a node. (Admittedly, Dong had asked for this and I don't really understand how it was meant to be used.)

Actually, if nodes always contain the same type of GPU, I don't know why we can't just do amdgpu/nvidiagpu or gpu=amd/gpu=nvidia; then users can request a given GPU presence or type with --requires=gpu=amd.
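
For example (hypothetical invocation; myapp is just a placeholder, and it assumes nodes carried such a gpu=amd property), a request might look something like:

flux mini run -N1 -g1 --requires=gpu=amd myapp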

I don't really understand how to translate a GPU id allocated by the scheduler to a UUID, so we'll need some background or reference material on that to devise a solution.

However, for this issue, will a lua plugin be sufficient, at least for the time being? It can just be installed via ansible on the systems that require ROCR_VISIBLE_DEVICES and could optionally unset CUDA_VISIBLE_DEVICES for the job if that is necessary.

trws commented 2 years ago

Yes, a lua plugin would be sufficient. The rest is really a longer term thing, and I think the solution to the scheduler translation issue would ideally be to have the scheduler use the actual physical ID of the GPU to begin with.

ryanday36 commented 2 years ago

A question on doing this with a lua plugin. I tried setting ROCR_VISIBLE_DEVICES based on CUDA_VISIBLE_DEVICES using shell.getenv and shell.setenv in a script loaded via -o userrc=testrc.lua, but it looks like CUDA_VISIBLE_DEVICES gets changed after my testrc.lua runs:

[day36@tioga11:flux_test]$ flux mini alloc -N1 -n64 -t30m
[day36@tioga20:flux_test]$ env | grep -i VIS
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[day36@tioga20:flux_test]$ env | grep VIS
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[day36@tioga20:flux_test]$ cat testrc.lua 
val = shell.getenv ("CUDA_VISIBLE_DEVICES")
print (val)
if val then
    shell.setenv ("ROCR_VISIBLE_DEVICES", val)
    shell.setenv ("HIP_VISIBLE_DEVICES", val)
end
[day36@tioga20:flux_test]$ flux mini run -n1 -g2 -o userrc=./testrc.lua env | grep VIS
0.042s: flux-shell[0]: stdout: 0,1,2,3,4,5,6,7
CUDA_VISIBLE_DEVICES=6,7
ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[day36@tioga20:flux_test]$

Is there a way to hook in later? I see that there are also task.getenv and task.setenv available to scripts run in a task context, but I'm not clear on how to get a script to run in, for example, task init.

grondo commented 2 years ago

I apologize; I should have posted an example of what I was thinking. Yes, I think you'll have to hook into task.init, in case the user specifies -o gpu-affinity=per-task.

Adding callbacks from the lua initrc can be done with the plugin.register() function, e.g.:

plugin.register {
  name = "rocr",
  handlers = {
    {
        topic = "task.init",
        fn = function()
            val = shell.getenv ("CUDA_VISIBLE_DEVICES")
            if val then
                shell.setenv ("ROCR_VISIBLE_DEVICES", val)
                shell.setenv ("HIP_VISIBLE_DEVICES", val)
            end
        end
    }
  }
}

ryanday36 commented 2 years ago

Cool! That works much better (after substituting task. for shell.).
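
For reference, the working form with that substitution looks roughly like this (a sketch only, using the task.getenv / task.setenv accessors mentioned above; exact handling may still need site-specific tweaks):

plugin.register {
  name = "rocr",
  handlers = {
    {
        topic = "task.init",
        fn = function()
            -- mirror the per-task CUDA assignment for the ROCm / HIP runtimes
            local val = task.getenv ("CUDA_VISIBLE_DEVICES")
            if val then
                task.setenv ("ROCR_VISIBLE_DEVICES", val)
                task.setenv ("HIP_VISIBLE_DEVICES", val)
            end
        end
    }
  }
}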

grondo commented 2 years ago

(after substituting task. for shell.).

Ah, oops. sorry!

trws commented 2 years ago

That looks like it should work, except that we need to unset CUDA_VISIBLE_DEVICES in that case, and maybe HIP_VISIBLE_DEVICES as well. Both of those are interpreted at the HIP layer, here: https://github.com/ROCm-Developer-Tools/HIP/blob/roc-2.1.0/src/hip_hcc.cpp#L1255

ROCR_VISIBLE_DEVICES is interpreted by the layer below that, which only passes the result on to HIP. So if both of them are set to, say, 1, we get badness: GPU 1 becomes GPU 0 in the ROCr layer, and then the request for GPU 1 in the HIP layer results in no remaining GPUs.

This is part of why we never actually added ROCR_VISIBLE_DEVICES before: HIP listens to CUDA_VISIBLE_DEVICES, and most of the applications that came through only cared about that. Things like OpenMP offload runtimes and HSA applications hook in below that and ignore HIP's configuration.

We did manage to convince AMD to add a UUID interface, though, so setting both to, say, GPU-365628c172834d70 poses no problem because it's the same GPU at both levels. Docs on the full interface are here: https://rocmdocs.amd.com/en/latest/ROCm_System_Managment/ROCm-System-Managment.html#rocr-visible-devices

Either way, for now, probably make it so that if the node has AMD GPUs we just unset CUDA_VISIBLE_DEVICES and don't set HIP_VISIBLE_DEVICES. We can't compose them without doing some other work, and I'm not sure we ever safely can if a HIP application is involved, because of the nutty way that's interpreted.

ryanday36 commented 2 years ago

Sounds good. I now have:

[day36@tioga18:flux_test]$ flux mini run -n1 -g2 -o userrc=./testrc.lua env | grep VIS
ROCR_VISIBLE_DEVICES=6,7
[day36@tioga18:flux_test]$

I'll get ansible to stick a lua script that does that in /etc/flux/shell/lua.d/ on the systems with AMD GPUs.
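
For illustration, that lua.d script will look roughly like the sketch below (it assumes a task.unsetenv accessor alongside task.setenv; details may change once a check for AMD GPUs on the node is added):

plugin.register {
  name = "rocr",
  handlers = {
    {
        topic = "task.init",
        fn = function()
            local val = task.getenv ("CUDA_VISIBLE_DEVICES")
            if val then
                -- per the discussion above, on AMD-GPU nodes expose the
                -- assignment via ROCR_VISIBLE_DEVICES only: drop
                -- CUDA_VISIBLE_DEVICES and leave HIP_VISIBLE_DEVICES unset
                task.setenv ("ROCR_VISIBLE_DEVICES", val)
                task.unsetenv ("CUDA_VISIBLE_DEVICES")
            end
        end
    }
  }
}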

trws commented 2 years ago

That sounds great. Mind throwing up a PR with it as well? A bit of massaging to check what the node has could make that a reasonable, portable solution to this issue.