ryanday36 opened 2 years ago
We could perhaps implement something site-specific with a lua shell plugin. It could set `ROCR_VISIBLE_DEVICES` to `CUDA_VISIBLE_DEVICES` and unset the latter if necessary for the specific system. You could experiment with `-o userrc=test.lua` and see what works before deploying anything.
That approach should work great, but I would say we should probably do it based on whether there's an AMD or NVIDIA GPU in the node, if that's an option. There are a few ways we could check if the information isn't sent to the node. The only real question is what to do on a node that has both, but as far as I know I'm the only person crazy enough to run a node like that regularly, so that's an issue we can deal with later.
In the end, what we likely want is a property on the GPU for what kind of GPU it is (AMD, NVIDIA, Intel (heh), or whatever), and ideally one that includes the GPU's stable UUID so we don't have to worry about the unstable GPU numbering issue. That makes pre-configuration of nodes and localities a lot easier, since the numbers change on things like a reboot, or looking at them funny, while the UUIDs are permanent.
> In the end, what we likely want is a property on the GPU for what kind of GPU it is, AMD, NVIDIA, Intel (heh), or whatever, and ideally the GPU's stable UUID.
Want to get a feature issue open on this? While properties can currently only be tied to an execution target (i.e. a node), RFC 20 specifies the use of the `@` character to target properties at resources that are children of a node. (Admittedly, Dong had asked for this and I don't really understand how it was meant to be used.)
Actually, if nodes always contain the same type of GPU, I don't know why we can't just do `amdgpu`/`nvidiagpu` or `gpu=amd`/`gpu=nvidia`; then users can request a given GPU presence or type with `--requires=gpu=amd`.
I don't really understand how to translate a GPU id allocated by the scheduler to a UUID, so we'll need some background or reference material on that to devise a solution.
However, for this issue, will a lua plugin be sufficient, at least for the time being? It can just be installed via ansible on the systems that require `ROCR_VISIBLE_DEVICES`, and could optionally unset `CUDA_VISIBLE_DEVICES` for the job if that is necessary.
Yes, a lua plugin would be sufficient. The rest is really a longer-term thing, and I think the solution to the scheduler translation issue would ideally be to have the scheduler use the actual physical ID of the GPU to begin with.
A question on doing this with a lua plugin. I tried setting `ROCR_VISIBLE_DEVICES` based on `CUDA_VISIBLE_DEVICES` using `shell.getenv` and `shell.setenv` in a script loaded via `-o userrc=testrc.lua`, but it looks like `CUDA_VISIBLE_DEVICES` gets changed after my testrc.lua runs:
```
[day36@tioga11:flux_test]$ flux mini alloc -N1 -n64 -t30m
[day36@tioga20:flux_test]$ env | grep -i VIS
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[day36@tioga20:flux_test]$ env | grep VIS
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[day36@tioga20:flux_test]$ cat testrc.lua
val = shell.getenv ("CUDA_VISIBLE_DEVICES")
print (val)
if val then
    shell.setenv ("ROCR_VISIBLE_DEVICES", val)
    shell.setenv ("HIP_VISIBLE_DEVICES", val)
end
[day36@tioga20:flux_test]$ flux mini run -n1 -g2 -o userrc=./testrc.lua env | grep VIS
0.042s: flux-shell[0]: stdout: 0,1,2,3,4,5,6,7
CUDA_VISIBLE_DEVICES=6,7
ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[day36@tioga20:flux_test]$
```
Is there a way to hook in later? I see that there are also `task.getenv` and `task.setenv` available to scripts run in a task context, but I'm not clear on how to get a script to run in, for example, task init.
I apologize, I should have posted an example of what I was thinking. Yes, I think you'll have to hook into `task.init`, in case the user specifies `-o gpu-affinity=per-task`. Adding callbacks from the lua initrc can be done with the `plugin.register()` function, e.g.:
```lua
plugin.register {
    name = "rocr",
    handlers = {
        {
            topic = "task.init",
            fn = function ()
                val = shell.getenv ("CUDA_VISIBLE_DEVICES")
                if val then
                    shell.setenv ("ROCR_VISIBLE_DEVICES", val)
                    shell.setenv ("HIP_VISIBLE_DEVICES", val)
                end
            end
        }
    }
}
```
Cool! That works much better (after substituting `task.` for `shell.`).
> (after substituting `task.` for `shell.`)
Ah, oops. sorry!
That looks like it should work, except that we need to unset `CUDA_VISIBLE_DEVICES` in that case, and maybe `HIP_VISIBLE_DEVICES` as well. Both of those are interpreted at the HIP layer, here: https://github.com/ROCm-Developer-Tools/HIP/blob/roc-2.1.0/src/hip_hcc.cpp#L1255
`ROCR_VISIBLE_DEVICES` is interpreted by the layer below that, which only passes the result on to HIP. So if both of them are set to, say, 1, we get badness: GPU 1 becomes GPU 0 in the ROCr layer, and then the request for GPU 1 at the HIP layer results in no remaining GPUs.
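To make that concrete, here's a sketch of how the two layers compose on a hypothetical two-GPU node when both variables are set to 1 (the node and indices are made up for illustration):

```
# Hypothetical node with two physical GPUs, 0 and 1.
ROCR_VISIBLE_DEVICES=1   # ROCr keeps only physical GPU 1 and renumbers it to logical 0
CUDA_VISIBLE_DEVICES=1   # HIP then looks for logical GPU 1 in the renumbered set,
                         # but only logical 0 remains, so the app sees no GPUs at all
```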
This is part of why we never actually added `ROCR_VISIBLE_DEVICES` before: HIP listens to `CUDA_VISIBLE_DEVICES`, and most of the applications that came through only cared about that. Things like OpenMP offload runtimes and HSA applications hook in below that and ignore HIP's configuration.
We did manage to convince AMD to add a UUID interface though, such that setting both to, say, `GPU-365628c172834d70` poses no problem, because it's the same GPU at both levels. Docs on the full interface here: https://rocmdocs.amd.com/en/latest/ROCm_System_Managment/ROCm-System-Managment.html#rocr-visible-devices
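In other words (a sketch using the example UUID above; per the linked docs, UUID selectors are accepted, so both layers resolve to the same physical device):

```
# Both layers select the same physical GPU by its stable UUID, so ROCr's
# renumbering can't invalidate the HIP-layer selection.
export ROCR_VISIBLE_DEVICES=GPU-365628c172834d70
export HIP_VISIBLE_DEVICES=GPU-365628c172834d70
```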
Either way, for now we should probably make it so that if the node has AMD GPUs, we just unset `CUDA_VISIBLE_DEVICES` and don't set `HIP_VISIBLE_DEVICES`. We can't compose them without doing some other work, and I'm not sure we ever safely can if a HIP application is involved, because of the nutty way that's interpreted.
Sounds good. I now have:
```
[day36@tioga18:flux_test]$ flux mini run -n1 -g2 -o userrc=./testrc.lua env | grep VIS
ROCR_VISIBLE_DEVICES=6,7
[day36@tioga18:flux_test]$
```
I'll get ansible to stick a lua script that does that in `/etc/flux/shell/lua.d/` on the systems with AMD GPUs.
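For reference, a minimal sketch of what that lua.d script might look like, following the `plugin.register` example above (hedged: it assumes `task.unsetenv` is available in the task context alongside `task.getenv`/`task.setenv`, and that AMD-only nodes are handled simply by only installing the file there):

```lua
-- Sketch of /etc/flux/shell/lua.d/rocr.lua: on AMD-GPU nodes, translate the
-- CUDA_VISIBLE_DEVICES assignment made by the gpu-affinity plugin into
-- ROCR_VISIBLE_DEVICES, then unset the CUDA variable so the HIP layer
-- doesn't filter the already-renumbered device list a second time.
plugin.register {
    name = "rocr",
    handlers = {
        {
            topic = "task.init",
            fn = function ()
                local val = task.getenv ("CUDA_VISIBLE_DEVICES")
                if val then
                    task.setenv ("ROCR_VISIBLE_DEVICES", val)
                    task.unsetenv ("CUDA_VISIBLE_DEVICES")
                end
            end
        }
    }
}
```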
That sounds great, mind throwing up a PR with it as well? With a bit of massaging to check what the node has, that could be a reasonable and portable solution to this.
The gpu-affinity plugin sets `CUDA_VISIBLE_DEVICES`, which governs which NVIDIA GPUs are seen, but not the AMD GPU equivalent `ROCR_VISIBLE_DEVICES` or the hipcc equivalent `HIP_VISIBLE_DEVICES`.
I'm not quite clear on whether `HIP_VISIBLE_DEVICES` is necessary if `ROCR_VISIBLE_DEVICES` is set, but it does sound like there can be problems if both `ROCR_VISIBLE_DEVICES` and `CUDA_VISIBLE_DEVICES` are set, so this might not be as clean a request as I first thought. See: https://rocmdocs.amd.com/en/latest/ROCm_System_Managment/ROCm-System-Managment.html#rocr-visible-devices and https://bugs.schedmd.com/show_bug.cgi?id=11097