Closed — moes1 closed this issue 1 year ago
@moes1 sorry for the late reply, has this become clear over the years?
I'm curious about this as well. I've just built Laghos with GPU and MPI support, but I'm not able to get any results when I run with 2 processes. If I run on a node with 2 GPUs, it seems that Laghos is trying to use the same GPU for both processes. But even if I run with 2 nodes, one task per node, I still don't get a good result. If I run a single process against the GPU, it works as expected.
Typically, the assignment of different GPUs to different MPI ranks is done by the batch system on the machine: the batch system sets the environment variable CUDA_VISIBLE_DEVICES (or ROCR_VISIBLE_DEVICES on AMD GPUs) for each task, depending on how many tasks run on each compute node and how many GPUs are present on each node.
If on your system this is not done by the batch system, you can modify laghos.cpp to select the device explicitly, using something like this:

const int num_gpus_per_node = 4; // adjust to the GPU count of your nodes
dev = myid % num_gpus_per_node;  // round-robin MPI ranks over the node's GPUs
Device backend;
backend.Configure(device, dev);
In this case, you should also move Hypre::Init() after backend.Configure(...); see https://github.com/mfem/mfem/issues/3370#issuecomment-1362620903.
I'm closing this. Please reopen if something is still unclear.
How are you intending for users to assign multiple ranks to multiple devices when using a GPU backend?
Unless I am missing something, that isn't possible in the code as written. For example, you could edit laghos.cpp by adding "dev = myid;" before:
https://github.com/CEED/Laghos/blob/master/laghos.cpp#L205
This works fine, but maybe this is not what was intended? Please let me know if I am missing something! Thank you.