Closed — moes1 closed this issue 1 year ago
@moes1 sorry for the late reply, has this become clear over the years?
I'm curious about this as well. I've just built Laghos with GPU and MPI support, but I'm not able to get any results when I run with 2 processes. If I run on a node with 2 GPUs, it seems that Laghos is trying to use the same GPU for both processes. But even if I run with 2 nodes, one task per node, I still don't get a good result. If I run a single process against the GPU, it works as expected.
Typically, the assignment of different GPUs to different MPI ranks is done by the batch system on the machine: the batch system sets the environment variable CUDA_VISIBLE_DEVICES (or ROCR_VISIBLE_DEVICES on AMD GPUs) for each task, depending on how many tasks run on each compute node and how many GPUs are present on each node.
If on your system this is not done by the batch system, you can modify laghos.cpp to select the device explicitly, using something like this:

const int num_gpus_per_node = 4; // adjust to the GPU count of your nodes
dev = myid % num_gpus_per_node;  // round-robin MPI ranks over the node's GPUs
Device backend;
backend.Configure(device, dev);
In this case, you should also move Hypre::Init() after backend.Configure(...); see https://github.com/mfem/mfem/issues/3370#issuecomment-1362620903.
I'm closing this. Please reopen if something is still unclear.
How are you intending for users to assign multiple ranks to multiple devices when using a GPU backend?
Unless I am missing something, that isn't possible in the code as written. For example, you could edit laghos.cpp by adding "dev = myid;" before:
https://github.com/CEED/Laghos/blob/master/laghos.cpp#L205
This works fine, but maybe this is not what was intended? Please let me know if I am missing something! Thank you.