gdtk-uq / gdtk

The Gas Dynamics Toolkit (GDTk) is a set of software tools for simulating high speed fluid flow, maintained at The University of Queensland and the University of Southern Queensland, Australia.
https://gdtk.uqcloud.net/

Error running across multiple nodes #25

Closed zlpurdue closed 1 year ago

zlpurdue commented 1 year ago

Hello,

I ran into this issue while running cases on Notre Dame's HPC system. Cases run just fine on a single node; however, when running across multiple nodes, e4mpi keeps outputting errors. Attached is the output file with the errors I am getting.

Notre Dame support, after looking at the problem, says this is not an issue with their system. I recompiled but that didn't seem to resolve the issue.

ZL

Dist.o293344.txt

uqngibbo commented 1 year ago

Hi zl,

That's very strange. If the code definitely works on one node, but not an identical simulation on multiple nodes, then that is for sure a configuration issue with the system. What exactly did the support team say about it?

In the meantime it's worth double-checking a few things:

1.) Can you post what version of OpenMPI you are using? Also, make sure that the nodes running your code have the same version as the one you compiled with. You can check this by typing:

$ echo $PATH
$ echo $LD_LIBRARY_PATH

and also putting the same commands in your job submission script.

2.) Use $ which e4mpi to find out where the executable is located and double-check the time stamp on its creation. It's very easy to accidentally be using an old version of the program that hasn't been overwritten when you recompile.

3.) Does your HPC use Slurm as the queueing system? We've had issues in the past when using mpirun (bad) instead of srun (good) to launch jobs.
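
Checks 1 and 2 above could be bundled into something like the following, run once interactively and once from inside the job submission script so the two outputs can be compared. This is only a sketch: `check_exe` is a made-up helper name (not part of GDTk), and the `ldd` step is an extra assumption that the executable is dynamically linked on a Linux system.

```shell
# Hypothetical helper (not part of GDTk): report which executable is picked up
# and the environment it will be run with.
check_exe() {
    exe_path=$(command -v "$1") || { echo "$1: not found on PATH"; return 1; }
    echo "host: $(hostname)"
    echo "PATH=$PATH"
    echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH:-<unset>}"
    echo "executable: $exe_path"
    ls -l "$exe_path"    # time stamp of the build that is actually being used
    # Shared libraries the loader cannot resolve are listed as "not found".
    if command -v ldd >/dev/null 2>&1; then
        ldd "$exe_path" 2>/dev/null | grep "not found" || echo "no unresolved shared libraries reported"
    fi
}

check_exe e4mpi || true
```

If the interactive output and the job-script output differ in `PATH`, `LD_LIBRARY_PATH`, or the executable path, that difference is the first thing to chase.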

Let me know how you go,

Nick

zlpurdue commented 1 year ago

Nick,

My thoughts exactly. ND was adamant it wasn't on their end though.

  1. I am using v4.0.1. Nodes are using the same version as compiled with.

  2. It's using the most recent recompile

  3. It's using the Univa Grid Engine (UGE) batch submission system which uses mpirun. Sounds like this is the issue.

ZL

uqngibbo commented 1 year ago

Any action on this? All of that sounds perfectly reasonable to me.

zlpurdue commented 1 year ago

Any workarounds or tips for working with the different submission system?

ZL

rjgollan-on-github commented 1 year ago

Hi ZL, We haven't used UGE anywhere locally. I see that it's a fork of Sun Grid Engine. It's about a decade and a half since I've used SGE, so no tips, sorry.

Your original error message relates to loading of shared libraries. This seems like a basic environment issue related to the LD_LIBRARY_PATH. I would work with your sys admins on getting a printout of that environment variable on non-master nodes when you're doing multi-node work.

What you want is for your environment at submission to be replicated across all nodes. How this is done differs between batch/queue systems.
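
A minimal sketch of such a printout, assuming a Grid Engine style job script and Open MPI's `mpirun`; it has not been tested on UGE. `report_env` is a made-up helper name, the parallel environment request is site-specific and deliberately left out, and `NSLOTS` is used only as a sign that the script is running inside a Grid Engine job.

```shell
#!/bin/bash
#$ -V    # Grid Engine option: export the submission environment to the job
# A "#$ -pe ..." line is also needed for multi-node jobs; its name is site-specific.

# Hypothetical helper: one line per host showing the library search path.
report_env() {
    echo "$(hostname): LD_LIBRARY_PATH=${LD_LIBRARY_PATH:-<unset>}"
}

report_env    # the master node, where the job script itself runs

# Launch one process per allocated node and print what each one sees.
# Only attempted inside a Grid Engine job (NSLOTS set) with mpirun available.
if command -v mpirun >/dev/null 2>&1 && [ -n "${NSLOTS:-}" ]; then
    mpirun --map-by ppr:1:node \
        sh -c 'echo "$(hostname): LD_LIBRARY_PATH=${LD_LIBRARY_PATH:-<unset>}"'
fi
```

If the non-master nodes print a different or empty value, Open MPI's `-x LD_LIBRARY_PATH` option to `mpirun` is one way to forward the variable to the launched processes. Whether that is enough depends on which library fails to load, so it is worth confirming with the sys admins.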

Cheers, Rowan