Closed jchtheron closed 1 year ago
I cannot reproduce the error on a Linux computer running CentOS.
[mcgratta@burn Test]$ cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
[mcgratta@burn Test]$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 106
Model name: Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz
Thanks for the quick response.
I have tried the same cases on other hardware and have also not been able to recreate the same behaviour:
Model name: Intel(R) Xeon(R) Gold 6336Y CPU @ 2.40GHz
Model name: AMD Ryzen 9 5950X 16-Core Processor
The hardware for which it currently does not work is
Model name: AMD EPYC 7313 16-Core Processor
Is it expected that FDS behaves differently on different processors?
Can you think of any other variables which may affect the behaviour of FDS?
You can try the latest release, although we are not doing anything different in the compile. We do not use chip-specific optimization, so FDS should run on Intel and AMD. That said, this appears to be something involving MPI, which means that there are other "variables" here. Is the computer with the EPYC chip different than the others in terms of operating system?
Using FDS 6.8.0 yields the same behaviour: working with 1 or more meshes under 1 MPI process but hanging indefinitely for multiple MPI processes. I think you're right about MPI being the culprit.
The CentOS 7 install is fairly stock except for the configuration needed for an SGE scheduler. Please let me know if this might point to a likely cause. In the meantime, I will try to gather relevant info about the OS setup from the supplier or otherwise.
Are you using SGE on all platforms? We use SLURM here at NIST. There is nothing special about that in the way we compile, but it might be another clue.
The other systems do not have SGE installed.
While debugging, I am not running through the SGE scheduler - just following the documentation as closely as possible - to, hopefully, eliminate as many variables as possible. But SGE is the only system-wide install I can think of that makes this setup differ from a stock CentOS 7 install.
What if you try to run the case using an SGE run script?
This is the end goal. I noticed the issue while trying to run a case through SGE. The example here is my attempt to debug but I am getting the same strange behaviour without SGE.
For reference, here is a minimal SGE submission script:
#!/bin/sh
#$ -N fds_debug
#$ -pe parallel_environment 2
#$ -S /bin/sh
#$ -j y
#$ -cwd
umask 000
# Sun Grid Engine
FDS_NP=$NSLOTS
FDS_HF=machines
cut -d" " -f1,2 < $PE_HOSTFILE | sed 's/ /:/' > $FDS_HF
# FDS Environment
ulimit -s unlimited
FDS_PATH=/cluster/programs/fds/fds-6.7.9
source $FDS_PATH/bin/FDS6VARS.sh
source $FDS_PATH/bin/SMV6VARS.sh
export OMP_NUM_THREADS=1
# Run
mpiexec -n $FDS_NP -machine $FDS_HF $FDS_PATH/bin/fds test.fds
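For readers unfamiliar with SGE, the cut/sed line in the script above converts SGE's $PE_HOSTFILE format into the host:slots form that mpiexec expects. A self-contained sketch of that conversion, run against a fabricated two-line hostfile (hostnames and slot counts are illustrative):

```shell
# Sketch of the hostfile conversion performed in the submission script.
# SGE writes $PE_HOSTFILE with one line per host:
#   "hostname slots queue processor_range"
# while mpiexec's machine file wants "hostname:slots".
printf 'node01 4 all.q@node01 UNDEFINED\nnode02 4 all.q@node02 UNDEFINED\n' > pe_hostfile.tmp

# Keep fields 1-2 (hostname, slots), then join them with a colon.
cut -d" " -f1,2 < pe_hostfile.tmp | sed 's/ /:/' > machines

cat machines   # node01:4 / node02:4, one per line
```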
I am out of ideas. The last recourse is to compile the code yourself on that system. There are many libraries that are involved in MPI and I do not know whether or not yours are all compatible with the executable that we build.
Did you run the multi-process cases that failed using just mpiexec, i.e. without the SGE job scheduler?
I am experiencing the same behaviour both when using mpiexec directly and when running through the job scheduler.
Some additional testing:
Intel Xeon E5-1650 v4 running CentOS 7.8 (worked)
Intel Xeon Silver 4116 running CentOS 7.6 (worked)
AMD EPYC 7281 running CentOS 7.5 (does not work)
It appears to be an issue involving Intel MPI on AMD EPYC.
Some searching online seems to suggest that there are indeed issues with this combination of MPI library and hardware; however, I have yet to find a solution. I will update here if I make any progress.
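For anyone debugging a similar hang, one generic diagnostic worth mentioning (an assumption on my part, not something verified to resolve this particular case) is to turn up Intel MPI's startup diagnostics and force the plain TCP fabric provider. `I_MPI_DEBUG` and `FI_PROVIDER` are documented Intel MPI (2019+) / libfabric controls:

```shell
# Hypothetical diagnostic run; the effect on this specific EPYC hang
# is unverified.
export I_MPI_DEBUG=5     # print fabric selection and pinning info at startup
export FI_PROVIDER=tcp   # force the plain TCP libfabric provider
mpiexec -n 2 fds test.fds
```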
I have been able to run the case successfully with an older version of Intel MPI:
Intel MPI Library 2018 Update 3 for Linux* OS
Version: 2018.3.222
This particular install comes pre-packaged with Ansys CFX 2020.
Unfortunately, I have not been able to make much progress with the Intel MPI that came with FDS.
If this is an issue of library/version/manufacturer incompatibilities, I am not going to be much help either. Usually, when someone has an OS or hardware that doesn't work with our release executable, we recommend that they compile themselves.
Yes, understandable.
I am happy for you to close the issue - I will follow up with Intel/AMD if possible or compile if I run into further issues.
OK, thanks. If you notice other troubles with AMD chips, let us know. It is still very difficult to know if functionality issues can be traced back to the actual chip.
Good Afternoon,
I am unable to run FDS in parallel.
I may be missing something fundamental, but have prepared a simple case below to reproduce the issue.
The Issue
Running FDS with a 2-mesh case on 1 MPI process works:
Running FDS with a 2-mesh case on 2 MPI processes hangs indefinitely after the first iteration:
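The console transcripts for these two runs did not survive the copy; assuming the fds binary is on the PATH after sourcing the environment scripts, the two invocations would look like:

```shell
# 1 MPI process, 2 meshes -- completes normally
mpiexec -n 1 fds test.fds

# 2 MPI processes, 2 meshes -- hangs after the first iteration
mpiexec -n 2 fds test.fds
```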
Using top, one can see that the 2 MPI processes have been spawned:

Supplementary information
Operating System:
FDS 6.7.9 is installed and the environment has been sourced:
OpenMP is disabled since the CPUs have 1 thread per core:
Here is a super simple case that runs in seconds:
Here is the same case with the mesh split into two in the Z-direction:
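The input files themselves were stripped from this copy of the issue. A minimal two-mesh case of the kind described (the CHID, grid dimensions, and the split location are illustrative, not the original file) would look roughly like:

```
&HEAD CHID='test', TITLE='Two-mesh MPI debug case' /
&TIME T_END=10.0 /
&MESH IJK=20,20,10, XB=0.0,1.0, 0.0,1.0, 0.0,0.5 /  lower half of the domain
&MESH IJK=20,20,10, XB=0.0,1.0, 0.0,1.0, 0.5,1.0 /  upper half
&TAIL /
```

With one MPI process, both meshes are handled by the same rank; with two, each mesh gets its own rank and the meshes exchange boundary data over MPI, which is where the hang appears.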
Running FDS with a single mesh case in series works:
Running FDS with a single mesh case on 1 MPI process works:
I hope I am missing something simple. Please let me know if anyone has dealt with an issue like this before.