firemodels / fds

Fire Dynamics Simulator
https://pages.nist.gov/fds-smv/

Running FDS in parallel on Linux cluster #12248

Closed JulietteFranque closed 9 months ago

JulietteFranque commented 9 months ago

Hi all, I am trying to run a simulation in parallel on a Linux cluster, specifically Lonestar6 at TACC (https://docs.tacc.utexas.edu/hpc/lonestar6/). It uses SLURM. I am not new to FDS, but I am new to submitting jobs and parallel computing, so please bear with me.

I've set up a "dummy" simulation in which nothing happens for testing purposes: bus.txt

&HEAD CHID='bus'/
&TIME T_END=10.0/

&MESH ID='mesh1', IJK=5, 10, 10, XB=0, 0.5, 0, 1, 0, 1, MPI_PROCESS=0 /
&MESH ID='mesh2', IJK=5, 10, 10, XB=0.5, 1, 0, 1, 0, 1, MPI_PROCESS=1 /

&TAIL/

I am able to run this simulation in parallel on my mac laptop (locally) using mpiexec, and I can run it serially (modified to only have one mesh) on the TACC computer. To try and run it in parallel on TACC, I use the following batch file:

#!/bin/bash
                          # Use bash shell
#SBATCH -J myjob          # Job Name
#SBATCH -o myjob.o      # Name of the output file (myMPI.oJobID)
#SBATCH -p development    # Queue name
#SBATCH -t 00:05:00       # Run time (hh:mm:ss) - 5 minutes
#SBATCH -N 2             
#SBATCH -n 2             
#SBATCH -e myjob.e    # Direct error to the error file

#SBATCH --mail-type=all    # Send email at begin and end of job
#SBATCH --mail-user=juliette.franqueville@utexas.edu

module list

# Any other commands must follow all #SBATCH directives...
ibrun fds  bus.fds

Note that "ibrun" is the command that TACC says to use for parallel computing (I quote, "To launch an MPI application, use the TACC-specific MPI launcher ibrun, which is a Lonestar6-aware replacement for generic MPI launchers like mpirun and mpiexec"). The problem is that the simulation "hangs": fds seems to start, but the simulation does not start. The two output files show:

Currently Loaded Modules:
  1) intel/19.1.1   3) python3/3.9.7   5) pmix/3.2.3     7) TACC
  2) impi/19.0.9    4) cmake/3.24.2    6) xalt/2.10.32

 Starting FDS ...

 MPI Process      1 started on c306-005.ls6.tacc.utexas.edu
 MPI Process      0 started on c304-006.ls6.tacc.utexas.edu

 Reading FDS input file ...

 Fire Dynamics Simulator

 Current Date     : November 28, 2023  11:19:05
 Revision         : FDS-6.8.0-0-g886e009-release
 Revision Date    : Tue Apr 18 07:06:40 2023 -0400
 Compiler         : ifort version 2021.7.1
 Compilation Date : Apr 18, 2023 15:20:17

 MPI Enabled;    Number of MPI Processes:   2
 OpenMP Disabled

 MPI version: 3.1
 MPI library version: Intel(R) MPI Library 2021.6 for Linux* OS

 Job TITLE        :
 Job ID string    : bus

(the output remains unchanged after 5 minutes) and

TACC:  Starting up job 1365271
TACC:  Starting parallel tasks...

Note that I have tried running a simple Hello World Fortran program in parallel on TACC, using ibrun as above, and it worked.

Any idea as to what could be the issue? Any tips for debugging it are appreciated. Thank you! Juliette

mcgratta commented 9 months ago

Have a look at this and then we can address issues specific to your cluster.

JulietteFranque commented 9 months ago

Thank you for the quick reply. This is the page I used to install FDS on the cluster. I installed the pre-compiled version rather than cloning the git repo (should I have done the latter?). I typed ulimit -s unlimited in the terminal, which I had not done prior. I had a closer look at the instructions for SLURM and tried to submit a job in the format provided (using srun). Input and output files are shown below (it does not seem to be working properly).

#!/bin/bash
#SBATCH -J job_name
#SBATCH -e job_name.err
#SBATCH -o job_name.log
#SBATCH --partition=development
#SBATCH --ntasks=2
#SBATCH --nodes=2
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:05:00
export OMP_NUM_THREADS=1
srun -N 2 -n 2 --ntasks-per-node 1 /home1/07969/juliette/FDS/FDS6/bin/fds bus.fds

Output:

 Starting FDS ...

 MPI Process      0 started on c303-005.ls6.tacc.utexas.edu

 Reading FDS input file ...

WARNING: MPI_PROCESS set for MESH 2 and only one MPI process exists

 Fire Dynamics Simulator

 Current Date     : November 28, 2023  12:35:00
 Revision         : FDS-6.8.0-0-g886e009-release
 Revision Date    : Tue Apr 18 07:06:40 2023 -0400
 Compiler         : ifort version 2021.7.1
 Compilation Date : Apr 18, 2023 15:20:17

 MPI Enabled;    Number of MPI Processes:   1
 OpenMP Disabled

 MPI version: 3.1
 MPI library version: Intel(R) MPI Library 2021.6 for Linux* OS

 Job TITLE        :
 Job ID string    : bus

 Starting FDS ...

 MPI Process      0 started on c306-005.ls6.tacc.utexas.edu

 Reading FDS input file ...

WARNING: MPI_PROCESS set for MESH 2 and only one MPI process exists

 Fire Dynamics Simulator

 Current Date     : November 28, 2023  12:35:00
 Revision         : FDS-6.8.0-0-g886e009-release
 Revision Date    : Tue Apr 18 07:06:40 2023 -0400
 Compiler         : ifort version 2021.7.1
 Compilation Date : Apr 18, 2023 15:20:17

 MPI Enabled;    Number of MPI Processes:       1
 OpenMP Disabled
 MPI library version: Intel(R) MPI Library 2021.6 for Linux* OS

 Job TITLE        :
 Job ID string    : bus

 Time Step:  1, Simulation Time:      0.16 s
 Time Step:  2, Simulation Time:      0.32 s
 Time Step:  3, Simulation Time:      0.48 s
 Time Step:  4, Simulation Time:      0.64 s
 Time Step:  5, Simulation Time:      0.80 s
 Time Step:  6, Simulation Time:      0.96 s
 Time Step:  7, Simulation Time:      1.12 s
 Time Step:  8, Simulation Time:      1.28 s
 Time Step:  9, Simulation Time:      1.44 s
 Time Step:     10, Simulation Time:      1.60 s
 Time Step:     20, Simulation Time:      3.19 s
 Time Step:     30, Simulation Time:      4.79 s
 Time Step:     40, Simulation Time:      6.39 s
 Time Step:     50, Simulation Time:      7.98 s
 Time Step:     60, Simulation Time:      9.58 s
 Time Step:     63, Simulation Time:     10.00 s

STOP: FDS completed successfully (CHID: bus)
 Time Step:  1, Simulation Time:      0.16 s
 Time Step:  2, Simulation Time:      0.32 s
 Time Step:  3, Simulation Time:      0.48 s
 Time Step:  4, Simulation Time:      0.64 s
 Time Step:  5, Simulation Time:      0.80 s
 Time Step:  6, Simulation Time:      0.96 s
 Time Step:  7, Simulation Time:      1.12 s
 Time Step:  8, Simulation Time:      1.28 s
 Time Step:  9, Simulation Time:      1.44 s
 Time Step:     10, Simulation Time:      1.60 s
 Time Step:     20, Simulation Time:      3.19 s
 Time Step:     30, Simulation Time:      4.79 s
 Time Step:     40, Simulation Time:      6.39 s
 Time Step:     50, Simulation Time:      7.98 s
 Time Step:     60, Simulation Time:      9.58 s
 Time Step:     63, Simulation Time:     10.00 s

STOP: FDS completed successfully (CHID: bus)
MPI startup(): PMI server not found. Please set I_MPI_PMI_LIBRARY variable if it is not a singleton case.
MPI startup(): PMI server not found. Please set I_MPI_PMI_LIBRARY variable if it is not a singleton case.
marcosvanella commented 9 months ago

Hi Juliette, do you load your environment modules (compilers, MPI, etc.) in your .bashrc file? If not, load them in the submission script. Then try mpirun instead of srun:

mpirun -n 2 /home1/07969/juliette/FDS/FDS6/bin/fds bus.fds
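
For concreteness, a minimal sketch of what that could look like inside the submission script (the module names are taken from the module list output earlier in this thread; adjust them to whatever your cluster actually provides):

# Load the compiler and MPI environment explicitly before launching FDS
module load intel/19.1.1 impi/19.0.9
export OMP_NUM_THREADS=1
mpirun -n 2 /home1/07969/juliette/FDS/FDS6/bin/fds bus.fds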

JulietteFranque commented 9 months ago

Hi Marcos, No, I do not load anything. Excuse my ignorance, but which specific modules should I be loading? My .profile file only contains:

source /home1/07969/juliette/FDS/FDS6/bin/FDS6VARS.sh
source /home1/07969/juliette/FDS/FDS6/bin/SMV6VARS.sh

# -*- shell-script -*-
# TACC startup script: ~/.profile version 2.1  -- 12/17/2013

# This file is sourced on login shells but only if ~/.bash_profile
# or ~/.bash_login do not exist! This file is not sourced on
# non-login interactive shells.

# We recommend you place in your ~/.bashrc everything you wish to be
# available in both login and non-login interactive shells. Remember
# that any commands that are placed here will be unavailable in VNC
# and other interactive non-login sessions.

# Note that ~/.bashrc is not automatically sourced on login shells.
# By sourcing here, we insure that login and non-login shells have
# consistent environments.

if [ -f $HOME/.bashrc ]; then
  source $HOME/.bashrc
fi
marcosvanella commented 9 months ago

Sorry, I missed your comment on using the precompiled version. Have you loaded the intel/19.1.1 and impi modules? Note that the bundled fds is compiled with ifort version 2021.7.1, so you might be getting Intel libraries mixed up. Try unloading those modules and using mpirun.

JulietteFranque commented 9 months ago

I had not loaded them (at least not as far as I know), but they were loaded somehow (I could see them with module list). I unloaded both intel/19.1.1 and impi as you suggested and used mpirun (mpirun -n 2 /home1/07969/juliette/FDS/FDS6/bin/fds bus.fds) instead of srun. However, I get the error /var/spool/slurmd/job1365738/slurm_script: line 13: mpirun: command not found when the job starts. Note that the cluster website says not to use mpirun, but instead to use "ibrun".

JulietteFranque commented 9 months ago

If it's of any use, after unloading the modules, I also tried running it using ibrun:

#!/bin/bash
                          # Use bash shell
#SBATCH -J myjob          # Job Name
#SBATCH -o myjob.o      # Name of the output file (myMPI.oJobID)
#SBATCH -p development    # Queue name
#SBATCH -t 00:02:00       # Run time (hh:mm:ss) - 2 minutes
#SBATCH -N 2       
#SBATCH -n 2             
#SBATCH -e myjob.e    # Direct error to the error file

#SBATCH --mail-type=all    # Send email at begin and end of job
#SBATCH --mail-user=juliette.franqueville@utexas.edu

module list
export OMP_NUM_THREADS=1
# Any other commands must follow all #SBATCH directives...
ibrun /home1/07969/juliette/FDS/FDS6/bin/fds  bus.fds

This time the error is

  1) cmake/3.24.2   2) pmix/3.2.3   3) xalt/2.10.32   4) TACC

Inactive Modules:
  1) python3

srun: cluster configuration lacks support for cpu binding
/home1/07969/juliette/FDS/FDS6/bin/fds: error while loading shared libraries: libmpifort.so.12: cannot open shared object file: No such file or directory
srun: error: c302-006: task 0: Exited with exit code 127
srun: launch/slurm: _step_signal: Terminating StepId=1365740.0
/home1/07969/juliette/FDS/FDS6/bin/fds: error while loading shared libraries: libmpifort.so.12: cannot open shared object file: No such file or directory
srun: error: c310-006: task 1: Exited with exit code 127
marcosvanella commented 9 months ago

Do you have Intel OneAPI as a module on this machine? Might be easier in the end to compile the code from source.

JulietteFranque commented 9 months ago

I tried module keyword "One"; the only module returned was oneapi_rk: oneapi_rk/2021.4.0. Is this OneAPI or something different?

marcosvanella commented 9 months ago

You probably want to send an email to the IT person managing the cluster to make sure. A way to test whether it provides the Intel Fortran compiler: unload the previous Intel modules, load this oneapi_rk module, and type ifort.
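
For example, a sketch of that check (the oneapi_rk module name is the one mentioned above; the exact module names to unload may differ on your cluster):

module unload intel impi
module load oneapi_rk/2021.4.0
which ifort && ifort --version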

JulietteFranque commented 9 months ago

-bash: ifort: command not found

Sounds like it's a no... Should I start a ticket with IT to confirm whether they have OneAPI? Regarding the other option (compiling the code from source), given the modules that we know I do have, how would I go about compiling it myself? I see some directions here: https://github.com/firemodels/fds/wiki/Git-Notes-Getting-Started. Thanks!

Edit: https://www.tacc.utexas.edu/use-tacc/software-list/ has a list of all available packages.

marcosvanella commented 9 months ago

I would ask them whether they have the latest Intel compiler and library suite. Otherwise you can try using the Intel compiler 19 you have; this might fail, as we use some Fortran 2018 features that might not be implemented in that version. Yes, you clone the repo, go to fds/Build/impi_intel_linux, and type ./make_fds.sh.
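
In case it helps, those steps amount to something like the following (a sketch; it assumes the appropriate Intel compiler and MPI modules are already loaded):

git clone https://github.com/firemodels/fds.git
cd fds/Build/impi_intel_linux
./make_fds.sh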

mcgratta commented 9 months ago

This is how I run a 2-process case on our cluster:

#!/bin/bash
#SBATCH -J junk
#SBATCH -e /home4/mcgratta/Test/junk.err
#SBATCH -o /home4/mcgratta/Test/junk.log
#SBATCH --partition=batch
#SBATCH --ntasks=2
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=2
#SBATCH --time=99-99:99:99
export OMP_NUM_THREADS=1
srun -N 1 -n 2 --ntasks-per-node 2 --mpi=pmi2 /home4/mcgratta/FDS/FDS6/bin/fds junk.fds

JulietteFranque commented 9 months ago

Hi Kevin, I think that I finally got it to work - the only thing I added compared to my first attempt was "module load gcc" (still used the pre-compiled version of fds). Would you mind checking my output files to verify that it is running in parallel?

input:

#!/bin/bash

#SBATCH -J myjob          
#SBATCH -o myjob.o     
#SBATCH -p development    
#SBATCH -t 00:02:00       
#SBATCH -N 2       
#SBATCH -n 2             
#SBATCH -e myjob.e    # Direct error to the error file

#SBATCH --mail-type=all   
#SBATCH --mail-user=juliette.franqueville@utexas.edu
module load gcc
module list
export OMP_NUM_THREADS=1

ibrun fds  bus.fds

output:

Currently Loaded Modules:
  1) cmake/3.24.2   3) xalt/2.10.32   5) gcc/11.2.0  7) impi/19.0.9
  2) pmix/3.2.3     4) TACC           6) python3/3.9.7

 Starting FDS ...

 MPI Process      0 started on c305-005.ls6.tacc.utexas.edu
 MPI Process      1 started on c305-006.ls6.tacc.utexas.edu

 Reading FDS input file ...

 Fire Dynamics Simulator

 Current Date     : November 28, 2023  19:27:38
 Revision         : FDS-6.8.0-0-g886e009-release
 Revision Date    : Tue Apr 18 07:06:40 2023 -0400
 Compiler         : ifort version 2021.7.1
 Compilation Date : Apr 18, 2023 15:20:17

 MPI Enabled;    Number of MPI Processes:   2
 OpenMP Disabled

 MPI version: 3.1
 MPI library version: Intel(R) MPI Library 2019 Update 9 for Linux* OS

 Job TITLE        :
 Job ID string    : bus

 Time Step:  1, Simulation Time:      0.16 s
 Time Step:  2, Simulation Time:      0.32 s
 Time Step:  3, Simulation Time:      0.48 s
 Time Step:  4, Simulation Time:      0.64 s
 Time Step:  5, Simulation Time:      0.80 s
 Time Step:  6, Simulation Time:      0.96 s
 Time Step:  7, Simulation Time:      1.12 s
 Time Step:  8, Simulation Time:      1.28 s
 Time Step:  9, Simulation Time:      1.44 s
 Time Step:     10, Simulation Time:      1.60 s
 Time Step:     20, Simulation Time:      3.19 s
 Time Step:     30, Simulation Time:      4.79 s
 Time Step:     40, Simulation Time:      6.39 s
 Time Step:     50, Simulation Time:      7.98 s
 Time Step:     60, Simulation Time:      9.58 s
 Time Step:     63, Simulation Time:     10.00 s

STOP: FDS completed successfully (CHID: bus)
TACC:  Starting up job 1366047
TACC:  Starting parallel tasks...
TACC:  Shutdown complete. Exiting.

fds output files

bus.binfo.txt bus_steps.csv bus.out.txt bus_git.txt bus_hrr.csv bus_cpu.csv bus.end.txt bus.sinfo.txt

Thank you both so much for your help!

mcgratta commented 9 months ago

That looks right. Weird that gcc has anything to do with it, but modules often have path or environment variable definitions that are needed to run an MPI job. It appears that your job spawned processes on two different nodes, which is a good indicator that things are working. The thing to always check is the list of nodes that are being used. If you can log in to the nodes that are running a job, you can use the "top" command to see a list of the processes and their memory and CPU usage.
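
For example, a quick way to do that check (a sketch using standard SLURM and Linux commands; the node name is just one taken from the output above):

squeue -u $USER     # list your running jobs and the nodes they occupy
ssh c305-005        # log in to one of those compute nodes, if your cluster allows it
top                 # confirm the fds process on that node is using ~100% CPU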

I'll close the issue, but open it up again if you have problems, or create a new issue if it's unrelated.

JulietteFranque commented 9 months ago

Thanks!
