firemodels / fds

Fire Dynamics Simulator
https://pages.nist.gov/fds-smv/

MPI job hanging #13304

Closed drjfloyd closed 3 months ago

drjfloyd commented 3 months ago

A couple of weeks ago the attached case ran fine. Now it hangs. I put in some debug output, and it showed that 10 of the 108 processes were hanging at this line in CALCULATE_RTE_SOURCE_CORRECTION_FACTOR in main.f90:

CALL MPI_ALLREDUCE(RAD_Q_SUM,RAD_Q_SUM_ALL,1,MPI_DOUBLE_PRECISION,MPI_SUM,MPI_COMM_WORLD,IERR)

The 10 hanging processes were all running on the same node of our cluster and are all of the processes assigned to that node. Other FDS jobs running over multiple nodes haven't hung. Also, it doesn't happen right away; it took about 18000 time steps (~0.3 s). Other variants of this case (different diluent gases) also now hang, but at different times (up to ~1.8 s). They would necessarily have had the same node allocation.
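For context, MPI_ALLREDUCE is a blocking collective: no rank can return from it until every rank in MPI_COMM_WORLD has made the matching call, so the 10 ranks stuck at this line imply the other 98 never reached (or had already passed) this point. Below is a minimal standalone sketch of that failure mode; it is not FDS code, and the rank-0 skip is just an artificial stand-in for whatever keeps the other ranks from arriving at the collective.

PROGRAM ALLREDUCE_HANG_SKETCH
! Minimal sketch, not FDS code: every rank in MPI_COMM_WORLD must call a
! blocking collective such as MPI_ALLREDUCE before any caller can return.
! Here rank 0 deliberately skips the call, so all other ranks hang in it,
! mirroring the symptom above (a subset of ranks stuck in the collective
! because the rest never reached it).
USE MPI
IMPLICIT NONE
INTEGER :: IERR, MY_RANK
DOUBLE PRECISION :: RAD_Q_SUM, RAD_Q_SUM_ALL

CALL MPI_INIT(IERR)
CALL MPI_COMM_RANK(MPI_COMM_WORLD, MY_RANK, IERR)

RAD_Q_SUM = DBLE(MY_RANK)

IF (MY_RANK /= 0) THEN
   ! These ranks block here indefinitely because rank 0 never joins the collective.
   CALL MPI_ALLREDUCE(RAD_Q_SUM,RAD_Q_SUM_ALL,1,MPI_DOUBLE_PRECISION,MPI_SUM,MPI_COMM_WORLD,IERR)
   WRITE(*,*) 'Rank ',MY_RANK,' sees total ',RAD_Q_SUM_ALL  ! never reached
ENDIF

CALL MPI_FINALIZE(IERR)  ! also a collective, so the program never exits cleanly
END PROGRAM ALLREDUCE_HANG_SKETCH

Building and running this with something like mpifort followed by mpirun -n 4 hangs by design; it only illustrates the semantics, not the root cause here.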

This is a 108-core case using CVODE, run as qfds.sh -p 108.

While putting this issue together, it occurred to me that this might be related to the qfds.sh change. So I just launched the case again as qfds.sh -p 108 -n 54, as was done a couple of weeks ago. I will update the issue with what happens with that.

cup_dns_c3h8_co2_p5_a.txt

mcgratta commented 3 months ago

The case I am running is spread out over three 64-core nodes -- 13 + 64 + 31 processes.

drjfloyd commented 3 months ago

The one I started with 54 cores each on two 64-core nodes is still going ¯\_(ツ)_/¯

mcgratta commented 3 months ago

My job is still going at 9.2 s.

drjfloyd commented 3 months ago

Thanks. The job I started as -p 108 -n 54 is still running fine. Might be some issue with our machine.

mcgratta commented 3 months ago

Still going:

 Time Step: 1597100, Simulation Time: 14.863495 s

drjfloyd commented 3 months ago

Thanks. I think this is pointing to something on our end, though I'm not sure what. Other cases running over multiple nodes with the new qfds.sh run OK, and the point where it was hanging involves only a tiny amount of data transfer.

mcgratta commented 3 months ago

I stopped the job on our cluster because we were testing bigger fish. I'm going to close this issue, but if another job consistently hangs, let's open a new one.