The case I am running is spread out over three 64-core nodes: 13 + 64 + 31 processes.
The one I started with 54 cores each on two 64-core nodes is still going ¯\_(ツ)_/¯
My job is still going at 9.2 s.
Thanks. The job I started as -p 108 -n 54 is still running fine. Might be some issue with our machine.
Still going
Time Step: 1597100, Simulation Time: 14.863495 s
Thanks. I think this is pointing to something on our end, though I'm not sure what. Other cases running over multiple nodes with the new qfds.sh run OK, and the spot where it was dying involves only a tiny amount of data transfer.
I stopped the job on our cluster because we were testing bigger fish. I'm going to close this issue, but if another job consistently hangs, let's open a new one.
A couple of weeks ago the attached case ran fine. Now it hangs. I put in some debug output, and it showed that 10 of the 108 processes were hanging on this line in main.f90, in CALCULATE_RTE_SOURCE_CORRECTION_FACTOR:
CALL MPI_ALLREDUCE(RAD_Q_SUM,RAD_Q_SUM_ALL,1,MPI_DOUBLE_PRECISION,MPI_SUM,MPI_COMM_WORLD,IERR)
The 10 processes were running on the same node of our cluster and are all of the processes assigned to that node. Other FDS jobs running over multiple nodes haven't hung. Also, it doesn't happen right away: it took about 18000 time steps (~0.3 s). Other variants of this case (different diluent gases) also now hang, but at different times (up to ~1.8 s). They would have necessarily had the same node allocation.
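For reference, here is a minimal, self-contained sketch of the sort of before/after print instrumentation that identifies the stuck ranks (this is not my actual debug code or the FDS source; the RAD_Q_SUM names are borrowed only for illustration). Any rank that prints the "before" line but never the "after" line is the one blocked in the collective.

! Minimal sketch: per-rank prints around an MPI_ALLREDUCE to spot hung ranks.
! Names like RAD_Q_SUM are placeholders, not the real FDS variables.
PROGRAM ALLREDUCE_HANG_DEBUG
USE MPI
IMPLICIT NONE
INTEGER :: IERR, MY_RANK, N_RANKS
REAL(8) :: RAD_Q_SUM, RAD_Q_SUM_ALL

CALL MPI_INIT(IERR)
CALL MPI_COMM_RANK(MPI_COMM_WORLD, MY_RANK, IERR)
CALL MPI_COMM_SIZE(MPI_COMM_WORLD, N_RANKS, IERR)

RAD_Q_SUM = REAL(MY_RANK,8)   ! stand-in for the per-rank radiation sum

WRITE(6,'(A,I4,A)') 'Rank ', MY_RANK, ': before MPI_ALLREDUCE'
FLUSH(6)                      ! flush so the message survives a hang

CALL MPI_ALLREDUCE(RAD_Q_SUM,RAD_Q_SUM_ALL,1,MPI_DOUBLE_PRECISION,MPI_SUM,MPI_COMM_WORLD,IERR)

WRITE(6,'(A,I4,A)') 'Rank ', MY_RANK, ': after MPI_ALLREDUCE'
FLUSH(6)

CALL MPI_FINALIZE(IERR)
END PROGRAM ALLREDUCE_HANG_DEBUG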
This is a 108-core case using CVODE, run as qfds.sh -p 108.
While putting this issue together, it occurred to me that this may be related to the qfds.sh change, so I just launched the case again running as qfds.sh -p 108 -n 54, as was done a couple of weeks ago. I will update the issue with what happens with that.
cup_dns_c3h8_co2_p5_a.txt