firemodels / fds

Fire Dynamics Simulator
https://pages.nist.gov/fds-smv/

ht3d_sphere_96 fails due to mpi timeout on particles #12567

Closed johodges closed 6 months ago

johodges commented 6 months ago

Describe the bug
I was running firebot on our cluster last night and had a clean run other than an error in stage 5 (release verification). I am running the case on a single node with heavy oversubscription (8 cores for the 64 processes). Interestingly, it failed before the first time step, but it was able to complete one time step during the debug verification.

I was also a bit surprised to see it flag a timeout in particles, since I do not see any particles defined in the input file. I can increase the MPI_TIMEOUT as suggested in https://github.com/firemodels/fds/issues/11322, which may fix the issue, but I was curious whether the particles aspect was pointing to an underlying problem.
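For reference, a minimal sketch of that workaround, assuming MPI_TIMEOUT is accepted on the MISC line of the FDS input file as a wait time in seconds (the parameter name comes from the linked issue; the value and placement here are illustrative assumptions, not taken from this case's input file):

```
&MISC MPI_TIMEOUT=60. /  ! raise the MPI exchange timeout (seconds); value illustrative
```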

Fire Dynamics Simulator

Current Date : February 27, 2024 04:36:17
Revision : FDS-6.8.0-1658-g66a99ae-jh-firebot
Revision Date : Mon Feb 26 18:56:56 2024 -0500
Compiler : Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.11.1 Build 20231117_000000
Compilation Date : Feb 26, 2024 18:33:56

Number of MPI Processes: 64

MPI version: 3.1
MPI library version: Intel(R) MPI Library 2021.11 for Linux* OS

Job TITLE : Heat transfer in solid sphere
Job ID string : ht3d_sphere_96

ERROR: MPI exchange of particles timed out for MPI process 11 running on n0009. FDS will abort.


mcgratta commented 6 months ago

I'll take a look.

johodges commented 6 months ago

The plot thickens. I submitted the same job (on a better node) and it was able to initialize; it ran 80 time steps but then died with a bad termination and no other warnings or errors.

= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 63 PID 39127 RUNNING AT n0016
= KILLED BY SIGNAL: 9 (Killed)

mcgratta commented 6 months ago

Run it in debug mode.

johodges commented 6 months ago

I think it was a memory issue. The node has ~50 GB of memory and the job was trying to allocate more than that. I resubmitted it to a node with more memory, and the job is using ~150 GB. That seems like a lot of memory for ~900k cells with SOLID_PHASE_ONLY=T. Does the CELL_SIZE need to be 0.0025 for this verification case to accomplish its objective?

mcgratta commented 6 months ago

The case is a big sphere with a 3D heat conduction calculation, which requires a 1-D grid at each surface cell, and each interior node stores information about the other two coordinate directions. So it is a memory hog. This is part of a convergence study, and the objective is to demonstrate convergence at fine resolution.
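As a rough back-of-envelope check, the numbers reported in this thread work out to a very large per-cell footprint (this is just the quotient of the two reported figures, not a breakdown of FDS internals):

```python
# Observed job footprint, approximate values from this thread.
total_bytes = 150e9   # ~150 GB reported memory use
n_cells = 900e3       # ~900k cells in the case

bytes_per_cell = total_bytes / n_cells
print(f"{bytes_per_cell / 1e3:.0f} kB per cell")  # prints "167 kB per cell"
```

Roughly 170 kB per cell is consistent with each surface cell carrying its own 1-D in-depth grid plus cross-direction bookkeeping, rather than a single scalar state per cell.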

You need a bigger boat.

mcgratta commented 6 months ago

The original error message needs to be fixed. The MPI timeout did not involve particles, but rather surface/solid phase info.

johodges commented 6 months ago

Thanks for the info. I can modify the version of qfds I am using to specify the memory the job requires, so that it gets assigned to one of our larger nodes.
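If the cluster's scheduler is Slurm, that can be as simple as adding a memory request to the job's batch directives, e.g. (the value is illustrative; the exact mechanism inside a site's qfds wrapper will vary):

```
#SBATCH --mem=200G   # ask the scheduler for a node with at least 200 GB
```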

mcgratta commented 6 months ago

Or you can get a bigger boat ;)