firemodels / fds

Fire Dynamics Simulator
https://pages.nist.gov/fds-smv/

REQ1 timed out for MPI process_________Fatal error in PMPI_Startall #11109

Closed jfpe37 closed 2 years ago

jfpe37 commented 2 years ago

Hello everyone, I am trying to run a simulation and get the attached error message every time, at the same point in the run. I am using the newest version of FDS, 6.7.9. I found that the simulation hangs at the time the first SPRK opens. In addition, on the computer where I get the error, I ran the same file without activating the sprinklers and did not get the error. I have also run the simulation on another computer with an older FDS version; there is no error, but the run is very slow. Does anyone know the reason for the error? Is it really related to the sprinklers? Is my SPRK definition correct? I would appreciate your help. Attachments: Capture, Capture1, semel.txt

mcgratta commented 2 years ago

I will run the case.

mcgratta commented 2 years ago

Try adding this line to the input file:

&MISC MPI_TIMEOUT=300 /

If it works, I will explain why. If it doesn't, we'll try something else.

jfpe37 commented 2 years ago

Hi, it doesn't work

Capture2

mcgratta commented 2 years ago

OK. The problem is that the sprinkler activations and droplet initialization in MESH 5 (MPI process 4) are taking so much time that the other processes have moved on to the next phase of the time step and overlapped the existing MPI communication. I'll see if I can prevent this. You might want to try offsetting your sprinkler activation times to limit the amount of work done in a single time step.
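One way to offset the activation times is to open each sprinkler at a prescribed time instead of letting them all trigger in the same time step. The sketch below is only an illustration: it generates `&DEVC` lines that use `QUANTITY='TIME'` with a staggered `SETPOINT`. The positions, the `PROP_ID` name `my_sprinkler`, the device IDs, and the 2 s offset are all hypothetical and are not taken from the attached input file. Note that time-based activation replaces activation by link temperature, so it changes what the model predicts.

```python
def staggered_sprinkler_devcs(positions, prop_id, first_time, offset):
    """Return one &DEVC line per sprinkler; each opens `offset` seconds after the previous one."""
    lines = []
    for i, (x, y, z) in enumerate(positions):
        setpoint = first_time + i * offset
        lines.append(
            f"&DEVC ID='SPRK_{i + 1}', XYZ={x:.2f},{y:.2f},{z:.2f}, "
            f"PROP_ID='{prop_id}', QUANTITY='TIME', SETPOINT={setpoint:.1f} /"
        )
    return lines


# Hypothetical sprinkler positions (m); replace with the real ones.
positions = [(1.0, 1.0, 2.9), (4.0, 1.0, 2.9), (7.0, 1.0, 2.9)]
devc_lines = staggered_sprinkler_devcs(positions, "my_sprinkler", first_time=60.0, offset=2.0)
print("\n".join(devc_lines))
```

The generated lines can be pasted into the input file in place of the existing sprinkler devices, provided a `&PROP` line with the matching ID already exists.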

mcgratta commented 2 years ago

Plan B: remove the parameter MPI_TIMEOUT and add the following to the MISC line:

&MISC PROFILING=T /

This parameter will allow the MPI communication an infinite amount of time to complete. Your case is running for me, but very slowly.

mcgratta commented 2 years ago

Setting PROFILING=T is working for me on a test case. I took the HVAC lines out of the input file to eliminate that as a cause of the problem. Let me know if it works for you.

jfpe37 commented 2 years ago

Thanks. I ran with the &MISC PROFILING=T / line; the run reached the time step where it usually got stuck and passed it in one time step. It is currently stuck again, but I still haven't received an error message. Is there another possible solution? Thanks for the help. Capture3

mcgratta commented 2 years ago

It took 12 hours for the calculation to progress from 100 to 200 time steps. There are about 500000 droplets in mesh 5. Try reducing the number of droplets. Also, why MONODISPERSE? Check if this affects the outcome or timing.
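A back-of-the-envelope estimate can show which knob reduces the droplet count the most. The sketch below assumes the number of droplets alive at once is roughly the insertion rate times how long a droplet stays in the domain (the smaller of `AGE` and the time it takes to leave). The sprinkler count of 20 is hypothetical, and 5000 droplets per second per sprinkler is what I recall as the default for `PARTICLES_PER_SECOND` on the `&PROP` line; please check both against the input file and the User's Guide.

```python
def steady_state_droplets(n_sprinklers, particles_per_second, residence_time_s):
    """Rough count of droplets alive at once: insertion rate times residence time."""
    return int(n_sprinklers * particles_per_second * residence_time_s)


# Hypothetical inputs, chosen only to illustrate the scaling.
baseline = steady_state_droplets(20, 5000, 5.0)  # assumed default insertion rate
reduced = steady_state_droplets(20, 1000, 5.0)   # five times fewer droplets per second
print(baseline, reduced)
```

Under these assumptions, the count scales linearly with each factor, so cutting the insertion rate or the droplet lifetime by a given factor cuts the particle load by the same factor.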

jfpe37 commented 2 years ago

Hi, thank you very much for all the help. I ran the simulation with the &MISC PROFILING=T / line as you suggested, and I also increased the number of meshes from 6 to 10 (in the area where there were lots of particles). In addition, I changed the droplet parameters to these:

&PART ID='Water', SPEC_ID='WATER VAPOR', DIAMETER=500.0, AGE=5.0, SAMPLING_FACTOR=10/

The simulation is progressing at a good rate. Are these reasonable parameters for sprinklers when we are trying to create a flooding situation? In addition, can you explain what the command &MISC PROFILING=T does?

mcgratta commented 2 years ago

AGE may not matter. By default, FDS removes water droplets that reach the lower boundary of the domain, even if it is a solid floor. You can change this behavior by setting POROUS_FLOOR=F on the MISC line. SAMPLING_FACTOR just affects the number of droplets written out to a file for Smokeview. I don't think this matters too much, but it does save on I/O and Smokeview loading times. Removing MONODISPERSE might have helped, but I'm not exactly sure why. Do you know which parameter change made the greatest difference?

As far as PROFILING=T --- we added this for when we are profiling the code; that is, determining how much CPU time is used by the various subroutines. When we set PROFILING=T, we do not put a time limit on the length of time allocated for MPI communication. In effect, we do not stop the code if it appears to be taking too long to finish an MPI communication. In your case, this was helpful, but it wasn't the original intent of this parameter.
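The difference between a time-limited wait and an unlimited one can be sketched as follows. This is only an analogy in Python, not the FDS source (which is Fortran and uses MPI); the function names and the `profiling` flag are hypothetical stand-ins for the behavior described above.

```python
import threading


def wait_for_exchange(done, timeout_s, profiling=False):
    """Return True if the exchange finished, False if the time limit expired."""
    limit = None if profiling else timeout_s  # None means wait indefinitely
    return done.wait(timeout=limit)


def slow_exchange(delay_s):
    """Hypothetical stand-in for a busy process: signals completion after delay_s seconds."""
    done = threading.Event()
    threading.Timer(delay_s, done.set).start()
    return done


# With a 0.05 s limit, a 0.2 s exchange is reported as timed out.
timed_out = not wait_for_exchange(slow_exchange(0.2), timeout_s=0.05)
# With the limit removed, the same exchange is simply waited for.
finished = wait_for_exchange(slow_exchange(0.2), timeout_s=0.05, profiling=True)
print(timed_out, finished)
```

The trade-off is the same as in the thread: without a limit, a slow process no longer triggers an error, but a truly hung run will also wait forever.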

jfpe37 commented 2 years ago

Hi Kevin, I'm not sure which of the parameters had the main effect, AGE or SAMPLING_FACTOR, but in my opinion the main effect came from adding meshes in the area where there were many particles. I split mesh #5, which kept getting stuck, into 3 smaller meshes that now each contain fewer particles. Thank you very much for your help, Mor
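For anyone wanting to do the same kind of split, a minimal sketch of dividing one mesh into equal sub-meshes along one axis might look like this. The bounds, cell counts, and mesh IDs are hypothetical (the real mesh #5 dimensions are in the attached input file, not in this thread), and the helper names are my own.

```python
def split_mesh(xb, ijk, n, axis=0):
    """Split one mesh into n equal sub-meshes along axis (0=x, 1=y, 2=z)."""
    if ijk[axis] % n != 0:
        raise ValueError("cell count along the split axis must be divisible by n")
    lo, hi = xb[2 * axis], xb[2 * axis + 1]
    width = (hi - lo) / n
    meshes = []
    for k in range(n):
        sub_xb = list(xb)
        sub_xb[2 * axis] = lo + k * width
        sub_xb[2 * axis + 1] = lo + (k + 1) * width
        sub_ijk = list(ijk)
        sub_ijk[axis] = ijk[axis] // n  # keeps the cell size unchanged
        meshes.append((sub_xb, sub_ijk))
    return meshes


def mesh_line(mesh_id, xb, ijk):
    """Format one &MESH line for an FDS input file."""
    xb_text = ",".join(f"{v:.2f}" for v in xb)
    ijk_text = ",".join(str(v) for v in ijk)
    return f"&MESH ID='{mesh_id}', IJK={ijk_text}, XB={xb_text} /"


# Hypothetical mesh: 6 m x 4 m x 3 m with 0.1 m cells, split in three along x.
parts = split_mesh(xb=(0.0, 6.0, 0.0, 4.0, 0.0, 3.0), ijk=(60, 40, 30), n=3, axis=0)
mesh_lines = [mesh_line(f"mesh_5{'abc'[k]}", xb, ijk) for k, (xb, ijk) in enumerate(parts)]
print("\n".join(mesh_lines))
```

Splitting along the axis where the particles are most spread out should share the droplet load most evenly, and each sub-mesh needs its own MPI process to run in parallel.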

mcgratta commented 2 years ago

Yes, that makes sense, assuming you have enough cores to process the meshes in parallel.

I'm going to keep this issue open to remind myself to do something about the MPI_TIMEOUT.

mcgratta commented 2 years ago

I increased MPI_TIMEOUT and now the code aborts if it hits this limit rather than have the other processes run too far ahead and crash.