firemodels / fds

Fire Dynamics Simulator
https://pages.nist.gov/fds-smv/

FDS+Evac verification case fails if run with 1 MPI process #2251

Closed gforney closed 9 years ago

gforney commented 9 years ago
We would like all cases to run with the MPI version of FDS, even if the case currently
runs with the serial version. The test case

Verification/evac_smv_testcase2.fds

fails with an error if run with the MPI executable and number of MPI processes set
to 1. I see why the error occurs, but cannot figure out a way to prevent it without
causing a larger problem.

Original issue reported on code.google.com by mcgratta on 2014-11-23 21:21:06

gforney commented 9 years ago
Ok, I'll take a look at this. It might be that I have not updated
the evacuation .fds files lately. I just got the latest SVN
source code today and ran it using my simple test case, where
I check that:

 serial fds fire = serial fds+evac (for fire outputs)
 serial fds+evac = mpi fds+evac (for fire + evac outputs)
 serial fds+evac = openmp fds+evac (serial fds+evac means
                   that I force just one core)

Wbr,
Timo

Original issue reported on code.google.com by tkorhon1 on 2014-12-18 15:01:01

gforney commented 9 years ago
Well, I updated the input files for the evacuation verification cases.
Now they should work with the serial version (SVN Revision No. 21192). Next I'll
check the parallel version.

Ok, now I see the problem. I'll fix this. Something has been
changed lately in the source code...

Timo

Original issue reported on code.google.com by tkorhon1 on 2014-12-19 14:15:32

gforney commented 9 years ago
I made the changes in read.f90 (Revision: 21212).

The variables NMESHES_EVAC, NMESHES_FIRE, and NM_EVAC are added to READ_MESH.
These are counters that I need now. I cannot yet use the EVACUATION_ONLY
or EVACUATION_GRID arrays because they do not yet have correct
values; they are initialized properly for the meshes later (well,
at least for some additional evacuation meshes, like STRS meshes).
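
A minimal sketch of what such counters might look like, assuming a hypothetical per-mesh logical flag (here called EVACUATION_MESH; the real read.f90 may use a different mechanism) that marks evacuation meshes as they are read:

! Hypothetical sketch only: tally fire and evacuation meshes while the
! MESH lines are processed. EVACUATION_MESH is an illustrative flag.
NMESHES_FIRE = 0
NMESHES_EVAC = 0
DO NM=1,NMESHES
   IF (EVACUATION_MESH(NM)) THEN
      NMESHES_EVAC = NMESHES_EVAC + 1
   ELSE
      NMESHES_FIRE = NMESHES_FIRE + 1
   ENDIF
ENDDO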

Two checks needed to be changed; the first was:

! Stop the calculation if the number of MPI processes is greater than the number of meshes

IF (NO_EVACUATION) THEN
   IF (NMESHES<NUMPROCS) THEN
      CALL MPI_FINALIZE(IERR)
      WRITE(MESSAGE,'(A,I3,A,I3)') 'ERROR: The number of MPI processes, ',NUMPROCS,&
           ', exceeds the number of meshes, ',NMESHES
      CALL SHUTDOWN(MESSAGE) ; RETURN
   ENDIF
ELSE
   IF(NMESHES_FIRE+1<NUMPROCS) THEN
      CALL MPI_FINALIZE(IERR)
      WRITE(MESSAGE,'(A,I3,A,I3)') 'ERROR: The number of MPI processes, ',NUMPROCS,&
           ', exceeds the number of fire meshes + 1, ',NMESHES_FIRE+1
      CALL SHUTDOWN(MESSAGE) ; RETURN
   ENDIF
ENDIF

Here I used NO_EVACUATION to make sure that the plain fire
calculation is not broken.

The second change was:

! Check for bad mesh ordering if MPI_PROCESS used

IF (USE_MPI) THEN
   DO NM=1,NMESHES
      IF (NM==1) CYCLE
      IF (EVACUATION_ONLY(NM)) CYCLE
      IF (PROCESS(NM) < PROCESS(NM-1)) THEN
         WRITE(MESSAGE,'(A,I3,A,I3,A)') 'ERROR: MPI_PROCESS for MESH ',NM,' < MPI_PROCESS for MESH ',NM-1,&
                                        '. Reorder MESH lines.'
         CALL SHUTDOWN(MESSAGE) ; RETURN         
      ENDIF
   ENDDO
   DO NM=1,NMESHES
      IF (NM==1 .OR. .NOT.EVACUATION_ONLY(NM)) CYCLE
      IF (.NOT.EVACUATION_GRID(NM)) CYCLE
      IF (PROCESS(NM) < PROCESS(NM-1)) THEN
         WRITE(MESSAGE,'(A,I3,A,I3,A)') 'ERROR: MPI_PROCESS for evacuation MESH ',NM,' < MPI_PROCESS for MESH ',NM-1,&
                                        '. Reorder MESH lines.'
         CALL SHUTDOWN(MESSAGE) ; RETURN         
      ENDIF
   ENDDO
ENDIF

This works because I have added the following lines somewhere so
that only the fire meshes have EVACUATION_ONLY set to false:

CALL ChkMemErr('READ','EVACUATION_ONLY',IZERO)
EVACUATION_ONLY(1:NMESHES_FIRE) = .FALSE.
IF (NMESHES_FIRE<NMESHES) EVACUATION_ONLY(NMESHES_FIRE+1:NMESHES) = .TRUE.

TimoK

Original issue reported on code.google.com by tkorhon1 on 2014-12-19 15:46:23

gforney commented 9 years ago
I'm not sure how this is related, but in any case, when I run the MPI version of FDS
in debug mode:

Evacuation/evac_smv_testcase1.err:forrtl: severe (408): fort: (8): Attempt to fetch
from allocatable variable EXTERIOR when it is not allocated

Original issue reported on code.google.com by mcgratta on 2014-12-19 17:54:03

gforney commented 9 years ago
Ok, I'll check this. This is probably some other issue, but let's fix
it under this issue. It might be that I do not set EXTERIOR for
my additional evacuation meshes (STRS meshes and visibility meshes).
These meshes are there just to store some values (U, V, and obstruction
info at different levels), so no flow solver is used for these meshes.
But I should check that enough variables are allocated (even if they
are not actually used, just initialized).
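
A minimal sketch of the kind of allocation guard this would need, assuming EXTERIOR is a per-mesh allocatable logical array; the upper bound UBOUND_EXTERIOR is purely illustrative and stands in for whatever extent the real code uses:

! Hypothetical sketch only: make sure EXTERIOR exists for the extra
! evacuation meshes even though the flow solver never touches it there.
IF (.NOT.ALLOCATED(M%EXTERIOR)) THEN
   ALLOCATE(M%EXTERIOR(0:UBOUND_EXTERIOR),STAT=IZERO)
   CALL ChkMemErr('READ','EXTERIOR',IZERO)
   M%EXTERIOR = .FALSE.
ENDIF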

TimoK

Original issue reported on code.google.com by tkorhon1 on 2014-12-22 08:15:07

gforney commented 9 years ago

NewSmokey (~/FDS+Evac/FDS+Evac_Tests/FDS_ScriptRuns/Evacuation): sh Check_Evacuation_MPI.sh

evac_smv_testcase1 using 3 processes DEBUG

And the script that runs the case:

#!/bin/bash
export OMP_NUM_THREADS=1
MPI_DIR=/share/apps/OpenMPI/openmpi-1.8.3_intel/bin
FDS_DIR=/home/rtetko/FDS_Source_Latest

echo evac_smv_testcase1 using 3 processes DEBUG 
${MPI_DIR}/mpirun -bind-to none -np 3 ${FDS_DIR}/fds_mpi_intel_linux_64_db evac_smv_testcase1.fds \
    > evac_smv_testcase1.err 2>&1

And tail evac_smv_testcase1.err:
 Time Step:    200,    Simulation Time:      9.96 s
 Time Step:    201,    Simulation Time:     10.01 s

STOP: FDS completed successfully (CHID: evac_smv_testcase1)

And the openmpi that I use:

NewSmokey: which mpirun
/share/apps/OpenMPI/openmpi-1.8.3_intel/bin/mpirun

NewSmokey: which mpifort
/share/apps/OpenMPI/openmpi-1.8.3_intel/bin/mpifort

NewSmokey: which ifort
/share/apps/intel/composer_xe_2013_sp1.0.080/composer_xe_2013_sp1.0.080/bin/intel64/ifort

And here is the version info that I ran:

Fire Dynamics Simulator

Compilation Date : Sat, 20 Dec 2014
Current Date     : December 22, 2014  10:47:32
Version          : FDS 6.1.2
SVN Revision No. : 21221

MPI Enabled; Number of MPI Processes:     3
OpenMP Enabled; Number of OpenMP Threads:   1

Job TITLE        : Test 1, fire+evac meshes
Job ID string    : evac_smv_testcase1

And the fds input file should be the same as in the svn folder:

Omakone (/cygdrive/C/D/FdsEvac/Fds+Evac5_SMV): diff /cygdrive/C/fds-smv/Verification/Evacuation/evac_smv_testcase1.fds /cygdrive/Y/FDS+Evac/FDS+Evac_Tests/FDS_ScriptRuns/Evacuation/evac_smv_testcase1.fds
Omakone (/cygdrive/C/D/FdsEvac/Fds+Evac5_SMV):

("Omakone" = my own machine, C-drive has the SVN version, Y-drive is
the Linux cluster disk, where I run the case)

Wbr,
Timo

Original issue reported on code.google.com by tkorhon1 on 2014-12-22 09:07:19

gforney commented 9 years ago
The case ran successfully with 1 MPI process and 1 OpenMP thread. Thanks.

Original issue reported on code.google.com by mcgratta on 2014-12-29 15:02:07

gforney commented 9 years ago
I must have made a mistake when running this case. There is still a problem. The problem
does not occur if you use 3 MPI processes. It occurs when you use 1. My comment in
#7 was wrong because if you run the case with 1 MPI process, these lines in read.f90
force EXTERIOR to be deallocated when MYID==EVAC_PROCESS. When you use only one MPI
process, MYID and EVAC_PROCESS will both be zero:

   ! EVAC_PROCESS needs only the allocations of SOLID, CELL_INDEX, OBST_INDEX_C, OBSTRUCTION, and I,J,K_CELL

   IF (.NOT.EVACUATION_ONLY(NM) .AND. MYID==EVAC_PROCESS) THEN
      DEALLOCATE(M%EXTERIOR)
      CYCLE MESH_LOOP_3
   ENDIF

We must assume that these cases can be run with 1, 2, 3,..., or NMESHES MPI processes.
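
One way to sidestep this, sketched here only as an illustration and not as the committed fix, would be to make the deallocation depend on there actually being more than one MPI process:

   ! Hypothetical sketch only: keep EXTERIOR when the evacuation process is
   ! the same process that also owns fire meshes (i.e., a 1-process run).
   IF (.NOT.EVACUATION_ONLY(NM) .AND. MYID==EVAC_PROCESS .AND. NUMPROCS>1) THEN
      DEALLOCATE(M%EXTERIOR)
      CYCLE MESH_LOOP_3
   ENDIF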

Original issue reported on code.google.com by mcgratta on 2014-12-31 16:39:01

gforney commented 9 years ago
Ok, I'll fix the logic for MPI runs. Until now I have assumed that
fire meshes use a different MPI process than the evacuation
meshes, and that the evacuation meshes are all in the same MPI process.
But, as I see, there can also be fire meshes in this MPI
process if MPI is true and one uses just one MPI process.

So, I'll check all the MYID==EVAC_PROCESS tests. Or I'll add a check
somewhere that there are no fire meshes in the EVAC_PROCESS and
stop the run with a message like "Evacuation process should not contain
any fire meshes". Or something similar.

Happy new year!
Timo

Original issue reported on code.google.com by tkorhon1 on 2015-01-02 07:59:06

gforney commented 9 years ago
Is there a reason that all Evac meshes share the same MPI process, which should be different
from the flow MPI process? I would prefer that the number of MPI processes be flexible.

Original issue reported on code.google.com by mcgratta on 2015-01-02 14:34:45

gforney commented 9 years ago
All evacuation meshes should be in the same MPI process because the evacuation
meshes need information from each other. But there is no reason why another
(i.e., fire) mesh cannot be in the same MPI process.
I have just assumed that there are no meshes other than evacuation meshes
in the "evacuation process". To correct this bad assumption, I only
have to fix some IF statements. Somewhere I check whether the PROCESS
is the evacuation process => skip fire-related meshes, and vice versa.
I need to put better logic here.

Well, if there were fire and evacuation meshes in the same process,
then I would assume that the number of MPI processes is equal to one.
And this means that no MPI calls are needed; all meshes see each other
directly (same process).

Well, the first remedy is to check that if MPI = true and this is a
fire+evac calculation, then there are at least two processes, and that
the evacuation process is different from the fire processes.
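
A bare-bones sketch of that interim check (note that the following comment asks instead that one MPI process remain allowed), assuming NO_EVACUATION and NUMPROCS are available here just as in the earlier read.f90 excerpt:

! Hypothetical sketch only: the interim "first remedy" described above.
IF (USE_MPI .AND. .NOT.NO_EVACUATION .AND. NUMPROCS<2) THEN
   WRITE(MESSAGE,'(A)') 'ERROR: A fire+evacuation MPI calculation needs at least 2 MPI processes'
   CALL SHUTDOWN(MESSAGE) ; RETURN
ENDIF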

WBR,Timo

Original issue reported on code.google.com by tkorhon1 on 2015-01-05 09:03:35

gforney commented 9 years ago
OK, just make sure that Evac cases can be run with only one MPI process, even if the
MPI version of FDS is used. 

Original issue reported on code.google.com by mcgratta on 2015-01-05 13:41:52

gforney commented 9 years ago
Issue 2349 has been merged into this issue.

Original issue reported on code.google.com by tkorhon1 on 2015-02-11 09:28:44

gforney commented 9 years ago
Issue 2331 has been merged into this issue.

Original issue reported on code.google.com by tkorhon1 on 2015-02-11 09:33:58

gforney commented 9 years ago
Kevin, would you check that the following bug fix does not
do any harm to the fire calculation:

   !IF (EXCHANGE_RADIATION) CALL MESH_EXCHANGE(2)
   IF (EXCHANGE_RADIATION) THEN
      CALL MESH_EXCHANGE(2)
   ELSE
      IF (RADIATION) CALL MPI_BARRIER(MPI_COMM_WORLD,IERR)
   END IF

This way all processes will always call MPI_BARRIER. The original
version called the barrier only if the radiation was updated at
this time step. The evacuation process does not know whether the fire meshes
update radiation or not, so I added the barrier in all cases.

If you do not like the above IF statement, the alternative would be:

   main-loop:
   CALL MESH_EXCHANGE(2)

mesh_exchange subroutine, somewhere at the beginning:
IF (CODE==2 .AND. .NOT. EXCHANGE_RADIATION) THEN
   CALL MPI_BARRIER(MPI_COMM_WORLD,IERR)
   RETURN
END IF

Here the problem is that the initialization part of main.f90 has:

! Iterate surface BCs and radiation in case temperatures are not initialized to ambient

DO I=1,NUMBER_INITIAL_ITERATIONS
   DO NM=1,NMESHES
      IF (PROCESS(NM)/=MYID) CYCLE
      CALL WALL_BC(T_BEGIN,NM)
      IF (RADIATION) CALL COMPUTE_RADIATION(T_BEGIN,NM)
   ENDDO
   IF (RADIATION) CALL MESH_EXCHANGE(2) ! Exchange radiation intensity at interpolated boundaries
ENDDO

And I should check that the logic works nicely here also. All processes
call MESH_EXCHANGE at this initialization phase, and RADIATION=T also
for the evacuation process (I think so). If EXCHANGE_RADIATION is
already set to true for the fire processes, then the above version
should work; otherwise something still needs to be done.

I have tested the first version of the bug fix. If you say that
this fix is good enough then I'll commit it.

Timo

Original issue reported on code.google.com by tkorhon1 on 2015-02-11 15:33:34

gforney commented 9 years ago
A call to MPI_BARRIER forces each MPI process to wait at this point until all processes
arrive. In general, we do not like to do this because it adds inefficiency.
If you need to do an MPI_BARRIER, then make it happen only when an Evac calculation
is being done. I don't want a non-Evac case to stop at a barrier if it doesn't have to.

Original issue reported on code.google.com by mcgratta on 2015-02-11 15:50:18

gforney commented 9 years ago
Ok, I'll add the MPI barrier only for the fire+evacuation meshes case.
And I'll hide it inside the MESH_EXCHANGE subroutine, so that
the main routine is easy to read.

Timo

Original issue reported on code.google.com by tkorhon1 on 2015-02-12 11:12:30

gforney commented 9 years ago
Things are fixed in SVN 21734.

At the beginning of MESH_EXCHANGE subroutine:

IF (CODE==2) THEN ! Check the need for the radiation exchange 
   IF (ANY(EVACUATION_ONLY) .AND. N_MPI_PROCESSES>1 .AND. RADIATION) THEN
      ! Set barrier for fire processes, because evacuation process has a barrier in every ICYC
      IF (.NOT.EXCHANGE_RADIATION .AND. MYID/=EVAC_PROCESS) THEN
         CALL MPI_BARRIER(MPI_COMM_WORLD,IERR)
         RETURN
      END IF
   ELSE ! No evacuation or just one process or no radiation
      IF (.NOT.EXCHANGE_RADIATION .OR. .NOT.RADIATION) RETURN
   END IF
END IF

And the calls "MESH_EXCHANGE(2)" in the main time loop and
initialization are just like

    CALL MESH_EXCHANGE(2) ! Exchange radiation intensity at interpolated boundaries

e.g., no "IF (RADIATION) CALL MESH_EXCHANGE(2)". Radiation/need
for rad. mesh exchange is tested at the start of the called
subprogramme.

Timo

Original issue reported on code.google.com by tkorhon1 on 2015-02-13 11:51:57

gforney commented 9 years ago
I added the test cases back to the verification list. Thanks.

Original issue reported on code.google.com by mcgratta on 2015-02-13 14:05:20