firemodels / fds

Fire Dynamics Simulator
https://pages.nist.gov/fds-smv/

Problems with resuming cluster simulations #1577

Closed gforney closed 9 years ago

gforney commented 9 years ago
Application Version: 5.5.3
SVN Revision Number: 7031
Compile Date: 29 October 2010
Operating System: Windows 7 / Windows Vista Business

Describe details of the issue below:

We have some problems with distributed parallel simulations. Sometimes our simulations
just stop during the night, and we then need to resume the simulation the following
day. The crash should not be due to numerical instability, since the CFL number is
below 1.

In the setup, a restart file is written every 30 s of simulated time, so we should have
the information needed to resume the cluster simulation, but that is not the case.
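For reference, the relevant lines in our input file look roughly like this (a sketch
using the parameter names from the FDS User's Guide; RESTART=.TRUE. is only added on
the MISC line when we resume):

&DUMP DT_RESTART=30. /     Write a restart file every 30 s of simulated time
&MISC RESTART=.TRUE. /     Added before re-launching so FDS reads the existing restart files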

We get two different errors when we want to resume the simulation.

The first error is shown below:
Starting FDS: \\garbo\afd-681\MEDARB\mht\Aktive\runFDS\BP5_05\fds5_mpi_win64.exe...

Process   4 of  11 is running on WS20159.ramboll.ramboll-group.global.network
Process   7 of  11 is running on WS20159.ramboll.ramboll-group.global.network
Process   5 of  11 is running on WS20159.ramboll.ramboll-group.global.network
Process   3 of  11 is running on WS20159.ramboll.ramboll-group.global.network
Process   1 of  11 is running on WS20159.ramboll.ramboll-group.global.network
Process   0 of  11 is running on WS20159.ramboll.ramboll-group.global.network
Process   2 of  11 is running on WS20159.ramboll.ramboll-group.global.network
Process   6 of  11 is running on WS20159.ramboll.ramboll-group.global.network
Process  10 of  11 is running on W12195X64.ramboll.ramboll-group.global.network
Process   8 of  11 is running on W12195X64.ramboll.ramboll-group.global.network
Process   9 of  11 is running on W12195X64.ramboll.ramboll-group.global.network
Process  11 of  11 is running on W12195X64.ramboll.ramboll-group.global.network
Mesh   1 is assigned to Process   0
Mesh   2 is assigned to Process   1
Mesh   3 is assigned to Process   2
Mesh   4 is assigned to Process   3
Mesh   5 is assigned to Process   4
Mesh   6 is assigned to Process   5
Mesh   7 is assigned to Process   6
Mesh   8 is assigned to Process   7
Mesh   9 is assigned to Process   8
Mesh  10 is assigned to Process   9
Mesh  11 is assigned to Process  10
Mesh  12 is assigned to Process  11
Mesh  13 is assigned to Process  11
Mesh  14 is assigned to Process  11
Mesh  15 is assigned to Process  11
Mesh  16 is assigned to Process  11
Mesh  17 is assigned to Process  11
forrtl: severe (67): input statement requires too much data, unit 76, file \\garbo\afd-681\MEDARB\mht\Aktive\runFDS\BP5_05\GalleriA_BP5_05_0016.restart

Image              PC                Routine            Line        Source        

fds5_mpi_win64.ex  0000000140589BE8  Unknown               Unknown  Unknown
fds5_mpi_win64.ex  0000000140584CC9  Unknown               Unknown  Unknown
fds5_mpi_win64.ex  000000014053CA3D  Unknown               Unknown  Unknown
fds5_mpi_win64.ex  000000014051AE17  Unknown               Unknown  Unknown
fds5_mpi_win64.ex  000000014051A6E1  Unknown               Unknown  Unknown
fds5_mpi_win64.ex  0000000140507D4A  Unknown               Unknown  Unknown
fds5_mpi_win64.ex  0000000140505C59  Unknown               Unknown  Unknown
fds5_mpi_win64.ex  00000001402D6DCE  Unknown               Unknown  Unknown
fds5_mpi_win64.ex  00000001404A5853  Unknown               Unknown  Unknown
fds5_mpi_win64.ex  000000014059338C  Unknown               Unknown  Unknown
fds5_mpi_win64.ex  000000014056C055  Unknown               Unknown  Unknown
kernel32.dll       00000000771DBE3D  Unknown               Unknown  Unknown
ntdll.dll          0000000077316861  Unknown               Unknown  Unknown

job aborted:
rank: node: exit code[: error message]
0: ws20159: 123
1: ws20159: 123
2: ws20159: 123
3: ws20159: 123
4: ws20159: 123
5: ws20159: 123
6: ws20159: 123
7: ws20159: 123
8: w12195x64: 123
9: w12195x64: 123
10: w12195x64: 123
11: w12195x64: 67: process 11 exited without calling finalize

It should be possible to continue the simulation, since we write a restart file every
30 s of simulated time.

With the second error we can restart the simulation, but after two or three time steps
it crashes due to numerical instability. We have tried adjusting the CFL criteria to
CFL_MIN = 0.1 and CFL_MAX = 0.6, but it still crashes due to numerical instability.
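
The CFL bounds were set on the TIME line roughly as follows (a sketch; T_END is a
placeholder value, and only the CFL_MIN and CFL_MAX values are the ones quoted above):

&TIME T_END=3600., CFL_MIN=0.1, CFL_MAX=0.6 /     Tighter CFL bounds than the defaults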

Are you aware of anyone else having these problems, and of how to fix them? I don't know
if it is something in the FDS code.

The host machine is running Windows 7 and the slaves are running Vista Business. Please
let me know if you want the log file or anything else.

Original issue reported on code.google.com by togersen on 2012-03-28 07:14:25

gforney commented 9 years ago
We've noticed that restarts on Windows machines often fail due to the size of the restart
file. There is no ready fix for this. My advice is not to do restarts. Windows is not
the best platform for dedicated parallel processing; I believe there are too many extraneous
processes that can interrupt parallel jobs. We use Linux clusters at NIST and VTT.

Original issue reported on code.google.com by mcgratta on 2012-03-28 11:50:41