firemodels / fds

Fire Dynamics Simulator
https://pages.nist.gov/fds-smv/

Restart Timestep #2057

Closed gforney closed 9 years ago

gforney commented 9 years ago
Please complete the following lines...

FDS Version: 6.0.0
SVN Revision Number:
Compile Date:
Smokeview Version/Revision:
Operating System:

Describe details of the issue below:

I am having problems with an inconsistent time step. I am attempting to verify the results of a simulation by comparing two runs: one with a restart halfway through, and one without. The run with a restart changes the time step shortly after the restart occurs. This results in one extra time step in the restarted simulation, which makes it impossible for its results to match the other run exactly. Is this normal behavior?

Thanks

More details: this issue doesn't seem to occur solely as a function of restarts. Attached are an input file and a graph of my results.

Scenario:
1) input file is simulated for 2 seconds with no restarts
2) input file is simulated for 2 seconds with one restart at 1 sec
3) input file is simulated for 2 seconds with five restarts, each one occurring 0.4 seconds after the previous one; the first occurs at 0.4 sec

The temperatures from the device in the center of the square plate above the burner
are graphed.

See attached files. 
input file: apcli-script.txt

The differences between the restarted and non-restarted runs start to occur after the sharp increase in temperature, around one second. Not sure what is going on here. Using FDS 6 x64, single threaded.
Version: FDS 6.0.0; MPI Disabled; OpenMP Disabled
SVN Revision Number: 17279
Compile Date: Sun, 03 Nov 2013

Thanks

Original issue reported on code.google.com by randy.mcdermott on 2014-02-20 21:27:39


gforney commented 9 years ago
Any info about this issue?

Thanks

Original issue reported on code.google.com by errold32 on 2014-03-04 18:51:58

gforney commented 9 years ago
Have not had time to address it yet.

Original issue reported on code.google.com by randy.mcdermott on 2014-03-04 19:05:12

gforney commented 9 years ago
OK. Thanks.

Original issue reported on code.google.com by errold32 on 2014-03-04 19:10:09

gforney commented 9 years ago
There is no guarantee that a restart will continue the calculation in exactly the same
way as if no restart occurs. There are subtle changes in, for example, the radiation
calculation that might affect things a tiny bit. I do not see your results as surprising.
Have you noticed larger changes when restarting?

Original issue reported on code.google.com by mcgratta on 2014-03-05 22:28:15

gforney commented 9 years ago
Larger than what I have shown? No. The time step is also different after
the restart.

Original issue reported on code.google.com by errold32 on 2014-03-05 22:31:40

gforney commented 9 years ago
How do you know that the time step is different?

Original issue reported on code.google.com by mcgratta on 2014-03-06 18:01:30

gforney commented 9 years ago
I know the time step is different from a few sources; the only one I have to show you is the devc .csv files from the runs. In the *comparison.csv, I have subtracted the time outputs of the two runs, one with and one without a restart. You can see that shortly after the restart, at T=1 s, the output times no longer match up. You will also notice that there is an extra "time, temp" entry for the run with a restart. This indicates to me that there was an extra time step, and thus differing time steps between the runs. This behaviour was also reflected in the terminal output during the runs; the times printed there were different, as far as I can remember. I didn't save that ...
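For reference, a minimal sketch of the comparison described above; the file names are placeholders, and FDS *_devc.csv files are assumed to carry two header rows (units, then column names).

```python
# Hedged sketch: subtract the Time columns of the devc CSV files from the
# restarted and unrestarted runs. File names are hypothetical, and two
# header rows (units, then column names) are assumed.
import numpy as np

base    = np.genfromtxt("no_restart_devc.csv",  delimiter=",", skip_header=2)
restart = np.genfromtxt("one_restart_devc.csv", delimiter=",", skip_header=2)

n  = min(len(base), len(restart))
dt = restart[:n, 0] - base[:n, 0]            # difference of the output times
print("rows (no restart / restart):", len(base), len(restart))
print("max |time difference|:", np.abs(dt).max())
```

An extra row in the restarted file shows up as differing row counts, and the nonzero time differences after T=1 s correspond to the mismatch described above.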

Original issue reported on code.google.com by errold32 on 2014-03-11 20:42:44

gforney commented 9 years ago
I was concerned that the time step of the first step of the restarted case was different from the time step that was saved. I think what you are telling me is that the time steps in the two cases are not the same after a few time steps, which is not surprising because the restart process slightly changes the numerical procedure.

I will mark this case as "Won't fix". Just keep in mind that restarts are not exactly
the same as an unstopped calculation. If you find a case where the difference is noticeable,
we will reopen the case.

Original issue reported on code.google.com by mcgratta on 2014-03-11 22:06:27

gforney commented 9 years ago

Dear all,

as we have a maximum run time of 24 h on our HPC systems, we rely heavily on the restart functionality of FDS. In some of our FDS cases numerical instabilities are triggered by a restart and therefore prevent a continuous FDS simulation.

As already stated in the post above, the changes in the results might become significant (or lead to numerical instabilities). Although I cannot reproduce the instabilities that occur in our more complex simulations, the following test setup might ease the debugging process. The attached PDF files illustrate the different solutions and a simple benchmark setup. Obviously one might argue that the whole system is chaotic and that small errors in the restart procedure will therefore always lead to different results; that is "fine", it is the triggered instabilities that bother us.

Setup: 
- bench2 case, from the shipped verification set
- two meshes
- measured is the volume flow in the plume at 2.7m height
- multiple simulations, each having a __single__ restart at 4s (8s, 12s, ...); i.e.
each simulation is restarted once, e.g. after 4s, and then run up to the final simulation
time

Attached:
- PDF of the volume flow for each simulation
- PDF of the difference in volume flow w.r.t. the "unrestarted" case, linear scale
- PDF of the difference in volume flow w.r.t. the "unrestarted" case, logarithmic scale
- tar file with the automatic benchmark script, FDS input file template and analysis

How to run (Linux/OSX systems only):
- untar restart_test.tgz
- adapt the FDS executable, number of restarts, and total run time in run_restart_test.sh
- you can change the mesh size in bench2.template to get a shorter runtime
- just execute the bash script (run_restart_test.sh)
- note: the analysis needs python+numpy+matplotlib+scipy (a rough sketch of that kind of analysis is shown below)
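For orientation, here is a minimal sketch of the kind of difference analysis described above; the file names, restart times, and device column index are assumptions, and the authoritative script is the one shipped in restart_test.tgz.

```python
# Hedged sketch of the analysis: plot the difference in plume volume flow
# between each restarted run and the unrestarted reference, on a log scale.
import numpy as np
import matplotlib.pyplot as plt

ref = np.genfromtxt("bench2_no_restart_devc.csv", delimiter=",", skip_header=2)

for t_restart in (4, 8, 12, 16):
    run = np.genfromtxt(f"bench2_restart_{t_restart}s_devc.csv",
                        delimiter=",", skip_header=2)
    # interpolate the restarted run onto the reference time base;
    # column 1 is assumed to hold the volume-flow device at 2.7 m
    q = np.interp(ref[:, 0], run[:, 0], run[:, 1])
    plt.semilogy(ref[:, 0], np.abs(q - ref[:, 1]), label=f"restart at {t_restart} s")

plt.xlabel("time [s]")
plt.ylabel("|volume flow difference|")
plt.legend()
plt.savefig("restart_differences.pdf")
```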

FDS version, compiled with GCC 4.8.1, no MPI:

Compilation Date : Wed, 20 Aug 2014
Current Date     : September 18, 2014  16:45:59
Version          : FDS 6.1.1
SVN Revision No. : 20286

Best,
Lukas

Original issue reported on code.google.com by hpc.on.fire on 2014-09-19 13:39:31


gforney commented 9 years ago
Lukas,

Kevin said he will take a look at this.  But, as you know, it is difficult to sort
these things out entirely unless we have a case that fails quickly.

In the interim, I am curious, have you ever tried adding PROJECTION=.TRUE. on MISC?
 One of the things that might happen when you restart is that in loading the velocity
field you have some imperfection in your divergence field.  This is even true if you
sample an analytical solution to incompressible Navier-Stokes at staggered locations
and then take the discrete divergence.  It will not be perfectly divergence free. 
The solution to this is to use a projection scheme to project the velocity field onto
a divergence free space.

The default scheme in FDS is not a "true projection".  It is mathematically equivalent
to a projection provided the initial divergence field is zero.  If it is not, the divergence
error is retained, propagates, and does who knows what.  To alleviate this problem
you can set the flag mentioned above.  Then a true projection is used on every time
step and the new velocity field will perfectly match the thermodynamic divergence.
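For what it is worth, the effect of such a projection can be illustrated with a generic numpy sketch on a doubly periodic grid; this is not the FDS discretization or solver, only the idea of removing the divergent part of a velocity field.

```python
# Generic illustration (not FDS code): Helmholtz projection of a 2-D
# periodic velocity field onto its divergence-free part, done in Fourier
# space by subtracting the gradient component.
import numpy as np

n, L = 64, 1.0
k = 2.0 * np.pi * np.fft.fftfreq(n, d=L / n)
kx, ky = np.meshgrid(k, k, indexing="ij")
k2 = kx**2 + ky**2
k2[0, 0] = 1.0                          # avoid dividing by zero for the mean mode

rng = np.random.default_rng(0)
u, v = rng.standard_normal((2, n, n))   # some velocity field with nonzero divergence

uh, vh = np.fft.fft2(u), np.fft.fft2(v)
k_dot_u = kx * uh + ky * vh
uh -= kx * k_dot_u / k2                 # subtract the gradient (divergent) part
vh -= ky * k_dot_u / k2
u_df, v_df = np.fft.ifft2(uh).real, np.fft.ifft2(vh).real

div = np.fft.ifft2(1j * (kx * np.fft.fft2(u_df) + ky * np.fft.fft2(v_df))).real
print("max |divergence| after projection:", np.abs(div).max())   # ~ machine epsilon
```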

R

Original issue reported on code.google.com by randy.mcdermott on 2014-09-19 15:18:33

gforney commented 9 years ago
I took a look at this issue. Randy is right -- we need a case that fails. We would need
to overhaul FDS in order to get a restarted case to be EXACTLY like a non-restart.
We have seen recently that even the exact same case with different OpenMP threads can
diverge slightly after a long run time. Same for a restart. Consider that both the
radiation and pyrolysis routines are not executed every time step. Radiation is executed
every third time step, at which time one-fifth of the solid angles are updated. Pyrolysis
occurs every 2 time steps, by default. Jobs are stopped and started in the middle
of these cycles, and it would be very difficult to restart at the exact stage in the
process. Instead, we just initialize all solid and gas phase cells to their last value
before the restart and go from there. This is not going to preserve the exact numerics.

But let's see if we can look at a case that fails. That would help us pin down whatever
is lacking in the restart.

Original issue reported on code.google.com by mcgratta on 2014-10-14 16:55:13

gforney commented 9 years ago
Dear Kevin & Randy,

I just had time to take a closer look at the restart issue. These are my current findings (FDS revision 20794).

First, I agree with you that a "failing" run would be good. And I can understand that the radiation and pyrolysis steps are not executed every time step. However, our cases run for more than 20 hours and break down after a restart. This is not really practical.

Anyway, I just ran the above setup (bench2.fds) and compared the data in the first step after a restart with the data from the run without a restart. So far, I have found two issues:

1) Computing the scalar face values in MASS_FINITE_DIFFERENCES @ mass.f90, starting at line 64, does not seem to take a restart into account, i.e. the face values must be computed again. Enforcing this helps to preserve exact data.

2) UVW_SAVE is not stored to disk, but it is used in calculations after a restart (in MASS_FINITE_DIFFERENCES @ mass.f90). Adding it to the restart data helps to preserve exact data.

Having made both changes allows me to get exactly the same numbers in both runs during the predictor step. However, there are probably more issues like that (use of uninitialised values), as the numbers still start to diverge in the corrector step.

If anyone has an idea which data might be causing issues like the ones above, let me know; that would speed up the debugging. Otherwise I will have a closer look at the corrector; I am convinced that in this simple case it will reproduce the exact numbers.

By the way, initialising numbers / arrays with nontrivial values (ideally large random numbers) quickly shows you where data is accessed before it has been set to meaningful values.
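A minimal numpy sketch of that trick (the helper name is made up for illustration):

```python
# Hedged sketch: pre-fill work arrays with a poison value (NaN or a huge
# number) at allocation, so any use-before-set contaminates the result
# and is easy to spot.
import numpy as np

def allocate_poisoned(shape, poison=np.nan):
    """Allocate a work array pre-filled with a poison value instead of zeros."""
    return np.full(shape, poison)

work = allocate_poisoned((16, 16))
# ... code that is supposed to fill `work` before reading it ...
result = work.sum()
if not np.isfinite(result):
    print("use-before-set detected: the poison value leaked into the result")
```

In the Fortran source the equivalent would be initialising arrays with something like HUGE(1.0) or a signalling NaN (gfortran offers -finit-real=snan for this).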

So far,
Lukas

Original issue reported on code.google.com by hpc.on.fire on 2014-11-19 17:50:23

gforney commented 9 years ago
I added UVW_SAVE to the restart file. 

Randy -- could you look at Number 1 in Comment 12.

Original issue reported on code.google.com by mcgratta on 2014-11-19 19:01:45

gforney commented 9 years ago
I took care of #1 in Comment 12. Randy please check.

Original issue reported on code.google.com by mcgratta on 2014-11-19 20:02:44

gforney commented 9 years ago
Thanks for the quick src code update.

I have figured out which other fields are used after a restart, but are not consistent:

US, VS, WS
DS, HS, 
D_REACTION
U_GHOST, V_GHOST, W_GHOST
DS_CORR

My current hack is to dump these fields, in addition to UVW_SAVE. However, I have seen that some of them might be (re-)computed / corrected; just dumping them might be more efficient here, as the size of the restart files is not an issue.

Current status:
a) Having applied this fix -- and turned off radiation -- I do not see (measure) any difference between restarted simulations. See the boring graph attached.
b) However, with radiation on, I still see differences, though much smaller ones; see the attachments. This might be related to the stepping of the radiation.
c) This will probably / hopefully solve our crashes.

Note: If you are interested in the places where the above-mentioned fields are causing issues, let me know. Otherwise we keep it simple.

Best,
Lukas

Original issue reported on code.google.com by hpc.on.fire on 2014-11-20 23:35:13


gforney commented 9 years ago
I will check these variables. BTW, variables like US mean "u star"; that is, they are
calculated during the first (predictor) stage of the time step. US is a first order
accurate estimate of the u component of velocity at the next time step. During the
second half (corrector) stage of the time step, U and US are used to calculate U at
the next time step. So the "S" variables are intermediate values that should not need
to be saved. At least, that is the way it was originally, but it is possible that now
somehow they are used at the next time step. I'll check.
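To make the role of an intermediate "star" value concrete, here is a generic two-stage predictor-corrector step (Heun's method) in Python; it is not the FDS update, just the structure being described.

```python
# Generic predictor-corrector step for du/dt = f(t, u), not FDS code.
def heun_step(f, t, u, dt):
    u_star = u + dt * f(t, u)                              # predictor: first-order estimate
    return u + 0.5 * dt * (f(t, u) + f(t + dt, u_star))    # corrector combines u and u_star

# example: integrate du/dt = -u from u(0) = 1 over one unit of time
u, t, dt = 1.0, 0.0, 0.1
for _ in range(10):
    u = heun_step(lambda t, u: -u, t, u, dt)
    t += dt
print(t, u)   # close to exp(-1) = 0.3679
```

In a self-contained scheme like this, u_star is rebuilt from scratch every step and would never need to go into a restart file; it only matters if, as suspected here, the intermediate values also feed the next time step.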

Original issue reported on code.google.com by mcgratta on 2014-11-21 13:50:13

gforney commented 9 years ago
Yes, I see now how these "S" variables do play a role in the next time step and do need
to be saved. I added all the variables in Comment #15 to the restart variable list.
SVN 21038.

Original issue reported on code.google.com by mcgratta on 2014-11-21 14:49:04

gforney commented 9 years ago
Kevin,

thanks again for the repository update. 

There is still the issue with radiation. Assuming the different stepping of radiation and pyrolysis is the central problem, it could be solved by stopping FDS only when all processes are synchronous, i.e. after a full cycle. It would essentially be a clean shutdown in preparation for the following restart.

Would that be an option? If yes, shall I post you a patch proposal?

Best,
Lukas

Original issue reported on code.google.com by hpc.on.fire on 2014-11-21 20:00:55

gforney commented 9 years ago
I just created a flag called RADIATION_COMPLETED. Only when it is .TRUE. will a restart
file get dumped. SVN 21043.

Original issue reported on code.google.com by mcgratta on 2014-11-21 21:19:08

gforney commented 9 years ago
Kevin,

the patches applied above seem to work for us, i.e. we do not see crashes in the long-running simulations. At least so far.

However, the inclusion of RADIATION_COMPLETED is more complicated for us. We rely on the fact that a restart dump is written as soon as a stop file is detected -- the stop file gets created when the reserved wall clock time (max 24 h) is approached. FDS then stops immediately, and in general the radiation solver is not finished yet, i.e. RADIATION_COMPLETED is false.

Making the setting of *STOP_STATUS to USER_STOP depend on both the existence of the stop file AND the completion of the radiation solver is not trivial in the current "stop-management" logic: RADIATION_COMPLETED is set to TRUE at the end of a time step, whereas the check for the stop file is done at the beginning. This is inconsistent.
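A schematic (not FDS code, just an assumed loop structure) of that ordering problem:

```python
# Hedged sketch: the stop check at the top of the loop sees the
# radiation-completed flag from the *previous* step, because the flag
# is only updated at the bottom of the loop.
def time_loop(n_steps, stop_requested, radiation_cycle=15):
    radiation_completed = False
    for step in range(1, n_steps + 1):
        # beginning of step: user-stop check
        if stop_requested(step) and radiation_completed:
            print(f"clean stop: restart file dumped before step {step}")
            return
        # ... predictor / corrector / partial radiation update ...
        # end of step: flag reflects the state after this step
        radiation_completed = (step % radiation_cycle == 0)
    print("finished all steps without a clean stop")

# example: a stop file appears at step 20; the clean stop only happens at
# the first step whose predecessor completed a radiation cycle (step 31)
time_loop(60, lambda step: step >= 20)
```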

Do you want me to propose a patch?

Best,
Lukas

Original issue reported on code.google.com by hpc.on.fire on 2014-12-05 10:33:34

gforney commented 9 years ago
I'll take a look, but if you have an idea, let me know.

Original issue reported on code.google.com by mcgratta on 2014-12-05 14:33:50

gforney commented 9 years ago
The STOP_STATUS logic has become too confusing. I am going to clean it up. It should
be done early next week. I will make sure that the RADIATION_COMPLETED logic works.

Original issue reported on code.google.com by mcgratta on 2014-12-05 22:32:28

gforney commented 9 years ago
Kevin,

thanks for your effort. It would probably make more sense to have a general execution status rather than being tied to just one status, the stop condition. This might make the logic less rigid.

Best,
Lukas

Original issue reported on code.google.com by hpc.on.fire on 2014-12-07 20:25:49

gforney commented 9 years ago
I committed code that simplifies STOP_STATUS (SVN 21150). Now there is only the single global parameter (which is exchanged via MPI in a single subroutine). The radiation routine should complete its full cycle now. Let me know if it works.

Original issue reported on code.google.com by mcgratta on 2014-12-09 22:50:12

gforney commented 9 years ago
Kevin, 

thanks again. A first quick test worked well!

Best,
Lukas

Original issue reported on code.google.com by hpc.on.fire on 2014-12-09 23:04:52

gforney commented 9 years ago
I'll keep the issue open until we ensure things are working properly. I would guess
that there is going to be some rare instance when two things go wrong at the same time,
but we'll handle that later.

One thing to note -- the radiation routine fully updates every 15 time steps. 1/5 of
the angles are updated once out of every 3 time steps. I expected shutdowns to occur
at time step n*15, but they actually occur at time step 13 + 15*n. 

Original issue reported on code.google.com by mcgratta on 2014-12-10 13:24:55

gforney commented 9 years ago
I will close this issue. If something goes wrong with the new stop status, open up a
new issue.

Original issue reported on code.google.com by mcgratta on 2015-02-06 18:38:28