we'll take a look at it
Original comment by gfor...@gmail.com
on 24 Sep 2008 at 12:15
I have run the case past the point of failure on my 32 bit Linux cluster using version 5.2.1. I do not have a 64 bit Windows platform to test on. Try running the case using 5.2.1, and report whether you are using the executables that are distributed via the website, or if you have compiled your own version.
Original comment by mcgra...@gmail.com
on 30 Sep 2008 at 1:02
Hello all, I tried to run the case of Iker on our 64bit Linux machine (SES 10), with our self-compiled FDS version (Intel compiler):
Compilation Date : Wed, 13 Aug 2008
Version : 5.2.0 Parallel
SVN Revision No. : 2166
I got the following output:
Job TITLE : ATRIO_CENTRAL_2
Job ID string : ATRIO_CENTRAL_2
Time Step: 1, Simulation Time: 0.06 s
Time Step: 2, Simulation Time: 0.11 s
Time Step: 3, Simulation Time: 0.15 s
Time Step: 4, Simulation Time: 0.18 s
Time Step: 5, Simulation Time: 0.21 s
Time Step: 6, Simulation Time: 0.23 s
Time Step: 7, Simulation Time: 0.25 s
Time Step: 8, Simulation Time: 0.27 s
Time Step: 9, Simulation Time: 0.29 s
Time Step: 10, Simulation Time: 0.30 s
Time Step: 20, Simulation Time: 0.44 s
Time Step: 30, Simulation Time: 0.54 s
Time Step: 40, Simulation Time: 0.63 s
Time Step: 50, Simulation Time: 0.70 s
Time Step: 60, Simulation Time: 0.78 s
Time Step: 70, Simulation Time: 0.84 s
Time Step: 80, Simulation Time: 0.91 s
Time Step: 90, Simulation Time: 0.97 s
Time Step: 100, Simulation Time: 1.03 s
Time Step: 200, Simulation Time: 1.54 s
Time Step: 300, Simulation Time: 2.00 s
Time Step: 400, Simulation Time: 2.44 s
Time Step: 500, Simulation Time: 2.88 s
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
fds5_mpi_intel 000000000050939B Unknown Unknown Unknown
fds5_mpi_intel 0000000000506DC1 Unknown Unknown Unknown
fds5_mpi_intel 0000000000820CCB Unknown Unknown Unknown
fds5_mpi_intel 00000000004046E2 Unknown Unknown Unknown
libc.so.6 00002AAC8E76B154 Unknown Unknown Unknown
fds5_mpi_intel 0000000000404629 Unknown Unknown Unknown
rank 3 in job 2 CFD-Workstation_9015 caused collective abort of all ranks
exit status of rank 3: return code 174
Original comment by simon.f...@hbi.ch
on 1 Oct 2008 at 1:30
http://groups.google.com/group/fds-smv/browse_thread/thread/2b32dc7907d34b9
discusses difficulties compiling a 64 bit FDS executable. I suspect that the problems you are having might have to do with either the stack size or floating point underflows. Could you read the above discussion thread and tell me if anything helps. Also, use 5.2.1.
Original comment by mcgra...@gmail.com
on 1 Oct 2008 at 3:14
In the meanwhile we also ran the case of Iker on our Windows 64bit machine --> same error at the same simulation time as mentioned by Iker.
Since on Windows we use your precompiled version, we cannot play around as mentioned in
http://groups.google.com/group/fds-smv/browse_thread/thread/2b32dc7907d34b9
Furthermore, we cannot test version 5.2.1, as only a precompiled 5.1.6 executable is available.
On Linux I'm using the hints of the discussion mentioned by Kevin...
Original comment by simon.f...@hbi.ch
on 2 Oct 2008 at 9:03
I will ask Simo Hostikka to post 64 bit Windows executables. I cannot test 64 bit executables here at NIST.
Original comment by mcgra...@gmail.com
on 2 Oct 2008 at 12:24
I just posted an installer fds_5.2.1_win64.exe, containing
both serial and parallel executables.
Original comment by shost...@gmail.com
on 2 Oct 2008 at 1:23
Thanks, Simo. To the other contributors to this thread -- could you re-run your cases with the posted executables and report here if the problem persists or if you are successful.
Original comment by mcgra...@gmail.com
on 2 Oct 2008 at 2:05
Hello all,
we ran the case of Iker with the new executables from Simo, but the run crashed even earlier with the same error message!
Original comment by simon.f...@hbi.ch
on 2 Oct 2008 at 3:13
Simo, could you try running the case. I am concerned that the 64 bit version traps underflows, meaning that if a number is very, very small, the code fails rather than setting the number to zero. This is what we found for a Linux build.
Original comment by mcgra...@gmail.com
on 2 Oct 2008 at 3:21
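[Editorial note: to illustrate what "trapping underflows" means here — an IEEE double that drops below roughly 2.2e-308 leaves the normal range and becomes a denormal (subnormal) number. With gradual underflow the hardware quietly keeps such values; a build that instead traps the underflow exception aborts, which would match the crashes above. A minimal Python sketch of the number range in question (Python itself never traps; this only shows where the boundary sits):]

```python
import math
import sys

# smallest positive *normal* double (~2.2250738585072014e-308)
smallest_normal = sys.float_info.min

# 2**-1074 is the smallest positive *subnormal* (denormal) double
tiny = math.ldexp(1.0, -1074)

# gradual underflow: the value is below the normal range, yet still nonzero
assert 0.0 < tiny < smallest_normal

# one more halving finally rounds to zero
assert tiny / 2.0 == 0.0
```

On a build that traps the underflow exception, producing a value like `tiny` in the middle of a time step would abort the run instead of continuing with a harmless zero.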
Kevin, could you please comment on your statement about the problems on Linux systems, as we have problems not only on Windows.
I'm playing around with the compiler flags and the debug version. As compiler I use Intel 10.1.015.
Original comment by simon.f...@hbi.ch
on 3 Oct 2008 at 6:11
Hello group!!
I was trying to solve the problem; is it possible that the problem is caused by MPICH2? I have to say that I'm using PyroSim to create the model and then run the simulation.
I ran a similar model, which had also crashed, on another computer with WinXP 32bit, in both the serial and parallel versions, and the simulation was fine. Is it possible that the problem is caused by PyroSim?
Original comment by ikke...@gmail.com
on 3 Oct 2008 at 9:46
Hello,
I don't think it is a problem with PyroSim. We (Simon and I) had the same problem with other FDS files as well. In our opinion it is more likely a compilation problem of the 64-bit version.
Original comment by mattia.f...@gmail.com
on 3 Oct 2008 at 11:19
Here is new information on the error from my latest run; the example is not the same, but it is similar. I think it may be useful.
regards,
iker
Original comment by ikke...@gmail.com
on 3 Oct 2008 at 11:31
Attachments:
Hi,
I tested your case HALL.fds on our machine (Win 64bit / fds5_mpi_w64.exe posted yesterday by Simo). I also get a crash with a similar message (see attached).
Original comment by mattia.f...@gmail.com
on 3 Oct 2008 at 12:29
Attachments:
I do not believe the problem has to do with PyroSim. The only thing that PyroSim does is write the FDS input file. If FDS has a problem with the input file it should, and usually does, write out an ERROR message just at the start. If the calculation runs along for hundreds of time steps, it is no longer a PyroSim issue.
I also do not believe that this is an MPICH2 problem. If it were, the error would occur the first time information was passed. There are about 10 MPI data exchanges per time step, so I cannot imagine that MPI would suddenly fail after 5000 successful exchanges.
The error message just posted (error_4.txt) suggests that the error occurs within the radiation solver, as numbers are being passed into a subroutine. We noticed that the 64 bit compiler may be trapping underflows (numbers that are very, very small) rather than just converting them to zero, which is what happens when you compile with 32 bit. I cannot reproduce the error on my 32 bit Linux cluster.
The case HALL.fds was run with FDS 5.2.0. Try running with the latest 64 bit Windows executable (5.2.1).
Simo -- is there an option to NOT trap underflows? I'll look also.
Original comment by mcgra...@gmail.com
on 3 Oct 2008 at 12:35
crash_HALL.txt indicates a similar error, but in a different call to a different subroutine, in the radiation solver. I will try to run the case with full debugging and see if something becomes obvious.
Original comment by mcgra...@gmail.com
on 3 Oct 2008 at 12:43
Kevin - the ftz option for Intel Fortran seems to do this. And it looks like the default (trap underflow) is indeed different for 32bit and 64bit systems.
Simo
Original comment by shost...@gmail.com
on 3 Oct 2008 at 12:59
Wow -- you need a PhD in logic and rhetoric to understand this. But I think it is the cause of the problem.
-ftz  Flushes denormal results to zero when the application is in the gradual underflow mode. It may improve performance if the denormal values are not critical to the behavior of your application. The default is -no-ftz on systems using IA-64 architecture; -ftz on systems using IA-32 architecture and systems using Intel(R) 64 architecture.
The following options set the -ftz option: -fpe0, -fpe1, and on systems using IA-64 architecture, option -O3. On systems using IA-64 architecture, option -O2 sets the -no-ftz option. On systems using IA-32 architecture and systems using Intel(R) 64 architecture, every optimization option -O level, except -O0, sets -ftz.
Note: Option -ftz is a performance option. Setting it does not guarantee that all denormals in a program are flushed to zero. It only causes denormals generated at run time to be flushed to zero.
Original comment by mcgra...@gmail.com
on 3 Oct 2008 at 1:43
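[Editorial note: in other words, -ftz makes the FPU replace any denormal result with zero instead of keeping it (or trapping on it). The rule can be modeled in a few lines of Python — a software sketch only, since real FTZ happens in the FPU control register, not in code like this:]

```python
import sys

def flush_to_zero(x: float) -> float:
    # model of FTZ: any nonzero value below the smallest normal double
    # (i.e. a denormal) is replaced by 0.0; everything else passes through
    if x != 0.0 and abs(x) < sys.float_info.min:
        return 0.0
    return x

# a denormal produced by gradual underflow...
denormal = sys.float_info.min / 4.0
assert denormal > 0.0                    # kept under gradual underflow
assert flush_to_zero(denormal) == 0.0    # flushed to zero under FTZ
assert flush_to_zero(1.5) == 1.5         # normal values are untouched
```

This matches the note in the manual text above: FTZ only affects denormals generated at run time; normal values pass through unchanged.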
OK, I just posted a new installer for win64 with option /Qftz.
The file has test status.
Could you please try it and report here. Thanks.
Original comment by shost...@gmail.com
on 3 Oct 2008 at 1:58
Hello all,
As posted above, I have problems with the Linux 64 bit version. I just recompiled using the -ftz option.
The case HALL.fds crashed with the appended error message!
Simon
Original comment by simon.f...@hbi.ch
on 3 Oct 2008 at 2:15
Attachments:
I will test the new win 64-bit version and let you know (it will probably take until Monday).
Original comment by mattia.f...@gmail.com
on 3 Oct 2008 at 2:26
http://groups.google.com/group/fds-smv/browse_thread/thread/02b32dc7907d34b9#
Meanwhile, I introduced the additional compiler options /Qftz and /fpe3 for the 64 bit Windows case.
FYI: After testing several input files, I had a crash with an 8-mesh geometry with more than 4 million unknowns in one single mesh (23.5 million in total). Increasing the 'Stack Reserve Size' and 'Stack Commit Size' to 65536000 under properties/linker/system in Visual Studio seems to solve the problem.
Original comment by mcgra...@gmail.com
on 3 Oct 2008 at 2:28
By the way: in the meanwhile, another case of mine which used to crash is still running with an FDS executable generated in debug mode. The case is running longer than ever, but is terribly slow.
So it really seems that the compiler options are causing this problem... Maybe I will have time to check the initialization of arrays mentioned in:
http://groups.google.com/group/fds-smv/browse_thread/thread/02b32dc7907d34b9#
Let's see if that could change something too.
Original comment by simon.f...@hbi.ch
on 3 Oct 2008 at 2:50
I got the same error (see file crash.txt) also with the new test version.
Original comment by mattia.f...@gmail.com
on 3 Oct 2008 at 3:19
Attachments:
Btw, the stack size in the windows exe was 100,000,000. Should it be more?
Simo
Original comment by shost...@gmail.com
on 3 Oct 2008 at 4:11
Simo -- I do not know what the max stack size is for 64 bit. I think 100 M should be enough, but let's keep open the possibility.
If you have a chance on Monday, could you try these calcs. The discussion is getting confused because of all the options and versions.
Original comment by mcgra...@gmail.com
on 3 Oct 2008 at 5:35
In the meantime I also tested several other cases, and they crashed giving the same error.
May I ask what the state of the compiled 64-bit Windows version is, since I have some simulations to run within the next weeks. Is there any new compiled version to test?
Or alternatively, given that the 32bit version doesn't have this problem, is there any method to get it to run on a 64-bit Windows machine? I tried to install the one from the download page, but the MPI version doesn't work (the serial version, however, does).
Mattia
Original comment by mattia.f...@gmail.com
on 8 Oct 2008 at 6:38
I compiled the win64 MPI exe with both options /fpe:1 and /Qftz.
I still got the same error in the HALL.fds case on 64bit Windows XP and 64bit MPICH2. Right now, I don't even have an idea what to try next.
Mattia - on my 64bit XP computer, I can use the 32bit fds5_mpi.exe.
Simo
Original comment by shost...@gmail.com
on 8 Oct 2008 at 9:10
Interesting.
I assume that you use the 32bit fds5_mpi.exe with the 64bit MPICH2. I think I have some problem letting MPICH2 communicate with the 32bit fds5_mpi.exe version (since the serial 32bit fds5.exe works fine).
Did you have to set some particular configuration in order to let MPICH2 communicate with the 32bit fds version instead of the 64bit?
Do you run mpiexec with the -file option?
Could you post the command you use?
Thank you
Mattia
PS: I am not sure that this is the correct place/issue for this discussion; in case, let me know
Original comment by mattia.f...@gmail.com
on 8 Oct 2008 at 10:03
Yes, the 32bit fds5_mpi.exe with the 64bit MPICH2. The config.txt file looks like:
exe \\espkt4m019\rtesho\fds5_mpi.exe HALL.fds
dir \\espkt4m019\rtesho\Issue_474
hosts
espkt4m019 4
That is, no special settings.
Original comment by shost...@gmail.com
on 8 Oct 2008 at 10:26
Mattia -- could you try running your test case with a smaller MESH size (that is, not as many cells). I am not sure whether this problem is related to stack size or floating point exceptions. Also, could you post the error message when the calculation fails so that we can look at the line of code that is causing the problem. Maybe Simo has already done this -- is the line of code that fails still a subroutine call in radi.f90?
Original comment by mcgra...@gmail.com
on 8 Oct 2008 at 12:11
The case I ran was the HALL.fds by Iker. Yes, the error took place in radi.f90, at
CALL GET_KAPPA(Z_VECTOR,Y_SUM(I,J,K),KAPPA_1,TYY,IBND)
I can't check whether the reason is really floating point exceptions or a coding bug, because the case is so huge. If someone gets a similar error in a small case, please post here.
Original comment by shost...@gmail.com
on 8 Oct 2008 at 12:34
Simon:
When I try to use the 32bit version on 64bit Windows, I get the warning that 'fmpich2.dll' is not found. The funny thing is that this .dll doesn't exist on 32bit Windows either, where the 32bit fds5_mpi version works fine. Any idea?
Kevin:
I will reduce the number of cells in the model that crashes and let it run. However, I don't think that the number of cells is the problem. I successfully ran the same model with only a different boundary condition. The bad case has an additional pressure acting on an opening. Just adding this small constraint causes the numerical problem.
Original comment by mattia.f...@gmail.com
on 8 Oct 2008 at 1:43
Wow. I must have been using mpiexec from the 32bit MPICH2. The file (fmpich2.dll) was under the 64bit version. But the win64 MPI exe was linked against the 64bit MPICH2 files. Still, the 32bit mpiexec was able to run it.
Very confusing.
Simo
Original comment by shost...@gmail.com
on 8 Oct 2008 at 2:03
I am turning this case over to Simo. I have no way to test 64 bit Windows apps, but I will monitor the conversation and perhaps notice some change in coding that might help.
Original comment by mcgra...@gmail.com
on 8 Oct 2008 at 2:35
Kevin:
I tested the reduced case and it crashed again (the error message is in the attached still_crash.txt). Yes, the failing line is still the subroutine call in radi.f90. The new case had 1'782'912 cells instead of the 2'309'080 of the original case. The reduction took place in each of the 36 meshes.
Simo:
I am a little bit confused. Did you install the 32bit MPICH2 on your 64bit Windows machine? Did it work?
Mattia
Original comment by mattia.f...@gmail.com
on 8 Oct 2008 at 2:52
Attachments:
Glenn Forney tells me that we do have one 64 bit Windows PC. Can you reduce the case to something I can run on a single machine with maybe two meshes?
Original comment by mcgra...@gmail.com
on 8 Oct 2008 at 3:01
I have a reduced case which also fails on 64bit Windows and not on 32bit Windows. However, the error message is slightly different, i.e. the problem is not in the radi.f90 subroutine but in func.f90 and divg.f90.
I didn't manage to reduce it below 8 meshes. Try this case; if it is still too big, I would make a very simple new model that recreates the same problem, but it would take some time.
Original comment by mattia.f...@gmail.com
on 8 Oct 2008 at 4:15
Attachments:
Kevin, did you manage to run the case on your PC, or is it still too big?
Original comment by mattia.f...@gmail.com
on 10 Oct 2008 at 7:01
We haven't tried it, but it looks too big. Our machine only has 4 GB. We'll try it anyway to see.
Original comment by mcgra...@gmail.com
on 10 Oct 2008 at 12:31
Hello all,
As I got a similar problem to Mattia's, but on Linux 64 bit instead of Windows 64, I played around with the compiler options.
As an executable created in debug mode ran longer (simulated time) than any other build, I tried a combination of the "normal" compiler options and the "debug" compiler options:
intel_linux_mpi_64 : FFLAGS = -O3 -axPTSW -unroll -static -ipo -xPTSW -fpe0 -ftz -auto -fltconsistency
intel_linux_mpi_64 : CFLAGS = -O3 -Dpp_noappend
intel_linux_mpi_64 : FCOMPL = /opt/mpich2/bin/mpif90
intel_linux_mpi_64 : CCOMPL = /opt/mpich2/bin/mpicc
intel_linux_mpi_64 : obj = fds5_mpi_intel64
intel_linux_mpi_64 : setup $(obj_mpi)
	$(FCOMPL) $(FFLAGS) -o $(obj) $(obj_mpi)
Until now, the case test_8M.fds from Mattia is still running using this executable and has reached time step 1100... (Adding -auto -fltconsistency seems to make the difference.) So far I cannot tell whether the case is going to finish, nor how much the current compiler options affect the speed of the calculation.
Simo: Could you try these options on Windows and post the resulting executable for testing? Thanks!
Original comment by simon.f...@hbi.ch
on 13 Oct 2008 at 11:54
From the ifort man pages...
-fltconsistency
Enables improved floating-point consistency. Floating-point operations are not reordered and the result of each floating-point operation is stored in the target variable rather than being kept in the floating-point processor for use in a subsequent calculation. This is the same as specifying -mp or -mieee-fp.
The default, -nofltconsistency, provides better accuracy and run-time performance at the expense of less consistent floating-point results.
I do not understand how better accuracy is achieved at the expense of less consistent floating-point results.
Original comment by mcgra...@gmail.com
on 13 Oct 2008 at 4:45
Kevin,
I read the ifort manual and interpret it as follows:
When using the default option for consistency (nothing specified, or -nofltconsistency), the compiler can alter the code, e.g. divisions are changed to multiplications by the reciprocal value. In this way accuracy can be improved, but consistency is degraded.
The manual also indicates that the fp-model options are of better use than -fltconsistency. So one should experiment with this to see any effect on performance, as "-fltconsistency [...] enables improved floating-point consistency and may slightly reduce execution speed".
I'm not yet sure if this option really makes the difference...
But by the way, the case of Mattia has now reached 14'400 time steps and a total of 220 s (the case starts at -60 s). So I'm pretty optimistic that the case will finish correctly...
Original comment by simon.f...@hbi.ch
on 14 Oct 2008 at 6:12
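[Editorial note: Simon's reading — that the optimizer may turn a division into a multiplication by the rounded reciprocal — is easy to demonstrate numerically, and it is exactly the kind of inconsistency -fltconsistency suppresses. In IEEE arithmetic x/x is exactly 1.0, but the "optimized" form x*(1.0/x) rounds the reciprocal first and can miss by one ulp. A small Python check (Python always performs the true division; the second expression mimics what the optimizer would emit):]

```python
# divisors for which multiplying by the rounded reciprocal 1.0/x
# does not reproduce the exact division x/x == 1.0
differing = [x for x in range(1, 100) if x * (1.0 / x) != 1.0]

# the two formulations disagree for some divisors (49 is the classic case),
# so replacing a division by a reciprocal multiply changes the bits
assert len(differing) > 0
assert all(x / x == 1.0 for x in range(1, 100))  # the plain division is exact
```

This is why two builds of the same source can diverge: one executes the division, the other the reciprocal multiply, and the low-order bits differ.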
I just uploaded a new test executable for parallel 64bit Windows (SVN 2485). No serial version is included.
The compiler options were /Qunroll /fpe:0 /Qftz /automatic /fltconsistency
I can't use -ipo on Windows, because the object files get huge and the linker never gets its job done.
Original comment by shost...@gmail.com
on 14 Oct 2008 at 7:23
The 64bit Windows version also works better than the previous one. A case that crashed after about 500 time steps (30 s) is now still running after 1500 time steps (45 s).
Mattia
Original comment by mattia.f...@gmail.com
on 14 Oct 2008 at 10:00
I ran multiple cases using my "new" executable and none of them crashed with an error like before. Therefore I would say the problem is somehow solved, even though the underlying defect has not been identified yet.
It really looks like the -fltconsistency option makes the difference. I recompiled FDS using different alternatives (e.g. -fp-model XXX or -mp1) as indicated by the manual, but none of these options returned a working executable.
In addition, I got working executables no matter whether -fpe0 or -fpe3 is used.
Simon
Original comment by simon.f...@hbi.ch
on 16 Oct 2008 at 9:35
Original issue reported on code.google.com by ikke...@gmail.com
on 24 Sep 2008 at 8:55
Attachments: