Yinan-Scott-Shi / fds-smv

Automatically exported from code.google.com/p/fds-smv

Access Violations on 64 bit Windows #474

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Please complete the following lines...

Application Version: 5.2.0 parallel
SVN Revision Number: 2102
Compile Date: 9/19/2008
Operating System: Windows Vista Business 64-bit

Describe details of the issue below:

I don't know why the program produces this error. I think the FDS file is OK, but it doesn't complete the calculation. This error only happens when I run the parallel version.

regards,
Iker 

Original issue reported on code.google.com by ikke...@gmail.com on 24 Sep 2008 at 8:55

Attachments:

GoogleCodeExporter commented 9 years ago
We'll take a look at it.

Original comment by gfor...@gmail.com on 24 Sep 2008 at 12:15

GoogleCodeExporter commented 9 years ago
I have run the case past the point of failure on my 32-bit Linux cluster using version 5.2.1. I do not have a 64-bit Windows platform to test on. Try running the case using 5.2.1, and report whether you are using the executables that are distributed via the website or have compiled your own version.

Original comment by mcgra...@gmail.com on 30 Sep 2008 at 1:02

GoogleCodeExporter commented 9 years ago
Hello all, I tried to run Iker's case on our 64-bit Linux machine (SES 10) with our self-compiled FDS version (Intel compiler):

Compilation Date : Mit, 13 Aug 2008
Version          : 5.2.0 Parallel
SVN Revision No. : 2166

I got the following output:

 Job TITLE        : ATRIO_CENTRAL_2
 Job ID string    : ATRIO_CENTRAL_2

 Time Step:      1,    Simulation Time:      0.06 s
 Time Step:      2,    Simulation Time:      0.11 s
 Time Step:      3,    Simulation Time:      0.15 s
 Time Step:      4,    Simulation Time:      0.18 s
 Time Step:      5,    Simulation Time:      0.21 s
 Time Step:      6,    Simulation Time:      0.23 s
 Time Step:      7,    Simulation Time:      0.25 s
 Time Step:      8,    Simulation Time:      0.27 s
 Time Step:      9,    Simulation Time:      0.29 s
 Time Step:     10,    Simulation Time:      0.30 s
 Time Step:     20,    Simulation Time:      0.44 s
 Time Step:     30,    Simulation Time:      0.54 s
 Time Step:     40,    Simulation Time:      0.63 s
 Time Step:     50,    Simulation Time:      0.70 s
 Time Step:     60,    Simulation Time:      0.78 s
 Time Step:     70,    Simulation Time:      0.84 s
 Time Step:     80,    Simulation Time:      0.91 s
 Time Step:     90,    Simulation Time:      0.97 s
 Time Step:    100,    Simulation Time:      1.03 s
 Time Step:    200,    Simulation Time:      1.54 s
 Time Step:    300,    Simulation Time:      2.00 s
 Time Step:    400,    Simulation Time:      2.44 s
 Time Step:    500,    Simulation Time:      2.88 s
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
fds5_mpi_intel     000000000050939B  Unknown               Unknown  Unknown
fds5_mpi_intel     0000000000506DC1  Unknown               Unknown  Unknown
fds5_mpi_intel     0000000000820CCB  Unknown               Unknown  Unknown
fds5_mpi_intel     00000000004046E2  Unknown               Unknown  Unknown
libc.so.6          00002AAC8E76B154  Unknown               Unknown  Unknown
fds5_mpi_intel     0000000000404629  Unknown               Unknown  Unknown
rank 3 in job 2  CFD-Workstation_9015   caused collective abort of all ranks
  exit status of rank 3: return code 174

Original comment by simon.f...@hbi.ch on 1 Oct 2008 at 1:30

GoogleCodeExporter commented 9 years ago
http://groups.google.com/group/fds-smv/browse_thread/thread/2b32dc7907d34b9

discusses difficulties compiling a 64-bit FDS executable. I suspect that the problems you are having have to do with either the stack size or floating-point underflows. Could you read the above discussion thread and tell me if anything helps? Also, use 5.2.1.

Original comment by mcgra...@gmail.com on 1 Oct 2008 at 3:14

GoogleCodeExporter commented 9 years ago
In the meantime we also ran Iker's case on our 64-bit Windows machine, with the same error at the same simulation time that Iker reported.

Because we use your precompiled version on Windows, we cannot experiment with the build as described in
http://groups.google.com/group/fds-smv/browse_thread/thread/2b32dc7907d34b9

Furthermore, we cannot test version 5.2.1, as only a precompiled 5.1.6 executable is available.

On Linux I am following the hints from the discussion Kevin mentioned...

Original comment by simon.f...@hbi.ch on 2 Oct 2008 at 9:03

GoogleCodeExporter commented 9 years ago
I will ask Simo Hostikka to post 64-bit Windows executables. I cannot test 64-bit executables here at NIST.

Original comment by mcgra...@gmail.com on 2 Oct 2008 at 12:24

GoogleCodeExporter commented 9 years ago
I just posted an installer fds_5.2.1_win64.exe, containing
both serial and parallel executables.

Original comment by shost...@gmail.com on 2 Oct 2008 at 1:23

GoogleCodeExporter commented 9 years ago
Thanks, Simo. To the other contributors to this thread -- could you re-run your cases with the posted executables and report here whether the problem persists or you are successful?

Original comment by mcgra...@gmail.com on 2 Oct 2008 at 2:05

GoogleCodeExporter commented 9 years ago
Hello all

We ran Iker's case with the new executables from Simo, but the run crashed even earlier, with the same error message!

Original comment by simon.f...@hbi.ch on 2 Oct 2008 at 3:13

GoogleCodeExporter commented 9 years ago
Simo, could you try running the case? I am concerned that the 64-bit version traps underflows, meaning that if a number is very, very small, the code fails rather than setting the number to zero. This is what we found for a Linux build.
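
The difference between trapping an underflow and flushing it to zero can be sketched with Python's standard `decimal` module. This is only an analogy: `decimal` is software arithmetic, not the hardware floating point that the Fortran compiler controls, and the deliberately narrow exponent range and the helper name `square_tiny` are assumptions made for illustration.

```python
import decimal


def square_tiny(trap_underflow):
    """Multiply 1E-10 by itself in a context whose smallest exponent is -10.

    The exact product, 1E-20, is too small for this context, so it underflows.
    """
    traps = [decimal.Underflow] if trap_underflow else []
    # Hypothetical, deliberately narrow number format for the demonstration.
    ctx = decimal.Context(prec=9, Emin=-10, Emax=10, traps=traps)
    tiny = decimal.Decimal("1E-10")
    try:
        return ctx.multiply(tiny, tiny)
    except decimal.Underflow:
        # Trapping: the operation is aborted instead of returning a value.
        return "trapped"


print(square_tiny(trap_underflow=True))
print(square_tiny(trap_underflow=False))
```

With the trap enabled the operation is aborted; without it the product is quietly replaced by a zero. The second behaviour is what the comment above expects from a build that does not trap underflows.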

Original comment by mcgra...@gmail.com on 2 Oct 2008 at 3:21

GoogleCodeExporter commented 9 years ago
Kevin, could you please elaborate on your statement about the problems on Linux systems? We have problems not only on Windows.

I am experimenting with the compiler flags and the debug version. The compiler I use is Intel 10.1.015.

Original comment by simon.f...@hbi.ch on 3 Oct 2008 at 6:11

GoogleCodeExporter commented 9 years ago
Hello group!!
I have been trying to solve the problem. Is it possible that it is caused by MPICH2?
I should mention that I use PyroSim to create the model and then run the simulation.
I ran a similar model, which also crashed here, on another computer with Windows XP 32-bit, in both the serial and parallel versions, and the simulation was fine. Is it possible that the problem is caused by PyroSim?

Original comment by ikke...@gmail.com on 3 Oct 2008 at 9:46

GoogleCodeExporter commented 9 years ago
Hello,

I don't think it is a problem with PyroSim. We (Simon and I) had the same problem with other FDS files as well. In our opinion it is more likely a compilation problem of the 64-bit version.

Original comment by mattia.f...@gmail.com on 3 Oct 2008 at 11:19

GoogleCodeExporter commented 9 years ago
Here is some new information on the error from my last compilation. The example is not the same, but it is similar; I think it may be useful.
Regards,
Iker

Original comment by ikke...@gmail.com on 3 Oct 2008 at 11:31

Attachments:

GoogleCodeExporter commented 9 years ago
Hi,

I tested your case HALL.fds on our machine (Win 64-bit / fds5_mpi_w64.exe posted yesterday by Simo). I also get a crash with a similar message (see attached).

Original comment by mattia.f...@gmail.com on 3 Oct 2008 at 12:29

Attachments:

GoogleCodeExporter commented 9 years ago
I do not believe the problem has to do with PyroSim. The only thing that PyroSim does is write the FDS input file. If FDS has a problem with the input file it should, and usually does, write out an ERROR message just at the start. If the calculation runs along for hundreds of time steps, it is no longer a PyroSim issue. I also do not believe that this is an MPICH2 problem. If it were, the error would occur the first time information was passed. There are about 10 MPI data exchanges per time step, so I cannot imagine that MPI would suddenly fail after 5000 successful exchanges.

The error message just posted (error_4.txt) suggests that the error occurs within the radiation solver, as numbers are being passed into a subroutine. We noticed that the 64-bit compiler may be trapping underflows (numbers that are very, very small) rather than just converting them to zero, which is what happens when you compile with 32 bit. I cannot reproduce the error on my 32-bit Linux cluster.

The case HALL.fds was run with FDS 5.2.0. Try running with the latest 64-bit Windows executable (5.2.1).

Simo -- is there an option to NOT trap underflows? I'll look also. 

Original comment by mcgra...@gmail.com on 3 Oct 2008 at 12:35

GoogleCodeExporter commented 9 years ago
crash_HALL.txt indicates a similar error, but in a different call to a different subroutine, in the radiation solver. I will try to run the case with full debugging and see if something becomes obvious.

Original comment by mcgra...@gmail.com on 3 Oct 2008 at 12:43

GoogleCodeExporter commented 9 years ago
Kevin - the ftz option for Intel Fortran seems to do this. And it looks like the default (trap underflow) is indeed different for 32-bit and 64-bit systems.

Simo

Original comment by shost...@gmail.com on 3 Oct 2008 at 12:59

GoogleCodeExporter commented 9 years ago
Wow -- you need a PhD in logic and rhetoric to understand this. But I think it is the cause of the problem.

-ftz   Flushes denormal results to zero when the application is in the gradual underflow mode. It may improve performance if the denormal values are not critical to the behavior of your application. The default is -no-ftz on systems using IA-64 architecture; -ftz on systems using IA-32 architecture and systems using Intel(R) 64 architecture.

The following options set the -ftz option: -fpe0, -fpe1, and on systems using IA-64 architecture, option -O3. On systems using IA-64 architecture, option -O2 sets the -no-ftz option. On systems using IA-32 architecture and systems using Intel(R) 64 architecture, every optimization option -O level, except -O0, sets -ftz.

Note: Option -ftz is a performance option. Setting it does not guarantee that all denormals in a program are flushed to zero. It only causes denormals generated at run time to be flushed to zero.
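
For readers unfamiliar with the term, a "denormal" (subnormal) number is a nonzero value smaller than the smallest normal floating-point number. The sketch below shows such values with ordinary Python floats, which are IEEE 754 doubles on common platforms. Python has no equivalent of the -ftz switch, so `flush_to_zero` is a hypothetical helper that only imitates the policy in software.

```python
import sys

smallest_normal = sys.float_info.min   # smallest positive normal double
subnormal = smallest_normal / 2.0      # gradual underflow: still nonzero
smallest_subnormal = 5e-324            # smallest positive subnormal double


def flush_to_zero(x):
    """Imitate a flush-to-zero policy: replace subnormal values by 0.0."""
    return 0.0 if 0.0 < abs(x) < sys.float_info.min else x


print(subnormal > 0.0)             # gradual underflow keeps the tiny value
print(flush_to_zero(subnormal))    # the flushed variant discards it
print(smallest_subnormal / 2.0)    # below the subnormal range: zero
```

Halving the smallest normal number still gives a nonzero result under gradual underflow, whereas a flush-to-zero policy would return zero at that point.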

Original comment by mcgra...@gmail.com on 3 Oct 2008 at 1:43

GoogleCodeExporter commented 9 years ago
OK, I just posted a new installer for win64 built with the option /Qftz.
The file has test status.
Could you please try it and report here? Thanks.

Original comment by shost...@gmail.com on 3 Oct 2008 at 1:58

GoogleCodeExporter commented 9 years ago
Hello all

As posted above, I have problems with the Linux 64-bit version. I just recompiled using the -ftz option.

The case Hall.fds crashed with the appended error message!

Simon

Original comment by simon.f...@hbi.ch on 3 Oct 2008 at 2:15

Attachments:

GoogleCodeExporter commented 9 years ago
I will test the new Windows 64-bit version and let you know (it will probably take until Monday).

Original comment by mattia.f...@gmail.com on 3 Oct 2008 at 2:26

GoogleCodeExporter commented 9 years ago
http://groups.google.com/group/fds-smv/browse_thread/thread/02b32dc7907d34b9#

Meanwhile, I introduced the additional compiler options /Qftz and /fpe3 for the 64-bit Windows case.

FYI: After testing several input files I had a crash with an 8-mesh geometry with more than 4 million unknowns in one single mesh (23.5 million in total). Increasing the 'Stack Reserve Size' and 'Stack Commit Size' to 65536000 in properties/linker/system under Visual Studio seems to solve the problem.

Original comment by mcgra...@gmail.com on 3 Oct 2008 at 2:28

GoogleCodeExporter commented 9 years ago
By the way: in the meantime, another case of mine which used to crash is still running with an FDS executable built in debug mode. The case is running longer than ever but is terribly slow.

So this suggests that the compiler options really are causing the problem... Maybe I will have time to check the initialization of arrays mentioned in:
http://groups.google.com/group/fds-smv/browse_thread/thread/02b32dc7907d34b9#

Let's see if that changes something too.

Original comment by simon.f...@hbi.ch on 3 Oct 2008 at 2:50

GoogleCodeExporter commented 9 years ago
I got the same error (see file crash.txt) also with the new test version.

Original comment by mattia.f...@gmail.com on 3 Oct 2008 at 3:19

Attachments:

GoogleCodeExporter commented 9 years ago
Btw, the stack size in the windows exe was 100,000,000. Should it be more?

Simo

Original comment by shost...@gmail.com on 3 Oct 2008 at 4:11

GoogleCodeExporter commented 9 years ago
Simo -- I do not know what the max stack size is for 64 bit. I think 100 M should be enough, but let's keep open the possibility.
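
A rough way to judge whether a stack size is "enough" is to compare it with the size of one mesh-sized temporary array. The sketch below does this for the two stack sizes quoted in this thread. It assumes 8-byte reals, one value per unknown, and that such arrays are placed on the stack; none of this is confirmed here, and `array_bytes` and `arrays_that_fit` are hypothetical helpers.

```python
BYTES_PER_REAL = 8  # assumption: double precision values


def array_bytes(n_values, bytes_per_value=BYTES_PER_REAL):
    """Size in bytes of one array holding n_values entries."""
    return n_values * bytes_per_value


def arrays_that_fit(stack_bytes, n_values):
    """Number of such arrays that fit into a stack of the given size."""
    return stack_bytes // array_bytes(n_values)


mesh_unknowns = 4_000_000  # "more than 4 million unknowns in one single mesh"
stack_sizes = {
    "Visual Studio setting": 65_536_000,
    "posted Windows exe": 100_000_000,
}

for label, size in stack_sizes.items():
    print(label, arrays_that_fit(size, mesh_unknowns))
```

Under these assumptions only a handful of mesh-sized temporaries fit in either stack, which is why the stack size remains a plausible suspect for very large meshes.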

If you have a chance on Monday, could you try these calcs? The discussion is getting confused because of all the options and versions.

Original comment by mcgra...@gmail.com on 3 Oct 2008 at 5:35

GoogleCodeExporter commented 9 years ago
In the meantime I also tested several other cases, and they crashed with the same error.
May I ask what the status of the compiled 64-bit Windows version is, since I have some simulations to run within the next few weeks? Is there a new compiled version to test?

Alternatively, given that the 32-bit version doesn't have this problem, is there any way to get it to run on a 64-bit Windows machine? I tried to install the one from the download page, but the MPI version doesn't work (the serial version does, however).

Mattia

Original comment by mattia.f...@gmail.com on 8 Oct 2008 at 6:38

GoogleCodeExporter commented 9 years ago
I compiled the win64 MPI exe with both options /fpe:1 and /Qftz.
I still got the same error in the HALL.fds case on 64-bit Windows XP and 64-bit MPICH2. Right now, I don't even have an idea of what to try next.

Mattia - on my 64-bit XP computer, I can use the 32-bit fds5_mpi.exe.

Simo

Original comment by shost...@gmail.com on 8 Oct 2008 at 9:10

GoogleCodeExporter commented 9 years ago
Interesting.
I assume that you use the 32-bit fds_mpi.exe with the 64-bit MPICH2. I think I have some problem getting MPICH2 to communicate with the 32-bit fds5_mpi.exe version (since the serial 32-bit fds5.exe works fine).
Did you have to set any particular configuration to let MPICH2 communicate with the 32-bit FDS version instead of the 64-bit one?
Do you run mpiexec with the -file option?
Could you post the command you use?

Thank You
Mattia

PS: I am not sure that this is the correct place/issue for this discussion; if it is not, let me know.

Original comment by mattia.f...@gmail.com on 8 Oct 2008 at 10:03

GoogleCodeExporter commented 9 years ago
Yes, 32bit fds_mpi.exe for 64bit MPICH2. The config.txt file looks like

exe \\espkt4m019\rtesho\fds5_mpi.exe HALL.fds
dir \\espkt4m019\rtesho\Issue_474
hosts
espkt4m019 4

That is, no special settings.

Original comment by shost...@gmail.com on 8 Oct 2008 at 10:26

GoogleCodeExporter commented 9 years ago
Mattia -- could you try running your test case with a smaller MESH size (that is, not as many cells)? I am not sure whether this problem is related to stack size or floating-point exceptions. Also, could you post the error message when the calculation fails so that we can look at the line of code that is causing the problem? Maybe Simo has already done this -- is the line of code that fails still a subroutine call in radi.f90?

Original comment by mcgra...@gmail.com on 8 Oct 2008 at 12:11

GoogleCodeExporter commented 9 years ago
The case I ran was HALL.fds by Iker. Yes, the error took place in radi.f90, at

  CALL GET_KAPPA(Z_VECTOR,Y_SUM(I,J,K),KAPPA_1,TYY,IBND) 

I can't check whether the cause is really floating-point exceptions or a coding bug, because the case is so huge. If someone gets a similar error in a small case, please post here.

Original comment by shost...@gmail.com on 8 Oct 2008 at 12:34

GoogleCodeExporter commented 9 years ago
Simo:
When I try to use the 32-bit version on the 64-bit Windows machine, I get a warning that 'fmpich2.dll' is not found. The funny thing is that this .dll does not exist on the 32-bit Windows machine either, where the 32-bit fds5_mpi version works fine. Any idea?

Kevin:
I will reduce the number of cells in the model that crashes and let it run. However, I don't think that the number of cells is the problem. I successfully ran the same model with only a different boundary condition. The bad case has an additional pressure acting on an opening. Just adding this small constraint causes the numerical problem.

Original comment by mattia.f...@gmail.com on 8 Oct 2008 at 1:43

GoogleCodeExporter commented 9 years ago
Wow. I must have been using mpiexec from the 32-bit MPICH2. The file (fmpich2.dll) was under the 64-bit version. But the win64 MPI executable was linked against the 64-bit MPICH2 files. Still, the 32-bit mpiexec was able to run it.

Very confusing.

Simo

Original comment by shost...@gmail.com on 8 Oct 2008 at 2:03

GoogleCodeExporter commented 9 years ago
I am turning this case over to Simo. I have no way to test 64-bit Windows apps, but I will monitor the conversation and perhaps notice some change in coding that might help.

Original comment by mcgra...@gmail.com on 8 Oct 2008 at 2:35

GoogleCodeExporter commented 9 years ago
Kevin:
I tested the reduced case and it crashed again (the error message is in the attached still_crash.txt). Yes, what fails is still the subroutine call in radi.f90.
The new case had 1'782'912 cells instead of the 2'309'080 of the original case. The reduction took place in each of the 36 meshes.

Simo:
I am a little bit confused. Did you install the 32-bit MPICH2 on your 64-bit Windows machine? Did it work?

Mattia

Original comment by mattia.f...@gmail.com on 8 Oct 2008 at 2:52

Attachments:

GoogleCodeExporter commented 9 years ago
Glenn Forney tells me that we do have one 64-bit Windows PC. Can you reduce the case to something I can run on a single machine with maybe two meshes?

Original comment by mcgra...@gmail.com on 8 Oct 2008 at 3:01

GoogleCodeExporter commented 9 years ago
I have a reduced case which also fails on 64-bit Windows and not on 32-bit Windows. However, the error message is slightly different, i.e. the problem is not in the radi.f90 subroutine but in func.f90 and divg.f90.
I didn't manage to reduce it below 8 meshes. Try this case; if it is still too big I will make a very simple new model which recreates the same problem, but that would take some time.

Original comment by mattia.f...@gmail.com on 8 Oct 2008 at 4:15

Attachments:

GoogleCodeExporter commented 9 years ago
Kevin, did you manage to run the case on your PC, or is it still too big?

Original comment by mattia.f...@gmail.com on 10 Oct 2008 at 7:01

GoogleCodeExporter commented 9 years ago
We haven't tried it, but it looks too big. Our machine only has 4 GB. We'll try it anyway to see.

Original comment by mcgra...@gmail.com on 10 Oct 2008 at 12:31

GoogleCodeExporter commented 9 years ago
Hello all

As I have a problem similar to Mattia's, but on 64-bit Linux instead of 64-bit Windows, I experimented with the compiler options.

Because an executable built in debug mode ran longer (in simulated time) than any other, I tried a combination of the "normal" compiler options and the "debug" compiler options:

intel_linux_mpi_64 : FFLAGS = -O3 -axPTSW -unroll -static -ipo -xPTSW -fpe0 -ftz -auto -fltconsistency
intel_linux_mpi_64 : CFLAGS = -O3 -Dpp_noappend
intel_linux_mpi_64 : FCOMPL = /opt/mpich2/bin/mpif90
intel_linux_mpi_64 : CCOMPL = /opt/mpich2/bin/mpicc
intel_linux_mpi_64 : obj = fds5_mpi_intel64
intel_linux_mpi_64 : setup $(obj_mpi)
    $(FCOMPL) $(FFLAGS) -o $(obj) $(obj_mpi)

So far, the case test_8M.fds from Mattia is still running with this executable and has reached time step 1100. (Adding -auto -fltconsistency seems to make the difference.) I cannot yet tell whether the case is going to finish, or how much the current compiler options affect the speed of the calculation.

Simo: Could you try these options on Windows and post the resulting executable for testing? Thanks!

Original comment by simon.f...@hbi.ch on 13 Oct 2008 at 11:54

GoogleCodeExporter commented 9 years ago
From the ifort man pages...

-fltconsistency

Enables improved floating-point consistency. Floating-point operations are not reordered and the result of each floating-point operation is stored in the target variable rather than being kept in the floating-point processor for use in a subsequent calculation. This is the same as specifying -mp or -mieee-fp.

The default, -nofltconsistency, provides better accuracy and run-time performance at the expense of less consistent floating-point results.

I do not understand how better accuracy is achieved at the expense of less consistent floating-point results.
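
One way to read the "consistency" part of the excerpt: floating-point addition is not associative, so a compiler that is free to reorder operations can change the last digits of a result. The following is a minimal sketch with plain Python floats; it illustrates the arithmetic only and says nothing about what ifort actually does with any particular expression.

```python
# The same three numbers, added with two different groupings.
left_first = (0.1 + 0.2) + 0.3
right_first = 0.1 + (0.2 + 0.3)

print(left_first == right_first)      # the two groupings disagree
print(abs(left_first - right_first))  # but only in the last bits
```

Both results are valid roundings of the same sum, so neither is "wrong"; they are just not identical, and small differences of this kind can grow over many time steps.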

Original comment by mcgra...@gmail.com on 13 Oct 2008 at 4:45

GoogleCodeExporter commented 9 years ago
Kevin,

I looked at the ifort manual and interpret it as follows:

With the default consistency setting (nothing specified, or -nofltconsistency), the compiler may alter the code, for example by changing divisions into multiplications by the reciprocal value. In this way accuracy can be improved, but consistency is degraded.
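
The reciprocal rewrite mentioned here is easy to try out. The sketch below compares a true division with a multiplication by the reciprocal using plain Python floats; the function names are hypothetical, and whether the compiler applies this transformation to any given line of FDS is not established in this thread.

```python
def divide(x, d):
    """Reference result: a single correctly rounded division."""
    return x / d


def multiply_by_reciprocal(x, d):
    """Rewritten form: the reciprocal is rounded first, then the product."""
    return x * (1.0 / d)


# Divisors d in 1..99 for which d/d and d*(1/d) give different results.
mismatches = [
    d for d in range(1, 100)
    if divide(float(d), float(d)) != multiply_by_reciprocal(float(d), float(d))
]

print(mismatches)
```

For example, 49.0 / 49.0 is exactly 1.0, while 49.0 * (1.0 / 49.0) is not, because the reciprocal is rounded before the multiplication.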

The manual also indicates that the -fp-model option is preferable to -fltconsistency. So one should experiment with it to see whether performance improves, since the manual says: "-fltconsistency [...] This option enables improved floating-point consistency and may slightly reduce execution speed"

I am not yet sure if this option really makes the difference....

But by the way, Mattia's case has now reached 14'400 time steps and a total of 220 s (the case starts at -60 s). So I'm pretty optimistic that the case will finish correctly...

Original comment by simon.f...@hbi.ch on 14 Oct 2008 at 6:12

GoogleCodeExporter commented 9 years ago
I just uploaded a new test executable for parallel 64bit windows (SVN 2485).
No serial version included.

The compiler options were /Qunroll /fpe:0 /Qftz /automatic /fltconsistency

I can't use -ipo on Windows, because the object files get huge and the linker never gets its job done.

Original comment by shost...@gmail.com on 14 Oct 2008 at 7:23

GoogleCodeExporter commented 9 years ago
The 64-bit Windows version also works better than the previous one. A case that crashed after about 500 time steps (30 s) is now still running after 1500 time steps (45 s).

Mattia

Original comment by mattia.f...@gmail.com on 14 Oct 2008 at 10:00

GoogleCodeExporter commented 9 years ago
I ran multiple cases using my "new" executable and none of them crashed with an error like before. Therefore I would say the problem is somehow solved, even though the underlying defect has not been identified yet.

It really looks like the -fltconsistency option makes the difference. I recompiled FDS using different alternatives (e.g. -fp-model XXX or -mp1) as indicated by the manual, but none of these options produced a working executable.

In addition, I got working executables regardless of whether -fpe0 or -fpe3 is used.

Simon

Original comment by simon.f...@hbi.ch on 16 Oct 2008 at 9:35