Closed by GoogleCodeExporter.
Agreed. I think that file formatting is the only issue. I do not think that there is any particular limitation as far as FDS is concerned. I'll make the change.
Original comment by mcgra...@gmail.com
on 17 Mar 2008 at 4:14
I committed some changes that should allow you to run between 100 and 999 meshes. I am still using somewhat old-fashioned Fortran write statements. I am now able to run a 128 mesh case, and I think I fixed all the formatting problems. I will mark this as Fixed for now, but let me know if you still have problems. We will probably still suffer some growing pains.

BTW, we are currently testing a new pressure correction scheme that requires more MPI exchanges per time step. Unfortunately, we cannot determine whether some of the strange scaling results we are seeing are due to our Linux cluster or to our MPI programming. Would it be possible to do a few tests on your machine? Last time we did this exercise, your machine was perfect in that the CPUs were all almost 100% busy during the whole run. Our machines have the same problem as Dave McGill's cluster in Canada: it always seems that one of the processors on a multi-processor machine is more active/efficient than the others.
Original comment by mcgra...@gmail.com
on 17 Mar 2008 at 5:06
Thanks for the changes.

We can do some tests on the machine I use, if that will help you. I think you should create some simple examples and post them (or send them to my email address), and then I will start them on the machine. When you create the test cases, please use, if possible, only 32, 64, 96, 128, ... (increments of 32) meshes, since that is best for this machine. Each node consists of 32 cores with shared memory, so a case with a multiple of 32 meshes is optimal here.
Original comment by crog...@gmail.com
on 18 Mar 2008 at 10:43
OK, we will put together a suite of test cases with meshes of 32, 64 and 128, and post them to this Tracker when they are ready. I will keep this marked Fixed, but Open, so we can use it to exchange info.
Original comment by mcgra...@gmail.com
on 24 Mar 2008 at 12:02
Christian,
I have created four input files with 4, 16, 64, and 256 meshes. Sorry, I just noticed that you requested increments of 32. I can create more if needed, but hopefully the 64 and 256 cases fulfill the requirement. You should be able to retrieve the files from the link below. Let me know if you have any problems. The cases should each run for about 100 time steps. Could you please post the .out files for each run?
Thanks!
Randy
http://groups.google.com/group/fds-smv/web/Bernardo_Trails.zip
Original comment by randy.mc...@gmail.com
on 24 Mar 2008 at 10:44
Randy,
I started all the files, but the machine has some problems with the memory allocation, so I had no success. I will contact the support people next week to see if they have an explanation for these problems. The problem occurs with all files (4, 16, 64 and 256). I have attached the error-message file from the 256 mesh case so you can see what messages I got.

I will update you next week.

Best Regards,
Christian

PS: Just one suggestion: we should also test cases with 32 and 128 meshes. Up to 32 meshes we have shared memory on the machine, meaning no "cable MPI transfer", only MPI via memory. If we use 64, 128, or 256 meshes, we can calculate the "cable MPI sending time", given that we know the memory-only MPI time from the 4, 16 and 32 mesh cases.
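The estimate I have in mind is simple subtraction, sketched below with placeholder numbers (these are NOT measurements from these runs):

```python
# Illustration of the estimate described above: runs with up to 32 meshes
# stay on one 32-core node, so their MPI cost is shared-memory only; larger
# runs add inter-node ("cable") transfer time on top of that.
# All timings are hypothetical placeholders, not results from these cases.
t_comm_one_node = 12.0   # hypothetical MPI time for a 32 mesh run (one node)
t_comm_two_nodes = 30.0  # hypothetical MPI time for a 64 mesh run (two nodes)

# Rough estimate of the inter-node ("cable") overhead
t_cable = t_comm_two_nodes - t_comm_one_node
print(f"estimated cable MPI time: {t_cable:.1f} s")
```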
Original comment by crog...@gmail.com
on 28 Mar 2008 at 12:28
Attachments:
Christian,
Thanks for giving this a try. I did check that each of these cases ran on our Linux cluster, so I will have to defer to Kevin to see if he has an explanation for the error messages you are seeing. In the meantime, I will put the 32 and 128 mesh input files together.
Cheers,
Randy
Original comment by randy.mc...@gmail.com
on 28 Mar 2008 at 12:36
Christian,
I ran the 4 mesh case (worst case for memory) in debug mode and did not have any problems, except that it looked like process zero was taking about twice the memory of the other processes. Can you try reducing the z resolution in each of the cases? Cut it in half until you see that the cases fit in memory. So, for example, change the MESH line from --> to

&MESH IJK=200,200,40,... --> &MESH IJK=200,200,20,...

(NOTE: J should be 200, by the way, for the 4 mesh case! I think I made a mistake and it might be 100 in the file I posted.)

Do this for all the MESH lines in each of the files. This should halve the memory requirement. Maybe even try K=10 if 20 does not work.
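If it helps, the edit can be scripted. Here is a small, hypothetical Python sketch for halving the K value on every &MESH line of an input file (the helper name and regex are my own, not part of FDS):

```python
import re

def halve_k(line: str) -> str:
    """Halve the K (z) cell count on an FDS &MESH line, e.g.
    '&MESH IJK=200,200,40,...' -> '&MESH IJK=200,200,20,...'.
    Lines without an IJK triplet are returned unchanged."""
    m = re.search(r"IJK=(\d+),(\d+),(\d+)", line)
    if not m:
        return line
    i, j, k = int(m.group(1)), int(m.group(2)), int(m.group(3))
    return line[:m.start()] + f"IJK={i},{j},{k // 2}" + line[m.end():]

# Example: reduce the z resolution of one MESH line
print(halve_k("&MESH IJK=200,200,40, XB=0,10,0,10,0,2 /"))
# -> &MESH IJK=200,200,20, XB=0,10,0,10,0,2 /
```

Applied line by line to each input file, this halves the memory requirement in the same way as the manual edit above.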
Cheers,
Randy
Original comment by randy.mc...@gmail.com
on 28 Mar 2008 at 1:33
Ok, actually I was wrong in comment #8 about n0 using twice the memory. I did not realize Glenn Forney was also using the cluster. I just did a clean run and it looks like the memory is balanced. However, I did run into some intermittent problems until I cut the K resolution to 20. So, give that a try and let me know if it helps.
Thanks,
Randy
Original comment by randy.mc...@gmail.com
on 28 Mar 2008 at 1:53
Randy,
I tested the 4 mesh case with &MESH IJK=200,200,20,... but there are still problems. I think this is a problem based on the machine. RAM is not the problem (3.5 GB per core); it can be divided into STACK and DATA "RAM". I used the standard values (3.0 GB DATA, 0.5 GB STACK), but I also changed the values, without success. The definition of STACK and DATA is:
In FORTRAN terms, stack is used for:
- code compiled with XLF compiler option "-qnosave"
- subprogram calling information
- local variables, including arrays, unless they are marked SAVE
Data is used for:
- code compiled with XLF compiler option "-qsave"
- program code
- static variables, including COMMON variables and variables marked SAVE
- memory allocated by ALLOCATE - known as 'heap' variables
- buffers allocated by MPI
- the I/O system
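On a POSIX shell, the soft limits behind these segments can usually be inspected as below (a generic ksh/bash sketch; the exact controls on AIX may differ):

```shell
# Inspect the per-process soft limits that govern the stack and data
# segments; on AIX these correspond to the STACK and DATA sizes above.
ulimit -s   # stack segment limit
ulimit -d   # data segment limit
```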
I have reported my problem to the "Supercomputer Team" at JSC, Jülich. If I get an answer or a solution, I will post it (and hopefully the results) to the issue tracker.

Best Regards,
Christian
Original comment by crog...@gmail.com
on 2 Apr 2008 at 8:16
Randy,
The problem was the compiling. With the help of Armin Seyfried and Bernd Koerfgen from the Jülich Supercomputing Centre I was able to compile the code, and it works (I had compiled 32-bit; 64-bit was the solution). When all 4 cases are finished I will post the .out file for each case. For now I compiled only with the -O3 -q64 options; other settings could produce faster code, but I have to test whether "aggressive" optimization produces the same results. Here are the makefile settings I used for compiling on an AIX system (maybe they could be added to the makefile):

# AIX, JUMP, MPI version
AIX_MPI : FFLAGS = -O3 -q64
AIX_MPI : CFLAGS = -O3 -Dpp_noappend -q64
AIX_MPI : FCOMPL = mpxlf90
AIX_MPI : CCOMPL = mpcc
AIX_MPI : obj = fds5_jump_mpi_64
AIX_MPI : $(obj_mpi)
	$(FCOMPL) $(FFLAGS) -o $(obj) $(obj_mpi)
Original comment by crog...@gmail.com
on 2 Apr 2008 at 10:20
Christian,
That is great. Thanks. I will add the makefile info.

I had mentioned that I would create other input files for the +32 mesh cases. I just posted a new Bernardo_Trails2.zip to the discussion group:

http://groups.google.com/group/fds-smv/web/Bernardo_Trails2.zip

There is now a 32 grid case and a 128 grid case. Note, however, that they both use 20 cells in z. So if you decide to use these, make sure that all the other runs also use 20 cells in z.
Cheers,
Randy
Original comment by randy.mc...@gmail.com
on 2 Apr 2008 at 11:37
Randy,
here are the .out files for the original cases with &MESH IJK=200,100,40,...

The 256 mesh case needs more than 10 minutes to start the calculation; I think this is due to the high MPI traffic for "mesh finding".

If there are no changes in your 32 and 128 mesh test cases other than the z resolution, I will start those cases, too. Please confirm that I only have to change the z value from 20 to 40 so that they match the "old" cases.
Best Regards
Christian
Original comment by crog...@gmail.com
on 3 Apr 2008 at 12:21
Attachments:
Christian,
Thanks! Yes, the only change in the 32 and 128 cases is the z dimension.

However, I am a little worried about your 4 mesh case. In the note you just posted you mentioned "&MESH IJK=200,100,40". I pointed out earlier that this was a mistake I made in the 4 mesh input file... it should be "&MESH IJK=200,200,40". Sorry! Can you double-check that this is correct in the case you actually ran? Otherwise the scaling results will not be relevant.
Best,
Randy
Original comment by randy.mc...@gmail.com
on 3 Apr 2008 at 12:35
Randy,
I forgot the changes... here are the results of the corrected 4 mesh case with "&MESH IJK=200,200,40...".

I will also change the z coordinate in the 32 and 128 mesh cases. The results will follow.
Regards,
Christian
Original comment by crog...@gmail.com
on 3 Apr 2008 at 12:58
Attachments:
Now here is the 32 mesh result with z = 40.

The 128 mesh case is in the queue; I think it will be finished tomorrow.
Original comment by crog...@gmail.com
on 3 Apr 2008 at 1:16
Attachments:
Randy,
the machine in Jülich was very fast, so I can already present the 128 mesh result. When you have finished the "speed-up" analysis, I would be very interested in it.

If you need any other files from the test cases, please write, and I can download them from the machine.
Best Regards
Christian
Original comment by crog...@gmail.com
on 3 Apr 2008 at 2:15
Attachments:
Christian,
Thanks! This is excellent. I am working on getting the timings together right now. The load balancing looks very good, but it looks like we are not yet accounting correctly for the wall clock time in the time step loop. That is, the CPU time in the subroutines does not seem to add up to the total CPU time in main_mpi. As you pointed out, this probably has to do with the geometry set-up in some way... but that still does not seem to account for everything. So, Kevin and I are working to get this sorted out and will get back to you asap.
Again, many thanks for running these cases!
Cheers,
Randy
Original comment by randy.mc...@gmail.com
on 3 Apr 2008 at 2:21
I just poked around with Google and found the discussion (linked below) of CPU_TIME in AIX Fortran. CPU_TIME is what we use to get the total CPU time of the calculation. It is a standard call in Fortran 95, but its interpretation is compiler dependent. What we need to know is how your machine is interpreting CPU_TIME. Apparently, AIX assumes, by default, that CPU_TIME is the total user plus system time. To change this, you need to set an XLFRTEOPTS environment variable. If I recall from my days running on an IBM, I would do an "env" command to see what the settings were for the machine. For more details --

http://publib.boulder.ibm.com/infocenter/pseries/v5r3/index.jsp?topic=/com.ibm.xlf101a.doc/xlflr/cpu_time.htm
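For example, selecting user-only time would look something like the following shell sketch (the cpu_time_type suboption and its value are taken from the IBM documentation above; check that page for the exact spelling on your system):

```shell
# Hypothetical session: tell the XLF runtime how CPU_TIME should be
# interpreted before launching the run (suboptions per the IBM docs).
export XLFRTEOPTS="cpu_time_type=usertime"
env | grep XLFRTEOPTS   # confirm the setting is in the environment
```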
Original comment by mcgra...@gmail.com
on 3 Apr 2008 at 2:47
Christian,
The scaling looks good! I have attached a few files below. The first is a JPEG of the scaling plot. The next is the Matlab script I used to generate the plot, in case you want to change it at all (do you use Matlab?). There is also an Excel file where I computed the wall clock time; I took these timings from the time stamps on the iterations in the .out files. The other .dat files that I have posted are just the total wall clock time for each CPU, and there is a Matlab script to read these files and generate a bar chart showing the CPU load.

It is interesting to note in the scaling plot that the best improvement comes from going from 32 to 64 meshes. Up to 32 processes the cores are accessing the same memory. As we get up to 128 processes we are starting to hit one of two limits: either the MPI "surface to volume" limit, or the limit of the direct linear solve (Gauss-Jordan elimination) on the coarse grid Poisson solve. I doubt that the coarse grid solve is a problem at this point (inverting a 256x256 matrix should not be too hard), but I am not sure... eventually we will use Conjugate Gradient here for large, multi-processor runs.
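The speedup and efficiency numbers behind such a plot come from a calculation along these lines (the timings below are placeholders, NOT the actual Bernardo_Trails results):

```python
# Sketch of the speedup/efficiency calculation from wall clock times,
# keyed by mesh (process) count. All timings here are hypothetical.
wall_clock = {4: 3600.0, 16: 1000.0, 32: 560.0, 64: 250.0, 128: 150.0}

base_n, base_t = 4, wall_clock[4]   # normalize to the 4 mesh case
for n, t in sorted(wall_clock.items()):
    speedup = base_t / t
    efficiency = speedup / (n / base_n)   # ideal speedup is n / base_n
    print(f"{n:4d} meshes: speedup {speedup:5.1f}, efficiency {efficiency:4.2f}")
```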
Thanks for your help with all this!
Cheers,
Randy
Original comment by randy.mc...@gmail.com
on 3 Apr 2008 at 3:55
Attachments:
Kevin,
You are right in your assumption that CPU_TIME is the sum of 'total user time' and 'system time'. It is possible to change this, so if you need a "special" CPU_TIME, please let me know. The possibilities are shown in the link you posted.

Randy,

I do not use Matlab; I use gnuplot for creating graphs. It is free and sufficient for my usage. Can you explain your plans for a CG solver for large multi-processor runs? Have you tried to implement a CG solver in the code, or is it just "the last resort" for large multi-processor runs?

Regards,
Christian
Original comment by crog...@gmail.com
on 4 Apr 2008 at 1:28
The current CPU_TIME setting is OK. We need to put more subroutine timers into the code to determine where the system is idling in the 256 mesh case. Ideally, we want to account for where the computers are either working or idling by summing up the CPU times for all the major parts of the code and checking that they add up to something close to the total CPU time. In the 256 mesh case, the MAIN CPU usage is far greater than the sum of the subroutine CPU usage. This means that we are not counting all the routines and do not know where the waste is.
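The accounting check described here amounts to something like the following sketch (the routine names and timings are hypothetical, not from the actual 256 mesh run):

```python
# Sum the timed subroutine costs and compare with the total CPU time;
# the gap is "unaccounted" time (untimed routines, or idle/wait time).
# Routine names and numbers below are illustrative placeholders.
timers = {"VELOCITY": 120.0, "PRESSURE": 200.0, "DIVERGENCE": 60.0, "DUMP": 20.0}
total_main = 450.0   # total CPU time reported in the main routine

accounted = sum(timers.values())
unaccounted = total_main - accounted
print(f"accounted {accounted:.0f}s of {total_main:.0f}s; "
      f"unaccounted {unaccounted:.0f}s ({100*unaccounted/total_main:.0f}%)")
```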
Original comment by mcgra...@gmail.com
on 4 Apr 2008 at 1:35
Christian,
Regarding the plans for the CG solver: if you look in the latest FDS Tech Guide, there is a description of our new "pressure correction" algorithm, which we need in order to enforce volume conservation from mesh to mesh in a multi-mesh calculation (see the appendix on domain decomposition). Within this algorithm we need to solve a linear system on the coarse mesh (similar to the coarse solve in a multigrid method, only we do not have a series of refined pre- and post-smoothings surrounding the coarse solve). At the moment we use a direct LU decomposition for this solve. For a small number of meshes (my guess is <1000) I don't expect to see a significant time hit from the inefficiency of the direct solve. But, given that the matrix for the linear system is M x M (where M is the number of meshes), symmetric, and positive-definite, CG will likely be the best choice for this coarse solve in the long run (note that this coarse solve is not parallelized at this point -- it is performed redundantly on each CPU). The linear solve on the fine grid will continue to be done with the FISHPAK FFT solver independently on each mesh.
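To illustrate why CG fits a symmetric positive-definite M x M system, here is a minimal, textbook conjugate-gradient sketch in pure Python (for clarity only; this is not the FDS implementation):

```python
# Textbook conjugate gradient for a small symmetric positive-definite
# system A x = b, starting from x = 0. Pure Python lists for clarity.
def cg(A, b, tol=1e-10, max_iter=100):
    n = len(b)
    x = [0.0] * n
    r = b[:]                      # residual r = b - A x (x = 0 initially)
    p = r[:]                      # initial search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs_old / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:          # converged when the residual is tiny
            break
        p = [r[i] + (rs_new / rs_old) * p[i] for i in range(n)]
        rs_old = rs_new
    return x

# SPD test system: 4x + y = 1, x + 3y = 2; exact solution is [1/11, 7/11]
print(cg([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0]))
```

For an SPD system of size M, CG converges in at most M iterations in exact arithmetic, and each iteration needs only matrix-vector products, which is why it scales better than a dense direct solve as M grows.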
Cheers,
Randy
Original comment by randy.mc...@gmail.com
on 4 Apr 2008 at 2:00
Original comment by randy.mc...@gmail.com
on 23 Jul 2008 at 12:42
Original issue reported on code.google.com by crog...@gmail.com
on 17 Mar 2008 at 2:50