Closed by GoogleCodeExporter.
Agreed. I think that file formatting is the only issue. I do not think that there is any particular limitation as far as FDS is concerned. I'll make the change.
Original comment by mcgra...@gmail.com
on 17 Mar 2008 at 4:14
I committed some changes that should allow you to run between 100 and 999 meshes. I am still using somewhat old-fashioned Fortran write statements. I am now able to run a 128 mesh case, and I think I fixed all the formatting problems. I will mark this as Fixed for now, but let me know if you still have problems. We will probably still suffer some growing pains.

BTW, we are currently testing a new pressure correction scheme that requires more MPI exchanges per time step. Unfortunately, we cannot determine whether some of the strange scaling results we are seeing are due to our Linux cluster or to our MPI programming. Would it be possible to do a few tests on your machine? Last time we did this exercise, your machine was perfect in that the CPUs were all almost 100% busy during the whole run. Our machines have the same problem as Dave McGill's cluster in Canada: it always seems that one of the processors on a multi-processor machine is more active/efficient than the others.
Original comment by mcgra...@gmail.com
on 17 Mar 2008 at 5:06
Thanks for the changes.

We can do some tests on the machine I use, if that will help you. I think you should create some simple examples and post them (or send them to my email address), and then I will start them on the machine. When you create the test cases, please use, if possible, only 32, 64, 96, 128, ... (increments of 32) meshes, since that is best for this machine. Each node consists of 32 cores with shared memory, so a case with a multiple of 32 meshes is optimal here.
Original comment by crog...@gmail.com
on 18 Mar 2008 at 10:43
OK, we will put together a suite of test cases with meshes of 32, 64 and 128, and post them to this Tracker when they are ready. I will keep this marked Fixed, but Open, so we can use it to exchange info.
Original comment by mcgra...@gmail.com
on 24 Mar 2008 at 12:02
Christian,
I have created four input files with 4, 16, 64, and 256 meshes. Sorry, I just noticed that you requested increments of 32. I can create more if needed, but hopefully the 64 and 256 cases fulfill the requirement. You should be able to retrieve the files from the link below. Let me know if you have any problems. The cases should each run for about 100 time steps. Could you please post the .out files for each run?
Thanks!
Randy
http://groups.google.com/group/fds-smv/web/Bernardo_Trails.zip
Original comment by randy.mc...@gmail.com
on 24 Mar 2008 at 10:44
Randy,
I started all the files, but the machine has some problems with the memory allocation, so I had no success. I will contact the support people next week to see if they have an explanation for these problems. The problem occurs with all files (4, 16, 64 and 256). I have attached the error-message file from the 256 mesh case so you can see what messages I got.

I will update you next week.

Best Regards,
Christian

PS: Just one suggestion: we should also test cases with 32 and 128 meshes. Up to 32 meshes we have shared memory on the machine, meaning no "cable MPI transfer", only MPI via memory. If we use 64, 128, or 256 meshes, we can calculate the "cable MPI sending time", given that we know the memory-only MPI time from the 4, 16 and 32 mesh cases.
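The estimate I have in mind is simple subtraction, sketched below with placeholder numbers (these are NOT measurements from these runs):

```python
# Illustration of the estimate described above: runs with up to 32 meshes
# stay on one 32-core node, so their MPI cost is shared-memory only; larger
# runs add inter-node ("cable") transfer time on top of that.
# All timings are hypothetical placeholders, not results from these cases.
t_comm_one_node = 12.0   # hypothetical MPI time for a 32 mesh run (one node)
t_comm_two_nodes = 30.0  # hypothetical MPI time for a 64 mesh run (two nodes)

# Rough estimate of the inter-node ("cable") overhead
t_cable = t_comm_two_nodes - t_comm_one_node
print(f"estimated cable MPI time: {t_cable:.1f} s")
```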
Original comment by crog...@gmail.com
on 28 Mar 2008 at 12:28
Attachments:
Christian,
Thanks for giving this a try. I did check that each of these cases ran on our Linux cluster, so I will have to defer to Kevin to see if he has an explanation for the error messages you are seeing. In the meantime, I will put the 32 and 128 mesh input files together.
Cheers,
Randy
Original comment by randy.mc...@gmail.com
on 28 Mar 2008 at 12:36
Christian,
I ran the 4 mesh case (worst case for memory) in debug mode and did not have any problems, except that it looked like process zero was taking about twice the memory of the other processes. Can you try reducing the z resolution in each of the cases? Cut it in half until you see that the cases fit in memory. So, for example, change the MESH line from --> to

&MESH IJK=200,200,40,... --> &MESH IJK=200,200,20,...

(NOTE: J should be 200, by the way, for the 4 mesh case! I think I made a mistake and it might be 100 in the file I posted.)

Do this for all the MESH lines in each of the files. This should halve the memory requirement. Maybe even try K=10 if 20 does not work.
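If it helps, the edit can be scripted. Here is a small, hypothetical Python sketch for halving the K value on every &MESH line of an input file (the helper name and regex are my own, not part of FDS):

```python
import re

def halve_k(line: str) -> str:
    """Halve the K (z) cell count on an FDS &MESH line, e.g.
    '&MESH IJK=200,200,40,...' -> '&MESH IJK=200,200,20,...'.
    Lines without an IJK triplet are returned unchanged."""
    m = re.search(r"IJK=(\d+),(\d+),(\d+)", line)
    if not m:
        return line
    i, j, k = int(m.group(1)), int(m.group(2)), int(m.group(3))
    return line[:m.start()] + f"IJK={i},{j},{k // 2}" + line[m.end():]

# Example: reduce the z resolution of one MESH line
print(halve_k("&MESH IJK=200,200,40, XB=0,10,0,10,0,2 /"))
# -> &MESH IJK=200,200,20, XB=0,10,0,10,0,2 /
```

Applied line by line to each input file, this halves the memory requirement in the same way as the manual edit above.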
Cheers,
Randy
Original comment by randy.mc...@gmail.com
on 28 Mar 2008 at 1:33
Ok, actually I was wrong in comment #8 about n0 using twice the memory. I did not realize Glenn Forney was also using the cluster. I just did a clean run and it looks like the memory is balanced. However, I did run into some intermittent problems until I cut the K resolution to 20. So, give that a try and let me know if it helps.
Thanks,
Randy
Original comment by randy.mc...@gmail.com
on 28 Mar 2008 at 1:53
Randy,
I tested the 4 mesh case with &MESH IJK=200,200,20,... but there are still problems. I think this is a problem based on the machine. RAM is not the problem (3.5 GB per core); it can be divided into STACK and DATA "RAM". I used the standard values (3.0 GB DATA, 0.5 GB STACK), but I also changed the values, without success. The definition of STACK and DATA is:
In FORTRAN terms, stack is used for:
- code compiled with XLF compiler option "-qnosave"
- subprogram calling information
- local variables, including arrays, unless they are marked SAVE
Data is used for:
- code compiled with XLF compiler option "-qsave"
- program code
- static variables, including COMMON variables and variables marked SAVE
- memory allocated by ALLOCATE - known as 'heap' variables
- buffers allocated by MPI
- the I/O system
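On a POSIX shell, the soft limits behind these segments can usually be inspected as below (a generic ksh/bash sketch; the exact controls on AIX may differ):

```shell
# Inspect the per-process soft limits that govern the stack and data
# segments; on AIX these correspond to the STACK and DATA sizes above.
ulimit -s   # stack segment limit
ulimit -d   # data segment limit
```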
I have reported my problem to the "Supercomputer Team" at JSC, Jülich. If I get an answer or a solution, I will post it (and hopefully the results) to the issue tracker.

Best Regards,
Christian
Original comment by crog...@gmail.com
on 2 Apr 2008 at 8:16
Randy,
The problem was the compiling. With the help of Armin Seyfried and Bernd Koerfgen from the Jülich Supercomputing Centre I was able to compile the code, and it works (I had compiled 32-bit; 64-bit was the solution). When all 4 cases are finished I will post the .out file for each case. For now I compiled only with the -O3 -q64 options; other settings could produce faster code, but I have to test whether "aggressive" optimization produces the same results. Here are the makefile settings I used for compiling on an AIX system (maybe they could be added to the makefile):

# AIX, JUMP, MPI version
AIX_MPI : FFLAGS = -O3 -q64
AIX_MPI : CFLAGS = -O3 -Dpp_noappend -q64
AIX_MPI : FCOMPL = mpxlf90
AIX_MPI : CCOMPL = mpcc
AIX_MPI : obj = fds5_jump_mpi_64
AIX_MPI : $(obj_mpi)
	$(FCOMPL) $(FFLAGS) -o $(obj) $(obj_mpi)
Original comment by crog...@gmail.com
on 2 Apr 2008 at 10:20
Christian,
That is great. Thanks. I will add the makefile info.

I had mentioned that I would create other input files for the +32 mesh cases. I just posted a new Bernardo_Trails2.zip to the discussion group:

http://groups.google.com/group/fds-smv/web/Bernardo_Trails2.zip

There is now a 32 grid case and a 128 grid case. Note, however, that they both use 20 cells in z. So if you decide to use these, make sure that all the other runs also use 20 cells in z.
Cheers,
Randy
Original comment by randy.mc...@gmail.com
on 2 Apr 2008 at 11:37
Randy,
here are the .out files for the original cases with &MESH IJK=200,100,40,...

The 256 mesh case needs more than 10 minutes to start the calculation; I think this is due to the high MPI traffic for "mesh finding".

If there are no changes in your 32 and 128 mesh test cases other than the z resolution, I will start those cases, too. Please confirm that I only have to change the z value from 20 to 40 so that they match the "old" cases.
Best Regards
Christian
Original comment by crog...@gmail.com
on 3 Apr 2008 at 12:21
Attachments:
Christian,
Thanks! Yes, the only change in the 32 and 128 cases is the z dimension.

However, I am a little worried about your 4 mesh case. In the note you just posted you mentioned "&MESH IJK=200,100,40". I pointed out earlier that this was a mistake I made in the 4 mesh input file... it should be "&MESH IJK=200,200,40". Sorry! Can you double-check that this is correct in the case you actually ran? Otherwise the scaling results will not be relevant.
Best,
Randy
Original comment by randy.mc...@gmail.com
on 3 Apr 2008 at 12:35
Randy,
I forgot the changes... here are the results of the corrected 4 mesh case with "&MESH IJK=200,200,40...".

I will also change the z coordinate in the 32 and 128 mesh cases. The results will follow.
Regards,
Christian
Original comment by crog...@gmail.com
on 3 Apr 2008 at 12:58
Attachments:
Now here is the 32 mesh result with z = 40.

The 128 mesh case is in the queue; I think it will be finished tomorrow.
Original comment by crog...@gmail.com
on 3 Apr 2008 at 1:16
Attachments:
Randy,
the machine in Jülich was very fast, so I can already present the 128 mesh result. When you have finished the "speed-up" analysis, I would be very interested in it.

If you need any other files from the test cases, please write, and I can download them from the machine.
Best Regards
Christian
Original comment by crog...@gmail.com
on 3 Apr 2008 at 2:15
Attachments:
Christian,
Thanks! This is excellent. I am working on getting the timings together right now. The load balancing looks very good, but it looks like we are not yet accounting correctly for the wall clock time in the time step loop. That is, the CPU time in the subroutines does not seem to add up to the total CPU time in main_mpi. As you pointed out, this probably has to do with the geometry set-up in some way... but that still does not seem to account for everything. So, Kevin and I are working to get this sorted out and will get back to you asap.
Again, many thanks for running these cases!
Cheers,
Randy
Original comment by randy.mc...@gmail.com
on 3 Apr 2008 at 2:21
I just poked around with Google and found the discussion (linked below) of CPU_TIME in AIX Fortran. CPU_TIME is what we use to get the total CPU time of the calculation. It is a standard call in Fortran 95, but its interpretation is compiler dependent. What we need to know is how your machine is interpreting CPU_TIME. Apparently, AIX assumes, by default, that CPU_TIME is the total user plus system time. To change this, you need to set an XLFRTEOPTS environment variable. If I recall from my days running on an IBM, I would do an "env" command to see what the settings were for the machine. For more details --

http://publib.boulder.ibm.com/infocenter/pseries/v5r3/index.jsp?topic=/com.ibm.xlf101a.doc/xlflr/cpu_time.htm
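For example, selecting user-only time would look something like the following shell sketch (the cpu_time_type suboption and its value are taken from the IBM documentation above; check that page for the exact spelling on your system):

```shell
# Hypothetical session: tell the XLF runtime how CPU_TIME should be
# interpreted before launching the run (suboptions per the IBM docs).
export XLFRTEOPTS="cpu_time_type=usertime"
env | grep XLFRTEOPTS   # confirm the setting is in the environment
```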
Original comment by mcgra...@gmail.com
on 3 Apr 2008 at 2:47
Christian,
The scaling looks good! I have attached a few files below. The first is a JPEG of the scaling plot. The next is the Matlab script I used to generate the plot, in case you want to change it at all (do you use Matlab?). There is also an Excel file where I computed the wall clock time; I took these timings from the time stamps on the iterations in the .out files. The other .dat files that I have posted are just the total wall clock time for each CPU, and there is a Matlab script to read these files and generate a bar chart showing the CPU load.

It is interesting to note in the scaling plot that the best improvement comes from going from 32 to 64 meshes. Up to 32 processes the cores are accessing the same memory. As we get up to 128 processes we are starting to hit one of two limits: either the MPI "surface to volume" limit, or the limit of the direct linear solve (Gauss-Jordan elimination) on the coarse grid Poisson solve. I doubt that the coarse grid solve is a problem at this point (inverting a 256x256 matrix should not be too hard), but I am not sure... eventually we will use Conjugate Gradient here for large, multi-processor runs.
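The speedup and efficiency numbers behind such a plot come from a calculation along these lines (the timings below are placeholders, NOT the actual Bernardo_Trails results):

```python
# Sketch of the speedup/efficiency calculation from wall clock times,
# keyed by mesh (process) count. All timings here are hypothetical.
wall_clock = {4: 3600.0, 16: 1000.0, 32: 560.0, 64: 250.0, 128: 150.0}

base_n, base_t = 4, wall_clock[4]   # normalize to the 4 mesh case
for n, t in sorted(wall_clock.items()):
    speedup = base_t / t
    efficiency = speedup / (n / base_n)   # ideal speedup is n / base_n
    print(f"{n:4d} meshes: speedup {speedup:5.1f}, efficiency {efficiency:4.2f}")
```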
Thanks for your help with all this!
Cheers,
Randy
Original comment by randy.mc...@gmail.com
on 3 Apr 2008 at 3:55
Attachments:
Kevin,
You are right in your assumption that CPU_TIME is the sum of 'total user time' and 'system time'. It is possible to change this, so if you need a "special" CPU_TIME, please let me know. The possibilities are shown in the link you posted.

Randy,

I do not use Matlab; I use gnuplot for creating graphs. It is free and sufficient for my usage. Can you explain your plans for a CG solver for large multi-processor runs? Have you tried to implement a CG solver in the code, or is it just "the last resort" for large multi-processor runs?

Regards,
Christian
Original comment by crog...@gmail.com
on 4 Apr 2008 at 1:28
The current CPU_TIME setting is OK. We need to put more subroutine timers into the code to determine where the system is idling in the 256 mesh case. Ideally, we want to account for where the computers are either working or idling by summing up the CPU times for all the major parts of the code and checking that they add up to something close to the total CPU time. In the 256 mesh case, the MAIN CPU usage is far greater than the sum of the subroutine CPU usage. This means that we are not counting all the routines and do not know where the waste is.
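The accounting check described here amounts to something like the following sketch (the routine names and timings are hypothetical, not from the actual 256 mesh run):

```python
# Sum the timed subroutine costs and compare with the total CPU time;
# the gap is "unaccounted" time (untimed routines, or idle/wait time).
# Routine names and numbers below are illustrative placeholders.
timers = {"VELOCITY": 120.0, "PRESSURE": 200.0, "DIVERGENCE": 60.0, "DUMP": 20.0}
total_main = 450.0   # total CPU time reported in the main routine

accounted = sum(timers.values())
unaccounted = total_main - accounted
print(f"accounted {accounted:.0f}s of {total_main:.0f}s; "
      f"unaccounted {unaccounted:.0f}s ({100*unaccounted/total_main:.0f}%)")
```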
Original comment by mcgra...@gmail.com
on 4 Apr 2008 at 1:35
Christian,
Regarding the plans for the CG solver: if you look in the latest FDS Tech Guide, there is a description of our new "pressure correction" algorithm, which we need in order to enforce volume conservation from mesh to mesh in a multi-mesh calculation (see the appendix on domain decomposition). Within this algorithm we need to solve a linear system on the coarse mesh (similar to the coarse solve in a multigrid method, only we do not have a series of refined pre- and post-smoothings surrounding the coarse solve). At the moment we use a direct LU decomposition for this solve. For a small number of meshes (my guess is <1000) I don't expect to see a significant time hit from the inefficiency of the direct solve. But, given that the matrix for the linear system is M x M (where M is the number of meshes), symmetric, and positive-definite, CG will likely be the best choice for this coarse solve in the long run (note that this coarse solve is not parallelized at this point -- it is performed redundantly on each CPU). The linear solve on the fine grid will continue to be done with the FISHPAK FFT solver independently on each mesh.
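To illustrate why CG fits a symmetric positive-definite M x M system, here is a minimal, textbook conjugate-gradient sketch in pure Python (for clarity only; this is not the FDS implementation):

```python
# Textbook conjugate gradient for a small symmetric positive-definite
# system A x = b, starting from x = 0. Pure Python lists for clarity.
def cg(A, b, tol=1e-10, max_iter=100):
    n = len(b)
    x = [0.0] * n
    r = b[:]                      # residual r = b - A x (x = 0 initially)
    p = r[:]                      # initial search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs_old / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:          # converged when the residual is tiny
            break
        p = [r[i] + (rs_new / rs_old) * p[i] for i in range(n)]
        rs_old = rs_new
    return x

# SPD test system: 4x + y = 1, x + 3y = 2; exact solution is [1/11, 7/11]
print(cg([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0]))
```

For an SPD system of size M, CG converges in at most M iterations in exact arithmetic, and each iteration needs only matrix-vector products, which is why it scales better than a dense direct solve as M grows.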
Cheers,
Randy
Original comment by randy.mc...@gmail.com
on 4 Apr 2008 at 2:00
Original comment by randy.mc...@gmail.com
on 23 Jul 2008 at 12:42
Original issue reported on code.google.com by crog...@gmail.com
on 17 Mar 2008 at 2:50