anuga-community / anuga_core

ANUGA for the simulation of the shallow water equation
https://anuga.anu.edu.au

Scaling issue with sample parallel/rectangular_cross test case #27

Open samcom12 opened 2 years ago

samcom12 commented 2 years ago

Hi, I tried benchmarking the latest run_parallel_rectangular_cross.py script. The EVOLVE loop does not seem to scale at all as the number of processes increases.

rectangular_test_default_param_46_MPI.txt rectangular_test_default_param_460_MPI.txt

stoiver commented 2 years ago

@samcom12 thanks for looking at the run_parallel_rectangular_cross.py script.

I have been playing with that script so that I can easily change the mesh size, among other parameters, at the command line. The default at the moment is just 40,000 triangles. I think (at least at the moment) we can't expect any speed-up once the sub-domains become smaller than, say, 1000 triangles.

You can run with a large mesh by using the --sqrtN argument. For instance

python run_parallel_rectangular_cross.py --sqrtN 500

should run with 1_000_000 triangles (500 x 500 x 4).

I tested up to 9_000_000 triangles with Python 2 and pypar and obtained the same behaviour as the current Python 3 with mpi4py: roughly 50% scalability once number_of_triangles / number_of_processes drops to approximately 10,000. At that threshold, a 1M-triangle mesh should hold about 50% efficiency at around 100 processes, and a 9M-triangle mesh at around 900.

I am investigating overlapping communication with computation to see if we can improve the scalability.
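
For reference, here is a minimal sketch (not ANUGA code) of the overlap pattern using mpi4py non-blocking calls, assuming a simple halo exchange between neighbouring ranks arranged in a ring; update_interior and update_boundary are hypothetical placeholders for the local flux computation:

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
left, right = (rank - 1) % size, (rank + 1) % size

send_l = np.ones(1000); send_r = np.ones(1000)
recv_l = np.empty(1000); recv_r = np.empty(1000)

# 1. Start the non-blocking halo exchange with both neighbours.
reqs = [comm.Isend(send_l, dest=left), comm.Isend(send_r, dest=right),
        comm.Irecv(recv_l, source=left), comm.Irecv(recv_r, source=right)]

# 2. Do the work that does not depend on halo data while messages are in flight.
# update_interior()                  # placeholder for the interior computation

# 3. Wait for the halos, then finish the cells that need neighbour data.
MPI.Request.Waitall(reqs)
# update_boundary(recv_l, recv_r)    # placeholder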

I am surprised that the current results are not as good as the results we saw in the old study. I'm wondering if that study was run on a machine with much faster communication infrastructure relative to computation.

stoiver commented 2 years ago

@samcom12 here is a plot of my scalability experiments on our NCI Gadi system using run_parallel_rectangular_cross.py for 1M (million), 4M and 9M triangles (sqrtN set to 500, 1000 and 1500 respectively).

scalability_anuga_3_1_9

Here is the data on which this is based:

import pandas
from io import StringIO

import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

def process_table(table):
    # Parse a CSV timing table and return (processes, speedup, ideal speedup).
    # Speedup is measured relative to the smallest run in the table (48 processes),
    # not to a single-process run; the 'perfect' curve is scaled the same way.
    df = pandas.read_csv(StringIO(table))
    scaling = df['etime'][0]/df['etime']
    perfect = df['np']/df['np'][0]
    return df['np'], scaling, perfect

"""
np == number of processes
ntri == number of triangles
ctime == time to create sequential domain
dtime == time to distribute sequential domain to parallel domains
etime == time to evolve
"""
rectangle_nci_1M = """np,ntri,ctime,dtime,etime
48,1000000,3.009163,7.375380,2.454851
96,1000000,2.938114,8.417132,1.357496
192,1000000,3.260445,11.742471,1.068297
384,1000000,2.963929,10.158751,1.415354
768,1000000,2.992660,14.267908,1.500077
"""

rectangle_nci_4M = """np,ntri,ctime,dtime,etime
48,4000000,12.563187,30.908950,16.569530
96,4000000,12.359587,32.763117,9.548151
192,4000000,12.137918,36.419802,4.486357
384,4000000,12.819002,39.385591,3.005332
768,4000000,12.726759,46.763928,2.557212
1536,4000000,12.353777,69.391602,2.099987
"""

rectangle_nci_9M = """np,ntri,ctime,dtime,etime
48,9000000,32.486905,72.105292,19.746415
96,9000000,30.466313,72.639990,9.603525
192,9000000,31.240693,79.717921,5.500745
384,9000000,30.858758,84.591170,2.704646
768,9000000,29.903054,99.747053,2.618381
1536,9000000,30.112862,136.545405,1.725584
"""

std_1M_np  , std_1M_scaling, std_1M_perfect = process_table(rectangle_nci_1M)
std_4M_np  , std_4M_scaling, std_4M_perfect = process_table(rectangle_nci_4M)
std_9M_np  , std_9M_scaling, std_9M_perfect = process_table(rectangle_nci_9M)

base = 2 

plt.loglog(std_9M_np,std_9M_scaling, '-o', base=base, label='std 9M')
plt.loglog(std_4M_np,std_4M_scaling, '-o', base=base, label='std 4M')
plt.loglog(std_1M_np,std_1M_scaling, '-o', base=base, label='std 1M')

plt.loglog(std_4M_np,std_4M_perfect, base=base, label='100% efficiency')
plt.loglog(std_4M_np,std_4M_perfect/2, base=base, label='50% efficiency')
plt.legend()
plt.xlabel('cores')
plt.ylabel('speedup');
plt.title('Scaling on NCI gadi');

samcom12 commented 2 years ago

Thanks @stoiver ,

It shows good scaling on your machine. My system configuration and the ANUGA installation steps I used are below.

System configuration: Intel Cascade Lake nodes, Mellanox InfiniBand interconnect, intel-oneapi-mpi-2021.6.0, GCC 12.2.0, latest ANUGA version, latest mpi4py development version.

Installation steps:

spack load intel-oneapi-mpi@2021.6.0
spack load gcc@12.2.0

CFLAGS="-O3 -march=cascadelake -mtune=cascadelake"
CXXFLAGS="-O3 -march=cascadelake -mtune=cascadelake"
FCFLAGS="-O3 -march=cascadelake -mtune=cascadelake"

export I_MPI_CC=`which gcc`
export I_MPI_CXX=`which g++`
export I_MPI_FC=`which gfortran`

export MPICC=`which mpiicc`
export MPICXX=`which mpiicpc`
export MPIFC=`which mpiifort`

conda create -n anuga_gcc12 -c conda-forge python=3.7
conda activate anuga_gcc12

conda install -c conda-forge  cython dill future gitpython matplotlib netcdf4 numpy  pytest pytz scipy setuptools wheel pip -y
pip install --no-cache pmw utm meshpy pymetis pytriangle backports.zoneinfo tzdata
git clone https://github.com/mpi4py/mpi4py.git
cd mpi4py
python setup.py build -f
python setup.py install
cd ..

git clone https://github.com/anuga-community/anuga_core.git
cd anuga_core
python setup.py build -f
python setup.py install

samcom12 commented 2 years ago

And the SLURM script:

#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=48
#SBATCH --exclusive
#SBATCH -t 01:00:00
#SBATCH -p small

source /home/samir/miniconda3/bin/activate
conda activate anuga_gcc12
#spack load intel-oneapi-mpi@2021.6.0
source /scratch/apps/spack/opt/spack/linux-centos7-cascadelake/gcc-11.2.0/intel-oneapi-mpi-2021.6.0-wteqvl2un3dyqlevcmzy3y3naxneb4tt/mpi/2021.6.0/env/vars.sh

(time mpiexec.hydra -np $SLURM_NTASKS -ppn 48 /home/samir/miniconda3/envs/anuga_gcc12/bin/python -u /home/samir/samir/anuga_core/examples/parallel/run_parallel_rectangular.py --sqrtN 500 ) 2>&1 | tee $SLURM_NNODES\_nodes_$SLURM_NTASKS\_MPI_ranks_48_PPN_$SLURM_JOBID

rm sw_rectangle*

stoiver commented 2 years ago

A description of the NCI Gadi system can be found here. In particular, the above results were run on standard nodes with the following characteristics:

samcom12 commented 1 year ago

Hi @stoiver, can you share your installation steps and the MPI library you used?

We see reverse scaling with OpenMPI 4.1.4.

stoiver commented 1 year ago

@samcom12 Here are the NCI pre-compiled modules I use:

module load python3/3.9.2
module load openmpi/4.1.4
module load gdal/3.5.0
module load netcdf/4.8.0
module load hdf5/1.12.1
module load python3-as-python

The recommendation from our NCI technical staff is to use pip to install from source all the other Python packages needed for ANUGA (including ANUGA itself).

I note that you are using conda for your ANUGA setup. I will try that on Gadi and see if it causes problems.

stoiver commented 1 year ago

@samcom12, I thought the problem for you might have been the binary conda-forge packages, but as you can see, on NCI Gadi the conda installation gives reasonable scalability, at least as good as the installation from source.

scalability_anuga_conda_3_1_9

I have also rerun the experiments with my standard installation, but this time ensuring each run executes in a separate folder. This seems to have improved the scalability results. Perhaps there are subtle conflicts when multiple simulations run simultaneously from the same folder?

scalability_anuga_3_1_9_updated

samcom12 commented 1 year ago

Hello @stoiver ,

Greetings to you!

The latest commit breaks the domain.print_statistics() call. With the rectangular cross test case in verbose mode, we get an error: "TypeError: unsupported operand type(s) for //: 'str' and 'int'".

Just FYR.

cheers, Samir

stoiver commented 1 year ago

@samcom12, fixed the domain.print_statistics() problem.

It was created by our automated conversion of old_div calls to //: % (string formatting) and // share the same precedence and associate left-to-right, so 100 was formatted into the string first and the resulting string was then floor-divided by nbins.

Needed to change

str += '    Percentiles (%g percent):\n' % 100//nbins

to

str += '    Percentiles (%g percent):\n' % (100//nbins)
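
For anyone following along, here is a small standalone snippet (not the ANUGA code itself) that reproduces the precedence trap with a hypothetical nbins value:

nbins = 10

# Without parentheses, % (string formatting) and // share precedence and
# associate left-to-right, so the string is built first and then floor-divided:
#     ('... (%g percent):\n' % 100) // nbins  ->  TypeError
try:
    s = '    Percentiles (%g percent):\n' % 100 // nbins
except TypeError as e:
    print(e)     # unsupported operand type(s) for //: 'str' and 'int'

# With parentheses, the integer division happens first, as intended.
s = '    Percentiles (%g percent):\n' % (100 // nbins)
print(repr(s))   # '    Percentiles (10 percent):\n'
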
stoiver commented 1 year ago

@samcom12, just an update that I have changed the naming convention for the anuga.log file. It now has a time stamp in its name, like anuga_20221209_143822.log. When running a script multiple times from the same directory (as in multiple parallel batch jobs with different numbers of processes), the jobs would all have been writing to the same anuga.log file.
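
For illustration only (a sketch, not the actual ANUGA logging code), a log name in that format can be generated like this:

from datetime import datetime

# Produces a name such as anuga_20221209_143822.log, matching the new convention,
# so jobs started at different times no longer write to the same log file.
log_name = 'anuga_%s.log' % datetime.now().strftime('%Y%m%d_%H%M%S')
print(log_name)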

samcom12 commented 1 year ago

Thanks @stoiver!! Now we can run and scale the sample case with 1M triangles.

As another experiment we were trying to run:

mpirun -np 480 python run_parallel_rectangular.py -sn 5000 -v

on 10 Cascade Lake nodes, which gives us a bus error (signal 7) during evolve.

Do you see how an internal code break could happen with the above combination?

cheers, Samir

stoiver commented 1 year ago

@samcom12, my guess is that with sn = 5000 the number of triangles is 100_000_000, which may well be causing a memory problem. I must admit that with our Python coding the memory usage is quite large: we first create a sequential domain on processor 0, then replicate it, again on processor 0, and then distribute the sub-domains to the other processors. You should run the script manually on one processor and check the memory use. To run a very large problem, we suggest running a sequential job on a large-memory node to dump the partitions to files, and then running the parallel job to evolve. There is a pair of scripts in the examples directory for doing exactly that (see the sketch below).
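
For anyone who hits the same memory wall, a rough sketch of that two-stage workflow is below. The names sequential_distribute_dump and sequential_distribute_load are my recollection of the helpers in anuga.parallel.sequential_distribute, and the filename argument is hypothetical; please check the pair of scripts in examples/parallel for the exact, current API before relying on this.

# Sketch only -- assumed API; see the scripts in examples/parallel for the real thing.

# Stage 1 (sequential job on a large-memory node): build the full domain once
# and dump one partition file per target process.
import anuga
from anuga.parallel.sequential_distribute import sequential_distribute_dump   # assumed location

domain = anuga.rectangular_cross_domain(5000, 5000)    # ~100M triangles, as in the failing run
domain.set_name('rectangular_cross')                   # hypothetical output name
sequential_distribute_dump(domain, numprocs=480, verbose=True)

# Stage 2 (parallel job, e.g. mpirun -np 480 python ...): each rank loads only
# its own partition and evolves, so no single process holds the whole domain.
from anuga.parallel.sequential_distribute import sequential_distribute_load   # assumed location

domain = sequential_distribute_load(filename='rectangular_cross', verbose=True)
for t in domain.evolve(yieldstep=2.0, finaltime=20.0):
    domain.print_timestepping_statistics()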