anuga-community / anuga_core

ANUGA for the simulation of the shallow water equation
https://anuga.anu.edu.au

Use of ASYNC progress in ANUGA #23

Closed samcom12 closed 1 year ago

samcom12 commented 1 year ago

Hi @stephen,

Greetings of the day! The ANUGA manual describes ASYNC MPI communication, but when we benchmark it we are not seeing any performance gain.

Maybe this is because MPI4PY doesn't allow it to be used with ANUGA. Any comments from your end?
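For context, the kind of non-blocking (asynchronous) exchange we mean is sketched below in plain mpi4py; this is illustrative only, not ANUGA's actual communication code, and the buffer sizes and neighbour ranks are made up.

```python
# Minimal mpi4py sketch of a non-blocking (asynchronous) exchange,
# overlapping communication with local work.  Illustrative only --
# buffer sizes and neighbour ranks are made up, not ANUGA's code.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

left = (rank - 1) % size    # illustrative ring of neighbours
right = (rank + 1) % size

send_buf = np.full(1000, float(rank))
recv_buf = np.empty(1000)

# Post the non-blocking send and receive ...
requests = [comm.Isend(send_buf, dest=right, tag=0),
            comm.Irecv(recv_buf, source=left, tag=0)]

# ... do local computation while the messages are (hopefully) in flight ...
local_work = send_buf.sum()

# ... and only wait when the received data is actually needed.
MPI.Request.Waitall(requests)

if rank == 0:
    print('local work:', local_work, 'first received value:', recv_buf[0])
```

Whether the transfer really progresses during the local work depends on the MPI library's asynchronous progress support (e.g. a progress thread); without it the communication may only advance inside Waitall, which could be one reason a benchmark shows no gain.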

Cheers, Samir Shaikh

stoiver commented 1 year ago

Hi Samir,

It would be interesting to see some of your benchmark results.

There are also some examples which you could use as benchmarks, e.g.

  1. anuga_core/examples/parallel/run_parallel_rectangular.py
  2. anuga_core/examples/parallel/run_parallel_tsunami.py
  3. anuga_core/examples/parallel/run_parallel_merimbula.py

The first two examples are set up so that the size of the triangulation increases as the number of processes increases, so we would hope that the evolve time stays fairly constant.

The third example uses a small triangulation of 10,000 triangles. With such a small mesh we can't expect much more than a speedup of 10.

At present I don't have access to a large machine, so it would be great to get some timings on these example programs.

On my 4-core laptop I get the following:

mpiexec -np n python -u run_parallel_rectangular.py

 n,      N, cputime
 1,  40000,    2.29
 2,  62500,    2.14
 3,  82944,    2.75
 4,  99856,    2.72
 5, 115600,    2.73
 6, 131044,    2.91
 7, 145924,    3.12
 8, 160000,    3.49

mpiexec -np n python -u run_parallel_tsunami.py

 n,      N, cputime
 1,  10000,    3.92
 2,  15376,    3.47
 3,  20736,    4.46
 4,  24964,    5.03
 5,  28900,    5.71
 6,  32400,    5.93

mpiexec -np n python -u run_parallel_merimbula.py

 n,      N, cputime
 1,  10785,    2.58
 2,  10785,    1.39
 3,  10785,    1.21
 4,  10785,    1.08
 5,  10785,    0.94
 6,  10785,    0.90
 7,  10785,    0.88
 8,  10785,    0.95

Cheers Steve


samcom12 commented 1 year ago

Hi Stephen,

I ran the same example code (available in the latest branch) on Intel Cascade Lake nodes with 48 cores per node.

Below is a table of the results: [image: results table]

samcom12 commented 1 year ago

[image]

stoiver commented 1 year ago

Hi @samcom12, thanks for doing the benchmarking. Interesting results. Makes me realise that I should tweak the example scripts a little.

You might be interested in some parallel benchmarking I did quite a while ago. Here is a link to the conference paper: https://www.researchgate.net/publication/320046756_High_Resolution_Tsunami_Inundation_Simulations. That was using the old Python 2 code with pypar. I am hoping the current Python 3 code with mpi4py behaves in a similar fashion; if not, we will need to track down what is happening.

The structure of a parallel anuga script is:

  1. Create the sequential domain on process 0.
  2. Distribute the sequential domain to sub-domains on all the processes.
  3. Run the evolve loop.

The first step (sequential creation) is not parallel and so gains no advantage from parallelisation. As we work on larger problems this component will become expensive, in both cputime and memory, but it can be set up so that this step is reused across multiple simulations.

The second step distributes the domain and so involves parallel communication, but it is probably not too expensive. Once again, we can save the results of this step and reuse them for multiple simulations.

The third step (evolve) is typically the most time consuming and the most communication intensive, so it is where we should concentrate our parallel benchmarking. The example scripts use a short evolve duration, but in practice we would run for much longer durations, so this is where we need to concentrate our efforts to speed up our simulations.
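As a rough illustration, a skeleton of this three-step structure with per-phase timing looks something like the sketch below. This is only a sketch, loosely following the parallel examples: the mesh size, boundaries and evolve duration are placeholders, not the actual benchmark scripts.

```python
# Sketch of the three-step parallel ANUGA structure with per-phase timing.
# Mesh size, boundaries and evolve duration are placeholders.
import time
import anuga
from anuga import distribute, myid, finalize, barrier

t0 = time.time()

# (1) Create the sequential domain on process 0 only.
if myid == 0:
    domain = anuga.rectangular_cross_domain(200, 200, len1=1.0, len2=1.0)
    domain.set_quantity('elevation', 0.0)
    domain.set_quantity('stage', 1.0)
else:
    domain = None
barrier()
t1 = time.time()

# (2) Distribute the sequential domain to sub-domains on all processes.
domain = distribute(domain)
reflective = anuga.Reflective_boundary(domain)
domain.set_boundary({'left': reflective, 'right': reflective,
                     'top': reflective, 'bottom': reflective})
barrier()
t2 = time.time()

# (3) Run the evolve loop -- the communication-intensive part to benchmark.
for t in domain.evolve(yieldstep=0.1, finaltime=1.0):
    if myid == 0:
        domain.print_timestepping_statistics()
t3 = time.time()

if myid == 0:
    print('create     %.2f s' % (t1 - t0))
    print('distribute %.2f s' % (t2 - t1))
    print('evolve     %.2f s' % (t3 - t2))

finalize()
```

Running something like this with mpiexec -np 4 shows the point: the create time stays sequential regardless of process count, while the evolve time is the part we hope scales.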

With the benchmarking results you provided, did you measure evolve time, or complete wall time?

I will tweak the example scripts to output creation, distribute and evolve time. And that should give us a better idea of the characteristics of our parallel code.

We have run_parallel_merimbula.py, which uses a very small mesh of approximately 10,000 triangles, so it is probably not a good test on a large number of processes.

I will add an example like run_parallel_rectangular.py but with a large mesh of, say, 2,000,000 triangles. That will be an interesting scaling test. I would hope we will see reasonable scaling up to 500 processes (see the plot in the paper I referenced above).

I will tweak the example to address these comments.

stoiver commented 1 year ago

@samcom12 I have just uploaded tweaked versions of run_parallel_rectangular.py and run_parallel_tsunami.py. I have changed the creation of the domain so that it has a fixed number of triangles (1,000,000), and I have also printed out the time to create, distribute and evolve the domain.

samcom12 commented 1 year ago

Hi Stephen,

We tried running with the latest code changes. Attaching the SLURM output for your reference: slurm-24484.txt slurm-24485.txt slurm-24486.txt slurm-24487.txt slurm-24488.txt slurm-24489.txt

In each SLURM output, the first result is for the rectangular test case and the second is for the tsunami case.

stoiver commented 1 year ago

Hi @samcom12, thanks for the new runs. You are right that the scaling (especially for the tsunami example) doesn't look too good. Here is my notebook output with a plot at the end. scaling_example.pdf

Strange, as the underlying model for both the rectangular and tsunami examples uses the same mesh.

Do you know whether your batch jobs get full access to the nodes and are not influenced by other jobs?

Also what communication network does your machine use?

I will try to run my own large benchmarks and see if I can find the problem on my end.

samcom12 commented 1 year ago

Hello Stephen,

We are using the SLURM scheduler and we get dedicated nodes. We are using an InfiniBand network for our cluster.

samcom12 commented 1 year ago

Hi Stephen,

Thanks for your continued support with our ANUGA queries. We are using ANUGA to simulate flooding across approximately 11 thousand square kilometres. With a granular mesh this generates roughly 60 million (6 crore) triangles. I need your input on the points below:

  1. What will be the impact of changing parameters such as the CFL number from the default 1.0 to 2.0, the minimum allowed height from the default 1 mm to 1 cm, and the maximum allowed speed to 1.0? (See the sketch after this list.)
  2. As you mentioned, the structure of a parallel ANUGA script is: (1) create the sequential domain on process 0, (2) distribute the sequential domain to sub-domains on all the processes, (3) run the evolve loop.
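For reference, point 1 amounts to something like the sketch below on a small stand-in domain; the set_CFL, set_minimum_allowed_height and set_maximum_allowed_speed method names are our reading of the Domain API and may differ between ANUGA versions.

```python
# Sketch of the parameter changes in point 1, on a small stand-in domain
# (the real run uses the large flood mesh).  Method names assumed from the
# ANUGA Domain API; they may differ between versions.
import anuga

domain = anuga.rectangular_cross_domain(10, 10)
domain.set_quantity('elevation', 0.0)
domain.set_quantity('stage', 0.0)

domain.set_CFL(2.0)                      # default 1.0 -> 2.0: larger timesteps, potentially less stable
domain.set_minimum_allowed_height(0.01)  # default 1 mm -> 1 cm: shallower water treated as dry
domain.set_maximum_allowed_speed(1.0)    # cap reconstructed speeds at 1.0 m/s
```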

In our scenario we are allocating around 90 nodes for the whole simulation. Can we optimise by doing step (1) and step (2) on a single high-memory node, dumping the data to files, and then starting the step (3) evolve loop on the full allocation? (See the sketch below.)

We keep updating the rainfall data at each 3-hour yield step in the evolve loop.
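Roughly, the split workflow we have in mind looks like the sketch below, assuming ANUGA's sequential_distribute_dump / sequential_distribute_load helpers and the Rate_operator behave as in the parallel and rain examples; the mesh, partition directory, process count and rainfall function are placeholders, not our production setup.

```python
# Sketch of the split workflow: do steps (1)+(2) once on a high-memory node,
# dump the partitioned sub-domains to files, then run step (3) under MPI.
# Helper names/arguments assumed from ANUGA's sequential_distribute examples;
# mesh, directory, process count and rainfall function are placeholders.
import anuga
from anuga import (sequential_distribute_dump, sequential_distribute_load,
                   myid, finalize)

PARTITION_DIR = 'Partitions'   # hypothetical directory for the dumped pieces
NUMPROCS = 8                   # number of MPI ranks the evolve run will use


def rainfall_rate(t):
    """Hypothetical rainfall intensity (m/s) as a function of time (s)."""
    return 1.0e-6 if t < 12 * 3600 else 0.0


def create_and_dump():
    """Steps (1)+(2): run once, without MPI, on the high-memory node."""
    domain = anuga.rectangular_cross_domain(100, 100)   # stand-in for the flood mesh
    domain.set_name('flood_domain')
    domain.set_quantity('elevation', 0.0)
    domain.set_quantity('stage', 0.0)
    sequential_distribute_dump(domain, numprocs=NUMPROCS,
                               partition_dir=PARTITION_DIR, verbose=True)


def load_and_evolve():
    """Step (3): run under 'mpiexec -np 8' on the compute nodes."""
    domain = sequential_distribute_load(filename='flood_domain',
                                        partition_dir=PARTITION_DIR)
    reflective = anuga.Reflective_boundary(domain)
    domain.set_boundary({'left': reflective, 'right': reflective,
                         'top': reflective, 'bottom': reflective})

    # Time-dependent rainfall applied over the whole domain via a rate operator.
    anuga.Rate_operator(domain, rate=rainfall_rate)

    # Yield every 3 hours, where new rainfall forecasts could also be read in.
    for t in domain.evolve(yieldstep=3 * 3600, finaltime=24 * 3600):
        if myid == 0:
            domain.print_timestepping_statistics()
    finalize()
```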