amrvac / AGILE-experimental

MPI-AMRVAC: A Parallel Adaptive Mesh Refinement Framework
https://amrvac.org/
GNU General Public License v3.0

Difference in baseline performance on Snellius #14

Closed laurasootes closed 4 months ago

laurasootes commented 4 months ago

@oporth There seems to be a difference in the baseline performance we obtain. Both runs below are on Snellius, on 1 node, using the nvidia jobfile.

The first run (22 May):





Reading amrvac.par

Output type | tsavestart | dtsave | ditsave | itsave(1) | tsave(1)
log | 0.000E+00 | | 10 | 0 |
normal | 0.000E+00 | * | ** | ** | *
slice | 0.000E+00 | * | ** | ** | *
collapsed | 0.000E+00 | * | ** | ** | *
analysis | 0.000E+00 | * | ** | ** | *

Warning: coordinate system is not specified!
call set_coordinate_system in usr_init in mod_usr.t
Now use Cartesian coordinate
Domain size (cells): 4096 4096
Level one dx: 0.244E-03 0.244E-03
Refine estimation: Lohner's scheme
restart_from_file: undefined
converting: F

3D HD KH
--assuming y ranging from 0-1!
--density ratio: 10.00000000000000
--kx: 12.56637061435917
--vextra: 0.000000000000000

Startup phase took : 0.425 sec

Start integrating, print status every 3.00E+01 seconds

it time dt wc-time(s)

0 0.0000E+00 1.0105E-05 1.0483E-01

51 2.5608E-03 5.3181E-05 3.0196E+01

Total timeloop took : 58.288 sec
Time spent on AMR : 0.000 sec Percentage: 0.00 %
Time spent on IO in loop : 0.428 sec Percentage: 0.73 %
Time spent on ghost cells : 15.626 sec Percentage: 26.81 %
Time spent on computing : 42.234 sec Percentage: 72.46 %
Cells updated / proc / sec : 4.497E+05

Saving visual data. Coordinate directions and variable names are:
1 X
2 Y
3 rho
4 v1
5 v2
6 p
time = 5.1665764415896450E-003

Total time spent on IO : 31.524 sec
Total timeintegration took : 89.384 sec

100 5.167E-03 5.318E-05 5.829E+01


Finished AMRVAC in : 89.809 sec

JOB STATISTICS

Job ID: 6331449
Cluster: snellius
User/Group: lootes/lootes
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 192
CPU Utilized: 05:16:22
CPU Efficiency: 71.64% of 07:21:36 core-walltime
Job Wall-clock time: 00:02:18
Memory Utilized: 336.64 GB (estimated maximum)
Memory Efficiency: 100.19% of 336.00 GB (336.00 GB/node)

The second run (28 May):




Reading amrvac.par

Output type | tsavestart | dtsave | ditsave | itsave(1) | tsave(1)
log | 0.000E+00 | | 10 | 0 |
normal | 0.000E+00 | * | ** | ** | *
slice | 0.000E+00 | * | ** | ** | *
collapsed | 0.000E+00 | * | ** | ** | *
analysis | 0.000E+00 | * | ** | ** | *

Warning: coordinate system is not specified!
call set_coordinate_system in usr_init in mod_usr.t
Now use Cartesian coordinate
Domain size (cells): 4096 4096
Level one dx: 0.244E-03 0.244E-03
Refine estimation: Lohner's scheme
restart_from_file: undefined
converting: F

3D HD KH
--assuming y ranging from 0-1!
--density ratio: 10.00000000000000
--kx: 12.56637061435917
--vextra: 0.000000000000000

Startup phase took : 0.354 sec

Start integrating, print status every 3.00E+01 seconds

it time dt wc-time(s)

0 0.0000E+00 1.0105E-05 4.9008E-03

Total timeloop took : 13.279 sec
Time spent on AMR : 0.000 sec Percentage: 0.00 %
Time spent on IO in loop : 0.260 sec Percentage: 1.96 %
Time spent on ghost cells : 0.821 sec Percentage: 6.19 %
Time spent on computing : 12.197 sec Percentage: 91.86 %
Cells updated / proc / sec : 1.974E+06

Saving visual data. Coordinate directions and variable names are:
1 X
2 Y
3 rho
4 v1
5 v2
6 p
time = 5.1665764415896450E-003

Total time spent on IO : 20.906 sec
Total timeintegration took : 33.925 sec

100 5.167E-03 5.318E-05 1.328E+01


Finished AMRVAC in : 34.279 sec

JOB STATISTICS

Job ID: 6411180
Cluster: snellius
User/Group: lootes/lootes
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 192
CPU Utilized: 02:17:00
CPU Efficiency: 54.89% of 04:09:36 core-walltime
Job Wall-clock time: 00:01:18
Memory Utilized: 294.44 GB (estimated maximum)
Memory Efficiency: 87.63% of 336.00 GB (336.00 GB/node)
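
To make the gap between the two runs easier to read off, here is a small Python sketch that compares the timing figures from the two logs above. The numbers are copied from the logs; the dictionary keys and variable names are only illustrative and are not part of AMRVAC. It suggests the time loop is roughly 4.4x slower in the first run, with the ghost-cell phase slowed far more than the compute phase.

```python
# Timing figures copied from the two AMRVAC logs above.
# Key names are illustrative, not AMRVAC identifiers.
first_run = {   # 22 May
    "timeloop_s": 58.288,
    "ghost_cells_s": 15.626,
    "computing_s": 42.234,
    "cells_per_proc_per_s": 4.497e5,
}
second_run = {  # 28 May
    "timeloop_s": 13.279,
    "ghost_cells_s": 0.821,
    "computing_s": 12.197,
    "cells_per_proc_per_s": 1.974e6,
}

# Slowdown of the first run relative to the second, per phase.
slowdown = {
    key: first_run[key] / second_run[key]
    for key in ("timeloop_s", "ghost_cells_s", "computing_s")
}

# Throughput ratio (second run over first run); this should match
# the time-loop slowdown, since the workload is identical.
throughput_ratio = (
    second_run["cells_per_proc_per_s"] / first_run["cells_per_proc_per_s"]
)

for key, value in slowdown.items():
    print(f"{key:>14}: first run is {value:.2f}x slower")
print(f"throughput ratio (second/first): {throughput_ratio:.2f}")
```

The ghost-cell phase standing out in this comparison is only an observation from these two logs, not a diagnosis.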

laurasootes commented 4 months ago

The problem appears to be caused by Snellius maintenance and should be resolved once all nodes have been rebooted.

oporth commented 4 months ago

Just for completeness, here is the response from SURF:

Dear Dr. Porth,

After the latest maintenance, we experienced a performance degradation, which could be attributed to the energy efficiency monitoring system. The daemon has been disabled, and most nodes have been successfully rebooted. However, a few nodes are still in the 'drain' state and are awaiting reboot. Rebooted nodes should perform as before. Please check the status of the nodes with sinfo.

If you provide an estimate of SBUs losses (due to timeout or lower performance), we will compensate for that. We are sorry for the inconvenience this problem has caused.

Kind regards,
Stefan Wolfsheimer
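
As a starting point for the loss estimate SURF asks for, here is a rough Python sketch that turns the job statistics of the two runs above into extra core-hours. It assumes the 28 May run is the reference for normal performance and that the loss scales with wall-clock time times cores. The conversion from core-hours to SBUs is not stated in this thread, so it is left out. The helper name `hms_to_seconds` is just for illustration.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert an 'HH:MM:SS' string, as printed in the job statistics, to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds

CORES_PER_NODE = 192  # "Cores per node" in both job summaries

slow_wall_s = hms_to_seconds("00:02:18")  # job 6331449 (22 May)
fast_wall_s = hms_to_seconds("00:01:18")  # job 6411180 (28 May), assumed baseline

# Core-walltime is wall-clock time multiplied by the allocated cores;
# this reproduces the 07:21:36 and 04:09:36 figures in the summaries.
slow_core_s = slow_wall_s * CORES_PER_NODE
fast_core_s = fast_wall_s * CORES_PER_NODE

extra_core_hours = (slow_core_s - fast_core_s) / 3600
print(f"extra core-hours for this one job: {extra_core_hours:.1f}")
```

This covers a single short benchmark job only; a real claim would sum the same difference over every affected job, including any that hit their time limit.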