E3SM-Project/transport_se

The tracer-transport mini-app (transport_se) is a simplified version of HOMME designed to facilitate experimentation and performance improvements.

To configure and build the mini-app:

1) cp configure.sh ~/work/ copy the configure script to a directory of your choosing 2) vi ~/work/configure.sh customize the USER-DEFINED PARAMETERS using a text editor 3) ~/work/configure.sh execute the script to configure and build the mini-app

To test the mini-app:

4) cd ~/work/bld/test navigate to the test directory 5) make install install the namelists and test scripts 6) vi run_ne8_test.sh customize the mpirun command, tasks per node, etc... 7) qsub run_ne8_test.sh submit a test to the debug queue (or execute it on a laptop) 8) tail -n 40 ./out_ne8jobid examine timing and error norm results 9) display ~/work/run/image* examine plots of tracer cross-sections

Tests: run_ne8_test.sh: ultra low-res test for quick debug/check
run_ne30_test.sh 1 degree test with 4 tracers for verification run_ne120_test.sh 1/4 degree test for additonal, optional verification at high-res This test is expensive and often sits in the que for 1 day.

run_ne8_perf.sh run_ne30_perf.sh run_ne120_perf.sh Test case for performance testing DCMIP1-1 for 1 hour of model time. (we might have to increase run time ) output disabled. most diagnostics disabled tracers increased to 35

Recommendations:

Step 1. Use "run_ne8_test.sh" to get the code up and running

Step 2. Use "run_ne30_test.sh" to verify the results are correct. L1, L2 and Linf errors, overshoot and undershoots should agree to 2-3 digits. One should also check that the tracer mass is conserved by looking at the "Q, Q diss" values in the stdout. For additional verification, compare the PDF plots of the solution with reference plots on the ACME confluence page, "Transport Mini-app Setup and Test Results"

The results should be BFB when run on different numbers of MPI tasks and/or openMP threads. We should add a test for this.

Step 3. Use "run_ne120_perf.sh" as a starting point for performance testing.

Note: This mini-app is using DCMIP tests 1-1 and 1-2, modified to use the ACME 72L configuration instead of the DCMIP equally spaced levels. The performance characteristics will be very faithful to the HOMME dycore as used by ACME. However, the ACME 72L configuration has much larger errors since there are fewer levels where the tracers are located. In addition, surface pressure is not treated correctly - for consistency with prescribed winds and disabled dynamics, we evolve the surface pressure using the implicit surface pressure from the tracer advection scheme. Thus the simulation output of this mini-app should not be used for DCMIP test case results.

Typical results for run_ne8_test.sh on NERSC:Hopper

        name     procs  threads  count         walltotal     wallmax (proc   thrd  )   wallmin (proc   thrd  )

DCMIP1-1: prim_run 48 48 3.456000e+04 1.853381e+03 38.974 ( 0 0) 25.267 ( 1 0) DCMIP1-2: prim_run 48 48 2.880000e+03 9.024055e+01 1.892 ( 32 0) 1.660 ( 1 0)

DCMIP1-1: L1=0.564863 L2= 0.527211 Linf= 0.478169 q_max= 0.648135 q_min= -0.0615289 DCMIP1-2: L1=0.368728 L2= 0.558725 Linf= 1.040230 q_max= 0.991667 q_min= -5.59564e-05

Used Resources: cput=00:00:07,energy_used=0,mem=12220kb,vmem=123196kb,walltime=00:01:29

Updated 2015-6-27 With rsplit=1 and limiter=8 (monotone limiter) on Edison DCMIP 1-1: L1=0.529721 L2= 0.637361 Linf= 0.616722 q_max= 0.459436 q_min= -2.24628e-08 DCMIP 1-2: L1=0.386038 L2= 0.556078 Linf= 0.914061 q_max= 0.987406 q_min= -2.64163e-06 On Darwin: DCMIP 1-1: L1=0.529721 L2= 0.637360 Linf= 0.616721 q_max= 0.459436 q_min= -2.246253e-08 DCMIP 1-2: L1=0.386038 L2= 0.556077 Linf= 0.914060 q_max= 0.987405 q_min= -2.642738e-06

Updated 2015-7-14 (new direct addressing limiter) DCMIP 1-1: L1=0.529676 L2= 0.637311 Linf= 0.616969 q_max= 0.459701 q_min= -9.78542e-09 DCMIP 1-2: L1=0.386039 L2= 0.556071 Linf= 0.91406 q_max= 0.987389 q_min= -2.77826e-06

Updated 2015-11-27 (rsplit=3, to match CAM) DCMIP 1-1: L1=0.529725 L2=0.637015 Linf=0.610602 q_max=0.458165 q_min= -9.533290e-14 DCMIP 1-2: L1=0.263420 L2=0.388462 Linf=0.638387 q_max=0.989853 q_min= -1.274771e-08

Updated 2015-11-27 (rsplit=3, ACME 72 level configuration, skybridge) DCMIP 1-1: L1=0.578151 L2=0.865526 Linf=0.883168 q_max=0.187204 q_min= -3.207090e-13 DCMIP 1-2: L1=0.307665 L2=0.622099 Linf=0.839133 q_max=0.813105 q_min= -9.385639e-06

Typical results for run_ne30_test.sh on NERSC:Hopper

        name      procs  threads  count         walltotal     wallmax (proc   thrd  )   wallmin (proc   thrd  )

DCMIP1-1 prim_run 216 216 7.464960e+05 8.528053e+04 394.886 ( 6 0) 393.686 ( 1 0) DCMIP1-2 prim_run 216 216 6.220800e+04 5.899456e+03 27.387 ( 202 0) 25.604 ( 1 0)

DCMIP1-1: L1=0.1810760 L2= 0.209121 Linf= 0.317952 q_max= 1.014850 q_min= -0.0410253 DCMIP1-2: L1=0.0799994 L2= 0.150355 Linf= 0.297211 q_max= 0.977932 q_min= -5.7599e-13

Used Resources: cput=00:00:09,energy_used=0,mem=97628kb,vmem=208712kb,walltime=00:08:41

Updated 2015-6-27 With rsplit limiter=8 (monotone limiter) on Edison DCMIP 1-1: L1=0.138421 L2= 0.202415 Linf= 0.298503 q_max= 0.974222 q_min= -7.4405e-14 DCMIP 1-2: L1=0.0578899 L2= 0.123954 Linf= 0.270832 q_max= 0.979736 q_min= -1.26331e-12

Updated 2015-7-14 (new direct addressing limiter) DCMIP 1-1: L1=0.138438 L2= 0.202473 Linf= 0.298522 q_max= 0.974238 q_min= -7.27863e-14 DCMIP 1-2: L1=0.0578902 L2= 0.123954 Linf= 0.270832 q_max= 0.979733 q_min= -4.54058e-14

Updated 2015-11-27 (rsplit=3, to match CAM) DCMIP 1-1: L1=0.138438 L2= 0.202473 Linf= 0.298522 q_max= 0.97423 q_min= -7.633613e-14 DCMIP 1-2: L1=0.057890 L2= 0.123954 Linf= 0.270831 q_max= 0.97973 q_min= -4.540577e-14

Updated 2015-11-27 (rsplit=3, ACME 72 level config, skybridge) DCMIP 1-1: L1=0.490013 L2=0.789052 Linf=0.918454 q_max=0.445141 q_min= -3.559994e-11 DCMIP 1-2: L1=0.121783 L2=0.361005 Linf=1.092784 q_max=0.836177 q_min= -3.671997e-05

Typical results for run_ne120_test.sh on NERSC:Hopper

Updated 2015-6-27 With rsplit limiter=8 (monotone limiter) on Edison

DCMIP1-1 prim_run 960 960 4.423680e+06 2.117038e+06 2205.253 ( 709 0) 2205.234 ( 259 0) DCMIP1-2 prim_run 960 960 3.686400e+05 1.279739e+05 133.365 ( 662 0) 133.245 ( 443 0)

DCMIP 1-1: L1=0.0794909 L2= 0.139211 Linf= 0.306975 q_max= 0.984071 q_min= -1.84532e-13 DCMIP 1-2: L1=0.0363203 L2= 0.0987049 Linf= 0.277493 q_max= 0.994146 q_min= -5.89671e-14

Updated 2015-7-14 (new direct addressing limiter) DCMIP 1-1: L1=0.0794911 L2= 0.139212 Linf= 0.30697 q_max= 0.984063 q_min= -1.76014e-13 DCMIP 1-2: L1=0.0363203 L2= 0.0987048 Linf= 0.277493 q_max= 0.994146 q_min= -5.06671e-14

Updated 2015-11-27 (rsplit=3, ACME 72 level config) (something is wrong with the DCMIP 1-1 configuration on 72L, but it should still be ok for performace ) DCMIP 1-1: L1=0.479398 L2=0.782613 Linf=0.922696 q_max=0.501561 q_min= -1.070570e-09 DCMIP 1-2: L1=0.081287 L2=0.264887 Linf=0.591157 q_max=0.959530 q_min= -2.795861e-09

Typical results for run_ne120_perf.sh on Edison:

A x B x C, with A = nodes B = mpitasks_per_node C = threads_per_mpitask

40 node cases. 2160 elements per node. 1h simulation time

                                 2015-6-27            2015-11-29
                                 50 tracers            35 tracers
                                 64L                   72L
                                 rsplit=1              rsplit=3

40x24x1 prim_run 42.643 prim_advec_tracers_remap_rk2 53.343 37.195 vertical_remap 1.800 1.527

40x12x2
prim_run 43.306
prim_advec_tracers_remap_rk2 56.085 37.826
vertical_remap 1.801 1.518

40x6x4
prim_run 44.176 prim_advec_tracers_remap_rk2 63.195 38.644
vertical_remap 1.769 1.473

40x2x12
prim_run 46.040 prim_advec_tracers_remap_rk2 88.949 40.390 vertical_remap 1.765 1.492

E3SM-Project / transport_se

readme