ESCOMP / CTSM

Community Terrestrial Systems Model (includes the Community Land Model of CESM)
http://www.cesm.ucar.edu/models/cesm2.0/land/

Non-serial versions of tests using `5x5_amazon` failing `RUN` #2423

Closed glemieux closed 3 weeks ago

glemieux commented 8 months ago

Brief summary of bug

Since ctsm5.1.dev173, mpibind appears to fail for the 5x5_amazon resolution when a case is run with full MPI (i.e. not mpi-serial). Originally posted at https://github.com/NCAR/mpibind/issues/5.

General bug information

CTSM version you are using: ctsm5.1.dev173

Does this bug cause significantly incorrect results in the model's science? [Yes / No] Run fails so no assessment possible

Details of bug

This was discovered when running the FatesColdSeedDispersal test while generating new fates baselines for the dev173 update. I was also able to replicate the failure using a non-serial MPI version of the hillslope clm-only test. The run fails immediately, producing a cesm.log entry noting that one of the core selections is invalid (see below). It also produced an mpibind.log, which I hadn't noticed before.

This prompted me to compare dev172 and dev173 runs for non-serial MPI versions of the hillslope test that use 5x5_amazon. The dev172 version passes, but I noticed that the preview_run output is different:

dev172:

    MPIRUN (job=case.test):
      mpiexec  --label  --line-buffer  -n 5 /glade/derecho/scratch/glemieux/ctsm-tests/tests_mpi-nonserial-check-clm_hillslope/SMS_D_Ld5.5x5_amazon.I1850Clm51Bgc.derecho_gnu.clm-HillslopeC.mpi-nonserial-check-clm_hillslope/bld/cesm.exe   >> cesm.log.$LID 2>&1 

dev173:

    MPIRUN (job=case.test):
      mpibind  --label  --line-buffer  --  /glade/derecho/scratch/glemieux/ctsm-tests/tests_mpi-nonserial-check-clm_hillslope-dev173/SMS_D_Ld5.5x5_amazon.I1850Clm51Bgc.derecho_gnu.clm-HillslopeC.mpi-nonserial-check-clm_hillslope-dev173/bld/cesm.exe   >> cesm.log.$LID 2>&1 

What is odd to me is that mpibind was brought in with dev172 via ccs_config_cesm0.0.92, so why is the call not activated for that tag? Why is it only invoked starting with dev173?
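
For anyone repeating this comparison across tags, here is a minimal Python sketch that pulls the launcher out of saved preview_run output. It assumes the output keeps the shape shown above (an `MPIRUN` header followed by the command line). The function name `launcher` and the `/path/to/bld/cesm.exe` placeholder are hypothetical, not part of CIME.

```python
def launcher(preview_run_output: str) -> str:
    """Return the first token of the command that follows an MPIRUN header."""
    lines = preview_run_output.splitlines()
    for i, line in enumerate(lines):
        if line.strip().startswith("MPIRUN"):
            # The command is the first non-empty line after the header.
            for cmd in lines[i + 1:]:
                if cmd.strip():
                    return cmd.split()[0]
    raise ValueError("no MPIRUN command found")


# Shortened samples; the executable path is a placeholder, not the real one.
DEV172 = """MPIRUN (job=case.test):
  mpiexec  --label  --line-buffer  -n 5 /path/to/bld/cesm.exe
"""
DEV173 = """MPIRUN (job=case.test):
  mpibind  --label  --line-buffer  --  /path/to/bld/cesm.exe
"""

print(launcher(DEV172), launcher(DEV173))
```

Run against the real preview_run output from each tag, this makes the switch from `mpiexec` to `mpibind` easy to spot without diffing the long scratch paths.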

Important details of your setup / configuration so we can reproduce the bug

You can view the SRCROOT_GIT_STATUS files for both dev173 and dev172 hillslope runs here, respectively: /glade/u/home/glemieux/scratch/ctsm-tests/tests_mpi-nonserial-check-clm_hillslope-dev173 /glade/u/home/glemieux/scratch/ctsm-tests/tests_mpi-nonserial-check-clm_hillslope

Important output or errors that show the problem

cesm.log

    dec0417.hsn.de.hpc.ucar.edu 4: <65-65> is invalid
    dec0417.hsn.de.hpc.ucar.edu 4: libnuma: Warning: cpu argument 65-65 is out of range
    dec0417.hsn.de.hpc.ucar.edu 4:
    dec0417.hsn.de.hpc.ucar.edu 4: usage: numactl [--all | -a] [--balancing | -b] [--interleave= | -i <nodes>]
    dec0417.hsn.de.hpc.ucar.edu 4:                [--preferred= | -p <node>] [--physcpubind= | -C <cpus>]
    dec0417.hsn.de.hpc.ucar.edu 4:                [--cpunodebind= | -N <nodes>] [--membind= | -m <nodes>]
    dec0417.hsn.de.hpc.ucar.edu 4:                [--localalloc | -l] command args ...
    dec0417.hsn.de.hpc.ucar.edu 4:        numactl [--show | -s]
    dec0417.hsn.de.hpc.ucar.edu 4:        numactl [--hardware | -H]
    dec0417.hsn.de.hpc.ucar.edu 4:        numactl [--length | -L <length>] [--offset | -o <offset>] [--shmmode | -M <shmmode>]
    dec0417.hsn.de.hpc.ucar.edu 4:                [--strict | -t]
    dec0417.hsn.de.hpc.ucar.edu 4:                [--shmid | -I <id>] --shm | -S <shmkeyfile>
    dec0417.hsn.de.hpc.ucar.edu 4:                [--shmid | -I <id>] --file | -f <tmpfsfile>
    dec0417.hsn.de.hpc.ucar.edu 4:                [--huge | -u] [--touch | -T]
    dec0417.hsn.de.hpc.ucar.edu 4:                memory policy [--dump | -d] [--dump-nodes | -D]
    dec0417.hsn.de.hpc.ucar.edu 4:
    dec0417.hsn.de.hpc.ucar.edu 4: memory policy is --interleave | -i, --preferred | -p, --membind | -m, --localalloc | -l
    dec0417.hsn.de.hpc.ucar.edu 4: <nodes> is a comma delimited list of node numbers or A-B ranges or all.
    dec0417.hsn.de.hpc.ucar.edu 4: Instead of a number a node can also be:
    dec0417.hsn.de.hpc.ucar.edu 4:   netdev:DEV the node connected to network device DEV
    dec0417.hsn.de.hpc.ucar.edu 4:   file:PATH  the node the block device of path is connected to
    dec0417.hsn.de.hpc.ucar.edu 4:   ip:HOST    the node of the network device host routes through
    dec0417.hsn.de.hpc.ucar.edu 4:   block:PATH the node of block device path
    dec0417.hsn.de.hpc.ucar.edu 4:   pci:[seg:]bus:dev[:func] The node of a PCI device
    dec0417.hsn.de.hpc.ucar.edu 4: <cpus> is a comma delimited list of cpu numbers or A-B ranges or all
    dec0417.hsn.de.hpc.ucar.edu 4: all ranges can be inverted with !
    dec0417.hsn.de.hpc.ucar.edu 4: all numbers and ranges can be made cpuset-relative with +
    dec0417.hsn.de.hpc.ucar.edu 4: the old --cpubind argument is deprecated.
    dec0417.hsn.de.hpc.ucar.edu 4: use --cpunodebind or --physcpubind instead
    dec0417.hsn.de.hpc.ucar.edu 4: use --balancing | -b to enable Linux kernel NUMA balancing
    dec0417.hsn.de.hpc.ucar.edu 4: for the process if it is supported by kernel
    dec0417.hsn.de.hpc.ucar.edu 4: <length> can have g (GB), m (MB) or k (KB) suffixes
    dec0417.hsn.de.hpc.ucar.edu 3: <64-64> is invalid
    dec0417.hsn.de.hpc.ucar.edu 3: libnuma: Warning: cpu argument 64-64 is out of range
    dec0417.hsn.de.hpc.ucar.edu 3:
    dec0417.hsn.de.hpc.ucar.edu 3: usage: numactl [--all | -a] [--balancing | -b] [--interleave= | -i <nodes>]
    dec0417.hsn.de.hpc.ucar.edu 3:                [--preferred= | -p <node>] [--physcpubind= | -C <cpus>]
    dec0417.hsn.de.hpc.ucar.edu 3:                [--cpunodebind= | -N <nodes>] [--membind= | -m <nodes>]
    dec0417.hsn.de.hpc.ucar.edu 3:                [--localalloc | -l] command args ...
    dec0417.hsn.de.hpc.ucar.edu 3:        numactl [--show | -s]
    dec0417.hsn.de.hpc.ucar.edu 3:        numactl [--hardware | -H]
    dec0417.hsn.de.hpc.ucar.edu 3:        numactl [--length | -L <length>] [--offset | -o <offset>] [--shmmode | -M <shmmode>]
    dec0417.hsn.de.hpc.ucar.edu 3:                [--strict | -t]
    dec0417.hsn.de.hpc.ucar.edu 3:                [--shmid | -I <id>] --shm | -S <shmkeyfile>
    dec0417.hsn.de.hpc.ucar.edu 3:                [--shmid | -I <id>] --file | -f <tmpfsfile>
    dec0417.hsn.de.hpc.ucar.edu 3:                [--huge | -u] [--touch | -T]
    dec0417.hsn.de.hpc.ucar.edu 3:                memory policy [--dump | -d] [--dump-nodes | -D]
    dec0417.hsn.de.hpc.ucar.edu 3:
    dec0417.hsn.de.hpc.ucar.edu 3: memory policy is --interleave | -i, --preferred | -p, --membind | -m, --localalloc | -l
    dec0417.hsn.de.hpc.ucar.edu 3: <nodes> is a comma delimited list of node numbers or A-B ranges or all.
    dec0417.hsn.de.hpc.ucar.edu 3: Instead of a number a node can also be:
    dec0417.hsn.de.hpc.ucar.edu 3:   netdev:DEV the node connected to network device DEV
    dec0417.hsn.de.hpc.ucar.edu 3:   file:PATH  the node the block device of path is connected to
    dec0417.hsn.de.hpc.ucar.edu 3:   ip:HOST    the node of the network device host routes through
    dec0417.hsn.de.hpc.ucar.edu 3:   block:PATH the node of block device path
    dec0417.hsn.de.hpc.ucar.edu 3:   pci:[seg:]bus:dev[:func] The node of a PCI device
    dec0417.hsn.de.hpc.ucar.edu 3: <cpus> is a comma delimited list of cpu numbers or A-B ranges or all
    dec0417.hsn.de.hpc.ucar.edu 3: all ranges can be inverted with !
    dec0417.hsn.de.hpc.ucar.edu 3: all numbers and ranges can be made cpuset-relative with +
    dec0417.hsn.de.hpc.ucar.edu 3: the old --cpubind argument is deprecated.
    dec0417.hsn.de.hpc.ucar.edu 3: use --cpunodebind or --physcpubind instead
    dec0417.hsn.de.hpc.ucar.edu 3: use --balancing | -b to enable Linux kernel NUMA balancing
    dec0417.hsn.de.hpc.ucar.edu 3: for the process if it is supported by kernel
    dec0417.hsn.de.hpc.ucar.edu 3: <length> can have g (GB), m (MB) or k (KB) suffixes
    dec0417.hsn.de.hpc.ucar.edu: rank 3 exited with code 1
    dec0417.hsn.de.hpc.ucar.edu: rank 0 died from signal 15

mpibind.log

Chunk info
  1:ncpus=5:mpiprocs=5:ompthreads=1:mem=230GB:Qlist=cpu:ngpus=0
-- -- -- --
MPI exec line:
  mpiexec --label --line-buffer -n 5 -ppn 5 --cpu-bind none -env OMP_NUM_THREADS=1 /glade/u/apps/opt/mpitools/mpibind/cpu_bind /glade/derecho/scratch/glemieux/ctsm-tests/tests_mpi-nonserial-check-clm_hillslope-dev173/SMS_D_Ld5.5x5_amazon.I1850Clm51Bgc.derecho_gnu.clm-HillslopeC.mpi-nonserial-check-clm_hillslope-dev173/bld/cesm.exe 
-- -- -- --
Binding Report:
rank: 0, cores: 0-0
rank: 1, cores: 1-1
rank: 3, cores: 64-64
rank: 4, cores: 65-65
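
To make the link between the two logs explicit, here is a minimal Python sketch that cross-checks the mpibind binding report against the libnuma warnings in cesm.log. It assumes the number after the hostname in cesm.log is the MPI rank added by `--label`, and that both logs keep the line shapes shown above. The function names are hypothetical helpers, not part of mpibind or CIME.

```python
import re

# Patterns follow the line shapes in the logs above; other mpibind or
# numactl versions may format these lines differently (assumption).
BINDING_RE = re.compile(r"rank:\s*(\d+),\s*cores:\s*(\d+)-(\d+)")
WARNING_RE = re.compile(
    r"\s(\d+): libnuma: Warning: cpu argument (\d+)-(\d+) is out of range"
)


def parse_ranges(pattern, text):
    """Map each rank to the (first, last) core captured by the pattern."""
    return {int(r): (int(lo), int(hi)) for r, lo, hi in pattern.findall(text)}


def rejected_bindings(mpibind_log, cesm_log):
    """Return ranks whose mpibind core range was rejected by libnuma."""
    bindings = parse_ranges(BINDING_RE, mpibind_log)
    rejected = parse_ranges(WARNING_RE, cesm_log)
    return sorted(
        rank for rank, cores in rejected.items() if bindings.get(rank) == cores
    )


# Excerpts copied from the two logs above.
MPIBIND_LOG = """rank: 0, cores: 0-0
rank: 1, cores: 1-1
rank: 3, cores: 64-64
rank: 4, cores: 65-65
"""
CESM_LOG = """dec0417.hsn.de.hpc.ucar.edu 4: libnuma: Warning: cpu argument 65-65 is out of range
dec0417.hsn.de.hpc.ucar.edu 3: libnuma: Warning: cpu argument 64-64 is out of range
"""

print(rejected_bindings(MPIBIND_LOG, CESM_LOG))  # prints [3, 4]
```

For the excerpts above, the ranks that numactl rejects (3 and 4) are exactly the ones mpibind assigned to cores 64 and 65.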

glemieux commented 8 months ago

@ekluzek given the feedback from https://github.com/NCAR/mpibind/issues/5#issuecomment-1998714383, should I make an issue in the ccs_config_cesm repo?

ekluzek commented 8 months ago

@glemieux yes go ahead and do that.

ekluzek commented 8 months ago

@glemieux note this also relates to another problem I ran into:

https://github.com/ESCOMP/CTSM/pull/2427#issuecomment-2016048650

where the new use of mpibind needed me to do something different for mksurfdata_esmf.

ekluzek commented 8 months ago

The ccs_config issue is here:

https://github.com/ESMCI/ccs_config_cesm/issues/142

glemieux commented 8 months ago

During the ctsm stand-up meeting today we came up with the following actions for the time being:

  • [x] Add a non-serial 5x5_amazon test to aux_clm on derecho and to the expected failure list referencing this issue.
  • [x] Temporarily convert the FatesColdSeedDisp testmod to run on f10

It was also noted that this doesn't seem to be an issue for izumi

Completed these action items per #2436.

samsrabin commented 2 months ago

It seems like the non-serial 5x5_amazon test (SMS_D_Ld5.5x5_amazon.I1850Clm60Bgc.derecho_gnu.clm-HillslopeC) is now passing as of ctsm5.2.027. Should this issue be closed and that test removed from the expected failure list?

samsrabin commented 2 months ago

Actually, it would probably be worth checking whether the original test you noticed this with—the FatesColdSeedDispersal one—still fails.

ekluzek commented 2 months ago

@samsrabin good question on the removal of the MPI version of this test. The utility of the MPI test is to check that MPI works for a simple regional grid, as a way to make sure small regional cases work with MPI in general. It also makes sure you can use MPI for a grid that's only a fraction of a node.

At this point we also test the nldas2 grid, which is a larger regional grid, so we could call that sufficient.

The advantage here, though, is that 5x5_amazon is a simple, fast, small grid for testing. So I like the idea of keeping it for at least some of our testing, if not this specific test for fates seed dispersal.

samsrabin commented 2 months ago

Thanks, Erik. I am indeed planning (#2434) to keep a serial 5x5_amazon hillslope test in the aux_clm suite, but I've moved the parallel version @glemieux added to just be in the special hillslope suite.

wwieder commented 3 weeks ago

@glemieux will check on this to see if it's still an issue

glemieux commented 3 weeks ago

I can confirm that the original issue is resolved. I reinstated the 5x5_amazon test that was changed via e9fb075891f90f97159071d03b376ed44c451bfc and the run passed without issue. I'll make another issue to reinstate the test.

Test case: /glade/u/home/glemieux/scratch/ctsm-tests/tests_1031-111717de