Closed glemieux closed 3 weeks ago
@ekluzek given the feedback from https://github.com/NCAR/mpibind/issues/5#issuecomment-1998714383, should I make an issue in the ccs_config_cesm repo?
@glemieux yes go ahead and do that.
@glemieux note this also relates to another problem I ran into:
https://github.com/ESCOMP/CTSM/pull/2427#issuecomment-2016048650
where the new use of mpibind needed me to do something different for mksurfdata_esmf.
The ccs_config issue is here:
During the ctsm stand-up meeting today we came up with the following actions for the time being:
- [x] Add a non-serial `5x5_amazon` test to `aux_clm` on `derecho` and to the expected failure list referencing this issue.
- [x] Temporarily convert the `FatesColdSeedDisp` testmod to run on `f10`
It was also noted that this doesn't seem to be an issue for `izumi`.
Completed these action items per #2436.
It seems like the non-serial `5x5_amazon` test (`SMS_D_Ld5.5x5_amazon.I1850Clm60Bgc.derecho_gnu.clm-HillslopeC`) is now passing as of ctsm5.2.027. Should this issue be closed and that test removed from the expected failure list?
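For anyone unfamiliar with how the test name above is put together, here is a rough sketch of how it could be split into its parts. It assumes the usual dot-separated CIME layout (test type and modifiers, grid, compset, machine_compiler, testmods) and that neither the grid nor the compset alias contains a dot. The function name `parse_cime_test_name` and the dictionary keys are made up for illustration and are not part of CIME.

```python
def parse_cime_test_name(name):
    """Split a CIME-style test name into labeled pieces.

    Assumption: the name has exactly four or five dot-separated fields and
    no field (e.g. the grid alias) contains a dot itself.
    """
    fields = name.split(".")
    if len(fields) not in (4, 5):
        raise ValueError(f"unexpected number of fields in {name!r}")

    # First field: test type followed by optional underscore-separated modifiers.
    test_type, *modifiers = fields[0].split("_")

    # Fourth field: machine and compiler, separated by the last underscore.
    machine, compiler = fields[3].rsplit("_", 1)

    return {
        "test_type": test_type,
        "modifiers": modifiers,
        "grid": fields[1],
        "compset": fields[2],
        "machine": machine,
        "compiler": compiler,
        # The testmods field is optional in this sketch.
        "testmods": fields[4] if len(fields) == 5 else None,
    }


if __name__ == "__main__":
    info = parse_cime_test_name(
        "SMS_D_Ld5.5x5_amazon.I1850Clm60Bgc.derecho_gnu.clm-HillslopeC"
    )
    for key, value in info.items():
        print(f"{key}: {value}")
```

Read this way, the test in question is a debug (`D`), 5-day (`Ld5`) smoke test on the `5x5_amazon` grid, built with the gnu compiler on derecho.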
Actually, it would probably be worth checking whether the original test you noticed this with (the `FatesColdSeedDispersal` one) still fails.
@samsrabin good question on the removal of the MPI version of this test. The utility of the MPI test is to check that MPI works for a simple regional grid, as a way to make sure small regional cases work with MPI in general. It also makes sure you can use MPI for a grid that's only a fraction of a node.
At this point we also test the nldas2 grid, which is a larger regional grid, so we could call that sufficient.
The advantage here though is that 5x5 amazon is a simple, fast, small grid for testing. So I like the idea of keeping it for at least some of our testing, if not this specific test for fates seed dispersal.
Thanks, Erik. I am indeed planning (#2434) to keep a serial `5x5_amazon` hillslope test in the `aux_clm` suite, but I've moved the parallel version @glemieux added to just be in the special `hillslope` suite.
@glemieux will check on this to see if it's still an issue
I can confirm that the original issue is resolved. I reinstated the `5x5_amazon` test that was changed via e9fb075891f90f97159071d03b376ed44c451bfc and the run passed without issue. I'll make another issue to reinstate the test.
Test case: /glade/u/home/glemieux/scratch/ctsm-tests/tests_1031-111717de
Brief summary of bug
`mpibind` seems to have an issue with `5x5_amazon` resolutions when run with full mpi (i.e. no `MPI-serial`) since ctsm5.1.dev173. Originally posted at https://github.com/NCAR/mpibind/issues/5.
General bug information
CTSM version you are using: ctsm5.1.dev173
Does this bug cause significantly incorrect results in the model's science? [Yes / No] Run fails so no assessment possible
Details of bug
This was discovered when running the `FatesColdSeedDispersal` test while generating new fates baselines for the dev173 update. I was able to also replicate this failure using a non-serial MPI version of the `hillslope` clm-only test. The run immediately fails producing a cesm.log entry with a note about one of the core selections being `invalid` (see below). It also produced an mpibind.log that I hadn't noticed before.
This prompted me to compare dev172 and dev173 runs for non-serial MPI versions of the `hillslope` test that use `5x5_amazon`. The dev172 version passes, but I noticed that the `preview_run` output is different:
dev172:
dev173:
What is odd to me is that `mpibind` was brought in dev172 via `ccs_config_cesm0.0.92`, so why is the call not activated for that tag? Why is it only being invoked with dev173?
Important details of your setup / configuration so we can reproduce the bug
You can view the SRCROOT_GIT_STATUS files for both dev173 and dev172 `hillslope` runs here, respectively:
/glade/u/home/glemieux/scratch/ctsm-tests/tests_mpi-nonserial-check-clm_hillslope-dev173
/glade/u/home/glemieux/scratch/ctsm-tests/tests_mpi-nonserial-check-clm_hillslope
Important output or errors that show the problem
cesm.log
mpibind.log