ekluzek opened 4 years ago
From looking at the ChangeLog for cime, I think this should drop in with no problems, and no change of answers.
Here's another log message:
```
[jkshuman@izumi bld]$ cat gptl.bldlog.200729-182059
gmake --output-sync -f /home/jkshuman/git/fates_next_api/cime/src/share/timing/Makefile install -C /scratch/cluster/jkshuman/testGLDAS_boreal_Izu_intel_GLDAS_0.25x_surf_0.125_6f568e42_e9f63270/bld/intel/mvapich2/nodebug/nothreads/gptl MACFILE=/home/jkshuman/FATES_cases/Canada/test/testGLDAS_boreal_Izu_intel_GLDAS_0.25x_surf_0.125_6f568e42_e9f63270/Macros.make MODEL=gptl GPTL_DIR=/home/jkshuman/git/fates_next_api/cime/src/share/timing GPTL_LIBDIR=/scratch/cluster/jkshuman/testGLDAS_boreal_Izu_intel_GLDAS_0.25x_surf_0.125_6f568e42_e9f63270/bld/intel/mvapich2/nodebug/nothreads/gptl SHAREDPATH=/scratch/cluster/jkshuman/testGLDAS_boreal_Izu_intel_GLDAS_0.25x_surf_0.125_6f568e42_e9f63270/bld/intel/mvapich2/nodebug/nothreads
gmake: Entering directory '/scratch/cluster/jkshuman/testGLDAS_boreal_Izu_intel_GLDAS_0.25x_surf_0.125_6f568e42_e9f63270/bld/intel/mvapich2/nodebug/nothreads/gptl'
mpicc -c -I/home/jkshuman/git/fates_next_api/cime/src/share/timing -qno-opt-dynamic-align -fp-model precise -std=gnu99 -lifcore -O2 -debug minimal -DHAVE_NANOTIME -DBIT64 -DHAVE_VPRINTF -DHAVE_BACKTRACE -DHAVE_SLASHPROC -DHAVE_COMM_F2C -DHAVE_TIMES -DHAVE_GETTIMEOFDAY -DFORTRANUNDERSCORE -DCPRINTEL -DHAVE_MPI /home/jkshuman/git/fates_next_api/cime/src/share/timing/gptl.c
/home/jkshuman/git/fates_next_api/cime/src/share/timing/Makefile:57: recipe for target 'gptl.o' failed
gmake: Leaving directory '/scratch/cluster/jkshuman/testGLDAS_boreal_Izu_intel_GLDAS_0.25x_surf_0.125_6f568e42_e9f63270/bld/intel/mvapich2/nodebug/nothreads/gptl'
gmake: *** [gptl.o] Error 127
ERROR: gmake: *** [gptl.o] Error 127
[jkshuman@izumi bld]$
```
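A side note not from the original thread: gmake's "Error 127" is the exit status it received from the shell, and status 127 conventionally means "command not found" -- here that would point at `mpicc` not being on the PATH of the build environment (e.g. modules or a conda environment altering it). A minimal sketch of that convention, with a hypothetical PATH check for the build host:

```shell
# Exit status 127 is the POSIX shell's "command not found" code --
# the same status gmake surfaces as "Error 127" in the log above.
sh -c 'no_such_command_xyz' 2>/dev/null
echo "exit status: $?"    # prints: exit status: 127

# Hypothetical sanity check on the build host: is mpicc visible at all?
command -v mpicc || echo "mpicc not found in PATH"
```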
Thanks @ekluzek, I will follow up with you on where to get the updated cime, unless you want to post it here, and I can update and test.
@jkshuman use cime5.6.33 and see if it works.
@jkshuman you must be watching now (or I just had a glitch Friday), as you now show up when I start typing your name. If I start typing someone's name and they don't show up as an option, it's usually because they aren't watching.
@ekluzek still getting a fail on Izumi. Let me know if I am missing something:
Working in this clone directory: /home/jkshuman/git/fates_next_api
Updated cime in the Externals.cfg file to cime5.6.33
Ran ./manage_externals/checkout_externals cime
(also tried running ./manage_externals/checkout_externals and got same fail)
inside cime folder git describe shows an update to: cime5.6.32-16-gb5d8cb94e
case build still fails on Izumi:
```
Calling /home/jkshuman/git/fates_next_api/cime/src/components/stub_comps/sesp/cime_config/buildnml
Calling /home/jkshuman/git/fates_next_api/cime/src/drivers/mct/cime_config/buildnml
Finished creating component namelists
Building gptl with output to file /scratch/cluster/jkshuman/cime_Izu_GLDAS_boreal_GLDAS_0.25x_surf_0.125_6f568e42_e9f63270/bld/gptl.bldlog.200803-172712
Calling /home/jkshuman/git/fates_next_api/cime/src/build_scripts/buildlib.gptl
ERROR: /home/jkshuman/git/fates_next_api/cime/src/build_scripts/buildlib.gptl FAILED, cat /scratch/cluster/jkshuman/cime_Izu_GLDAS_boreal_GLDAS_0.25x_surf_0.125_6f568e42_e9f63270/bld/gptl.bldlog.200803-172712
```
error in that file:
```
gmake --output-sync -f /home/jkshuman/git/fates_next_api/cime/src/share/timing/Makefile install -C /scratch/cluster/jkshuman/cime_Izu_GLDAS_boreal_GLDAS_0.25x_surf_0.125_6f568e42_e9f63270/bld/intel/mvapich2/nodebug/nothreads/gptl MACFILE=/home/jkshuman/FATES_cases/Canada/test/cime_Izu_GLDAS_boreal_GLDAS_0.25x_surf_0.125_6f568e42_e9f63270/Macros.make MODEL=gptl GPTL_DIR=/home/jkshuman/git/fates_next_api/cime/src/share/timing GPTL_LIBDIR=/scratch/cluster/jkshuman/cime_Izu_GLDAS_boreal_GLDAS_0.25x_surf_0.125_6f568e42_e9f63270/bld/intel/mvapich2/nodebug/nothreads/gptl SHAREDPATH=/scratch/cluster/jkshuman/cime_Izu_GLDAS_boreal_GLDAS_0.25x_surf_0.125_6f568e42_e9f63270/bld/intel/mvapich2/nodebug/nothreads
gmake: Entering directory '/scratch/cluster/jkshuman/cime_Izu_GLDAS_boreal_GLDAS_0.25x_surf_0.125_6f568e42_e9f63270/bld/intel/mvapich2/nodebug/nothreads/gptl'
mpicc -c -I/home/jkshuman/git/fates_next_api/cime/src/share/timing -qno-opt-dynamic-align -fp-model precise -std=gnu99 -O2 -debug minimal -DHAVE_NANOTIME -DBIT64 -DHAVE_VPRINTF -DHAVE_BACKTRACE -DHAVE_SLASHPROC -DHAVE_COMM_F2C -DHAVE_TIMES -DHAVE_GETTIMEOFDAY -DFORTRANUNDERSCORE -DCPRINTEL -DHAVE_MPI /home/jkshuman/git/fates_next_api/cime/src/share/timing/gptl.c
/home/jkshuman/git/fates_next_api/cime/src/share/timing/Makefile:57: recipe for target 'gptl.o' failed
gmake: Leaving directory '/scratch/cluster/jkshuman/cime_Izu_GLDAS_boreal_GLDAS_0.25x_surf_0.125_6f568e42_e9f63270/bld/intel/mvapich2/nodebug/nothreads/gptl'
gmake: *** [gptl.o] Error 127
ERROR: gmake: *** [gptl.o] Error 127
(base)
```
OK, I verified the same problem by testing fates_next_api, with both default cime and cime5.6.33...
SMS.f09_g17.I2000Clm50Fates.izumi_intel.clm-FatesColdDef
Then I also tried it on the release branch and see the same problem (cime5.6.33 is the default in release-clm5.0.34).
I also tried the more generic test
SMS.f09_g17.I2000Clm50BgcCrop.izumi_intel.clm-default
and it fails as well.
OK, it looks like the izumi updates went into the cime maint-5.6 branch -- but haven't been tagged. When I point cime at the latest maint-5.6 branch, it does seem to build.
This is the cime PR with the needed updates...
@ekluzek model builds and submits, but fails after first time-step. (This same case was successful on Hobart)
"killed by signal 15"
here is the error:
```
run command is mpiexec --machinefile /var/spool/torque/aux//337043.izumi.unified.ucar.edu -n 384 /scratch/cluster/jkshuman/cime_t3_Izu_GLDAS_boreal_GLDAS_0.25x_surf_0.125_6f568e42_e9f63270/bld/cesm.exe >> cesm.log.$LID 2>&1
2020-08-04 13:47:37 MODEL EXECUTION HAS FINISHED
check for resubmit
dout_s True
mach izumi
resubmit_num 23
-------------------- Post Job Clean Up --------------------
Running cleanipcs as jkshuman on i041.unified.ucar.edu
Killed by signal 15.
Terminated
i035.unified.ucar.edu
Connection to i035.unified.ucar.edu closed by remote host.
```
@ekluzek I tried a run that uses only 1 node, and got the same fail. The two tests fail similarly: they complete the first time-step, and then fail on resubmit. (The first run was an 8-node run with monthly time-step; the second test was 1 node with yearly time-step.) Same fail: "killed by signal 15"
The 1-node case (junk no fire) will continue if I resubmit manually from inside the case. I did not test the 8-node case. Manual resubmit works on this case: /scratch/cluster/jkshuman/junk_cime_nofire_4x5_6f568e42_e9f63270/run
and this junk fire case running on 1 node was able to make it into year 2 automatically... /scratch/cluster/jkshuman/junk_cime_izu_fire_4x5_6f568e42_e9f63270/run
Note that the cime fix for Izumi works.
Fix: Modify Externals.cfg file to point at maint-5.6 branch for access to the necessary Izumi updates
```
[cime]
local_path = cime
protocol = git
repo_url = https://github.com/ESMCI/cime
branch = maint-5.6
required = True
```
The resubmit problem is a different issue (and inconsistent, as not all of my test cases fail). @ekluzek should we close this and open a separate issue for the resubmit?
@jedwards4b (is this Jim Edwards?) suggested a workaround: `./case.submit --resubmit-immediate`. This option works on Izumi, and a test case is now into year 3 (annual time-step) with it.
Per Jim: "Are you aware of the resubmit immediate option to case.submit? It will submit all of your jobs at once from the login node with dependencies so that each job will complete before the next begins. This should be an effective workaround for the problem of compute nodes not resubmitting properly."
@ekluzek I just accidentally replicated the above error on my workstation trying to build a single-site case. Last week, while helping @jkshuman track down the issue using my workstation, I had been able to successfully build and run with sci.1.40.1_api.13.0.1 and fates_next_api release-clm5.0.30-143-gabcd5937 (with cime5.6.28).
The trigger for the failure this time was that I was trying to build the case with a conda environment activated that I don't normally use during case builds. Perhaps that suggests it's an issue with the module versions loaded on izumi? I can provide my output from `conda list` if you think it'd be helpful.
@glemieux this is interesting. I just did an overhaul of my conda environments, though I do not recall which conda environment was active (if any) when I ran these test cases.
Tested cime5.8.30 on Izumi with fates_main_api per @ekluzek's recommendation and the simulation was successful. Thanks @ekluzek
ctsm commit fde33f56e9d65e7cebc79a7a2319d8b1e5959296 (HEAD -> fates_main_api, escomp_ctsm_repo/fates_main_api)
cime5.8.30
fates_main commit 61a751c37181f18162e23851fff495db62fc807a (HEAD -> master, tag: sci.1.41.0_api.13.0.1, origin/master, origin/HEAD)
path to output: /scratch/cluster/jkshuman/archive/t4_izumi_JKS_C3_main_4x5_fde33f56_6bfea0f8/lnd/hist
@ekluzek noted in the ctsm software meeting today that cime5.8.24 is the minimum necessary to alleviate this issue. For fates_main_api this should be taken care of by PR #1137, as it brings the branch up to cime5.8.28.
Brief summary of bug
Jackie (@jkshuman) has had problems of late running fates_next_api on izumi.
General bug information
CTSM version you are using: release-clm5.0.30-143-gabcd5937
Does this bug cause significantly incorrect results in the model's science? No
Configurations affected: izumi_intel
Details of bug
Failure of building gptl.
Important details of your setup / configuration so we can reproduce the bug
I think this is just because fates_next_api is using cime5.6.28 and needs to be updated to cime5.6.33.
Important output or errors that show the problem