ESCOMP / CTSM

Community Terrestrial Systems Model (includes the Community Land Model of CESM)
http://www.cesm.ucar.edu/models/cesm2.0/land/

Problems with running fates_next_api/release-clm5.0 on izumi #1093

Open ekluzek opened 4 years ago

ekluzek commented 4 years ago

Brief summary of bug

Jackie (@jkshuman) has recently had problems running fates_next_api on izumi.

General bug information

CTSM version you are using: release-clm5.0.30-143-gabcd5937

Does this bug cause significantly incorrect results in the model's science? No

Configurations affected: izumi_intel

Details of bug

Failure of building gptl.

Important details of your setup / configuration so we can reproduce the bug

I think this is just because fates_next_api is using cime5.6.28 and needs to be updated to cime5.6.33

Important output or errors that show the problem

got another fail with gptl:
4:27 Finished creating component namelists
Building gptl with output to file /scratch/cluster/jkshuman/test2_Izu_GLDAS_boreal_GLDAS_0.25x_surf_0.125_6f568e42_e9f63270/bld/gptl.bldlog.200730-162622
Calling /home/jkshuman/git/fates_next_api/cime/src/build_scripts/buildlib.gptl
ERROR: /home/jkshuman/git/fates_next_api/cime/src/build_scripts/buildlib.gptl FAILED, cat /scratch/cluster/jkshuman/test2_Izu_GLDAS_boreal_GLDAS_0.25x_surf_0.125_6f568e42_e9f63270/bld/gptl.bldlog.200730-162622

4:28 my case script is here in case it is another obvious error.
4:29 path: /home/jkshuman/FATES_data/boreal/above_canada/case_izu_GLDAS_ABoVE_canada

ekluzek commented 4 years ago

From looking at the ChangeLog for cime, I think this should drop in with no problems, and no change of answers.

ekluzek commented 4 years ago

Here's another log message:

[jkshuman@izumi bld]$ cat gptl.bldlog.200729-182059
gmake --output-sync -f /home/jkshuman/git/fates_next_api/cime/src/share/timing/Makefile install -C /scratch/cluster/jkshuman/testGLDAS_boreal_Izu_intel_GLDAS_0.25x_surf_0.125_6f568e42_e9f63270/bld/intel/mvapich2/nodebug/nothreads/gptl MACFILE=/home/jkshuman/FATES_cases/Canada/test/testGLDAS_boreal_Izu_intel_GLDAS_0.25x_surf_0.125_6f568e42_e9f63270/Macros.make MODEL=gptl GPTL_DIR=/home/jkshuman/git/fates_next_api/cime/src/share/timing GPTL_LIBDIR=/scratch/cluster/jkshuman/testGLDAS_boreal_Izu_intel_GLDAS_0.25x_surf_0.125_6f568e42_e9f63270/bld/intel/mvapich2/nodebug/nothreads/gptl SHAREDPATH=/scratch/cluster/jkshuman/testGLDAS_boreal_Izu_intel_GLDAS_0.25x_surf_0.125_6f568e42_e9f63270/bld/intel/mvapich2/nodebug/nothreads
gmake: Entering directory '/scratch/cluster/jkshuman/testGLDAS_boreal_Izu_intel_GLDAS_0.25x_surf_0.125_6f568e42_e9f63270/bld/intel/mvapich2/nodebug/nothreads/gptl'
mpicc -c -I/home/jkshuman/git/fates_next_api/cime/src/share/timing -qno-opt-dynamic-align -fp-model precise -std=gnu99 -lifcore -O2 -debug minimal -DHAVE_NANOTIME -DBIT64 -DHAVE_VPRINTF -DHAVE_BACKTRACE -DHAVE_SLASHPROC -DHAVE_COMM_F2C -DHAVE_TIMES -DHAVE_GETTIMEOFDAY -DFORTRANUNDERSCORE -DCPRINTEL -DHAVE_MPI /home/jkshuman/git/fates_next_api/cime/src/share/timing/gptl.c
/home/jkshuman/git/fates_next_api/cime/src/share/timing/Makefile:57: recipe for target 'gptl.o' failed
gmake: Leaving directory '/scratch/cluster/jkshuman/testGLDAS_boreal_Izu_intel_GLDAS_0.25x_surf_0.125_6f568e42_e9f63270/bld/intel/mvapich2/nodebug/nothreads/gptl'
gmake: [gptl.o] Error 127
ERROR: gmake: [gptl.o] Error 127
[jkshuman@izumi bld]$
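A note on reading this log: make's "Error 127" is the exit status of the failed recipe, and 127 is the status a POSIX shell returns when it cannot find a command. One plausible reading is therefore that mpicc, or something it calls, was not found in the build environment; the log itself does not confirm which command was missing. The sketch below only demonstrates the convention. The helper names `shell_status` and `missing_tools` are made up for illustration and are not part of cime.

```python
import shutil
import subprocess


def shell_status(command):
    """Run a command line through the system shell and return its exit status."""
    result = subprocess.run(command, shell=True, stderr=subprocess.DEVNULL)
    return result.returncode


def missing_tools(names):
    """Return the tool names that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # A deliberately nonexistent command: a POSIX shell reports status 127.
    print(shell_status("definitely-not-a-real-command-xyz"))
    # Tool names taken from the log above; the result depends on your environment.
    print(missing_tools(["mpicc", "gmake"]))
```

Running `missing_tools` from the same shell (with the same modules loaded) that the case build uses gives the most meaningful answer.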

jkshuman commented 4 years ago

Thanks @ekluzek. I will follow up with you on where to get the updated cime, unless you want to post it here so I can update and test.

ekluzek commented 4 years ago

@jkshuman use cime5.6.33 and see if it works.

ekluzek commented 4 years ago

@jkshuman you must be watching now (or I just had a glitch Friday), as you now show up when I start typing your name. If I start typing someone's name and they don't show up as an option, it's usually because they aren't watching.

jkshuman commented 4 years ago

@ekluzek still getting a fail on Izumi. Let me know if I am missing something:
Working in this clone directory: /home/jkshuman/git/fates_next_api
Updated cime in the Externals.cfg file to cime.5.6.33
Ran ./manage_externals/checkout_externals cime (also tried running ./manage_externals/checkout_externals and got same fail)
Inside the cime folder, git describe shows an update to: cime5.6.32-16-gb5d8cb94e
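For reference, git describe output follows the layout `<tag>-<commits since tag>-g<abbreviated hash>`, so cime5.6.32-16-gb5d8cb94e reads as "16 commits past the tag cime5.6.32, at commit b5d8cb94e". By default git describe only considers annotated tags, so this string alone does not show whether the checkout sits on cime5.6.33. A minimal sketch of the decoding; `parse_describe` is a hypothetical helper, not part of manage_externals.

```python
import re

# <tag>-<number of commits since tag>-g<abbreviated commit hash>
DESCRIBE_RE = re.compile(r"^(?P<tag>.+)-(?P<count>\d+)-g(?P<sha>[0-9a-f]+)$")


def parse_describe(text):
    """Split `git describe` output into tag, commit count, and short hash."""
    text = text.strip()
    match = DESCRIBE_RE.match(text)
    if match is None:
        # When HEAD is exactly on a tag, git describe prints the bare tag name.
        return {"tag": text, "commits_since_tag": 0, "sha": None}
    return {
        "tag": match["tag"],
        "commits_since_tag": int(match["count"]),
        "sha": match["sha"],
    }


if __name__ == "__main__":
    print(parse_describe("cime5.6.32-16-gb5d8cb94e"))
```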

case build still fails on Izumi:
Calling /home/jkshuman/git/fates_next_api/cime/src/components/stub_comps/sesp/cime_config/buildnml
Calling /home/jkshuman/git/fates_next_api/cime/src/drivers/mct/cime_config/buildnml
Finished creating component namelists
Building gptl with output to file /scratch/cluster/jkshuman/cime_Izu_GLDAS_boreal_GLDAS_0.25x_surf_0.125_6f568e42_e9f63270/bld/gptl.bldlog.200803-172712
Calling /home/jkshuman/git/fates_next_api/cime/src/build_scripts/buildlib.gptl
ERROR: /home/jkshuman/git/fates_next_api/cime/src/build_scripts/buildlib.gptl FAILED, cat /scratch/cluster/jkshuman/cime_Izu_GLDAS_boreal_GLDAS_0.25x_surf_0.125_6f568e42_e9f63270/bld/gptl.bldlog.200803-172712

jkshuman commented 4 years ago

error in that file:
gmake --output-sync -f /home/jkshuman/git/fates_next_api/cime/src/share/timing/Makefile install -C /scratch/cluster/jkshuman/cime_Izu_GLDAS_boreal_GLDAS_0.25x_surf_0.125_6f568e42_e9f63270/bld/intel/mvapich2/nodebug/nothreads/gptl MACFILE=/home/jkshuman/FATES_cases/Canada/test/cime_Izu_GLDAS_boreal_GLDAS_0.25x_surf_0.125_6f568e42_e9f63270/Macros.make MODEL=gptl GPTL_DIR=/home/jkshuman/git/fates_next_api/cime/src/share/timing GPTL_LIBDIR=/scratch/cluster/jkshuman/cime_Izu_GLDAS_boreal_GLDAS_0.25x_surf_0.125_6f568e42_e9f63270/bld/intel/mvapich2/nodebug/nothreads/gptl SHAREDPATH=/scratch/cluster/jkshuman/cime_Izu_GLDAS_boreal_GLDAS_0.25x_surf_0.125_6f568e42_e9f63270/bld/intel/mvapich2/nodebug/nothreads
gmake: Entering directory '/scratch/cluster/jkshuman/cime_Izu_GLDAS_boreal_GLDAS_0.25x_surf_0.125_6f568e42_e9f63270/bld/intel/mvapich2/nodebug/nothreads/gptl'
mpicc -c -I/home/jkshuman/git/fates_next_api/cime/src/share/timing -qno-opt-dynamic-align -fp-model precise -std=gnu99 -O2 -debug minimal -DHAVE_NANOTIME -DBIT64 -DHAVE_VPRINTF -DHAVE_BACKTRACE -DHAVE_SLASHPROC -DHAVE_COMM_F2C -DHAVE_TIMES -DHAVE_GETTIMEOFDAY -DFORTRANUNDERSCORE -DCPRINTEL -DHAVE_MPI /home/jkshuman/git/fates_next_api/cime/src/share/timing/gptl.c
/home/jkshuman/git/fates_next_api/cime/src/share/timing/Makefile:57: recipe for target 'gptl.o' failed
gmake: Leaving directory '/scratch/cluster/jkshuman/cime_Izu_GLDAS_boreal_GLDAS_0.25x_surf_0.125_6f568e42_e9f63270/bld/intel/mvapich2/nodebug/nothreads/gptl'
gmake: *** [gptl.o] Error 127
ERROR: gmake: *** [gptl.o] Error 127
(base)

ekluzek commented 4 years ago

OK, I verified the same problem by testing fates_next_api, with both default cime and cime5.6.33...

SMS.f09_g17.I2000Clm50Fates.izumi_intel.clm-FatesColdDef

Then I also tried it on the release branch and see the same problem (cime5.6.33 is the default in release-clm5.0.34).

I also tried the more generic test

SMS.f09_g17.I2000Clm50BgcCrop.izumi_intel.clm-default

and it fails as well.

ekluzek commented 4 years ago

OK, it looks like the izumi updates went into the cime maint-5.6 branch but haven't been tagged. When I point cime to the latest maint-5.6 branch, it does seem to build.

This is the cime PR with the needed updates...

https://github.com/ESMCI/cime/pull/3561

jkshuman commented 4 years ago

@ekluzek model builds and submits, but fails after first time-step. (This same case was successful on Hobart.) "killed by signal 15". Here is the error:
run command is mpiexec --machinefile /var/spool/torque/aux//337043.izumi.unified.ucar.edu -n 384 /scratch/cluster/jkshuman/cime_t3_Izu_GLDAS_boreal_GLDAS_0.25x_surf_0.125_6f568e42_e9f63270/bld/cesm.exe >> cesm.log.$LID 2>&1
2020-08-04 13:47:37 MODEL EXECUTION HAS FINISHED
check for resubmit
dout_s True
mach izumi
resubmit_num 23
-------------------- Post Job Clean Up --------------------
Running cleanipcs as jkshuman on i041.unified.ucar.edu
Killed by signal 15.
Terminated i035.unified.ucar.edu
Connection to i035.unified.ucar.edu closed by remote host.
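For context, signal 15 is SIGTERM, the ordinary termination request. The message therefore says that something asked a process to stop (the log does not say what), which is different from a crash such as a segmentation fault. A quick way to look up a signal number, assuming a Linux host like the ones in this thread:

```python
import signal


def signal_name(number):
    """Map a signal number to its symbolic name on the current platform."""
    return signal.Signals(number).name


if __name__ == "__main__":
    print(signal_name(15))
```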

jkshuman commented 4 years ago

@ekluzek I tried a run that uses only 1 node and got the same failure. Both tests fail in a similar way: they complete the first time-step, then fail on resubmit. (The first run was an 8-node run with a monthly time-step; the second was a 1-node run with a yearly time-step.) Same failure: "killed by signal 15".

The 1-node case (junk no fire) will continue if I resubmit manually from inside the case. I did not test the 8-node case. Manual resubmit works on this case: /scratch/cluster/jkshuman/junk_cime_nofire_4x5_6f568e42_e9f63270/run

jkshuman commented 4 years ago

and this junk fire case running on 1 node was able to make it into year 2 automatically... /scratch/cluster/jkshuman/junk_cime_izu_fire_4x5_6f568e42_e9f63270/run

jkshuman commented 4 years ago

Note that the cime fix for Izumi works. Fix: Modify the Externals.cfg file to point at the maint-5.6 branch for access to the necessary Izumi updates:
[cime]
local_path = cime
protocol = git
repo_url = https://github.com/ESMCI/cime
branch = maint-5.6
required = True
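Externals.cfg is an INI-style file, so the entry above can be inspected with Python's standard configparser. The sketch below is only an illustration of checking which ref a section pins; `pinned_ref` is a hypothetical helper, and it looks only at the `tag` and `branch` keys mentioned in this thread.

```python
import configparser

# The [cime] section quoted in the comment above.
EXTERNALS_CFG = """\
[cime]
local_path = cime
protocol = git
repo_url = https://github.com/ESMCI/cime
branch = maint-5.6
required = True
"""


def pinned_ref(cfg_text, section):
    """Return (kind, value) for the ref a section pins, or None if neither key is set."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    for kind in ("tag", "branch"):
        if parser.has_option(section, kind):
            return kind, parser.get(section, kind)
    return None


if __name__ == "__main__":
    print(pinned_ref(EXTERNALS_CFG, "cime"))
```

Keep in mind that a branch moves as new commits land, so two checkouts of maint-5.6 made at different times can differ, whereas a tag stays fixed.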

The resubmit problem is a different issue (and inconsistent as not all cases fail for my test cases). @ekluzek should we close this and open a separate issue on this resubmit?

@jedwards4b (is this Jim Edwards?) suggested including a workaround option:
./case.submit --resubmit-immediate
This option is functional on Izumi, and a test case is into year 3 (annual time-step) with this option.

Per Jim: "Are you aware of the resubmit immediate option to case.submit? It will submit all of your jobs at once from the login node with dependencies so that each job will complete before the next begins. This should be an effective workaround for the problem compute nodes not resubmitting properly."

glemieux commented 4 years ago

@ekluzek I just accidentally replicated the above error on my workstation trying to build a single site case. Last week while helping @jkshuman track down the issue using my workstation I had been able to successfully build and run with sci.1.40.1_api.13.0.1 and fates_next_api release-clm5.0.30-143-gabcd5937 (with cime5.6.28).

The trigger for the failure this time was that I was trying to build the case with a conda environment activated that I don't normally use during case builds. Perhaps that suggests it's an issue with the module versions loaded on izumi? I can provide my output from conda list if you think it'd be helpful.
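One way to test the conda hypothesis is to check whether build tools resolve to paths inside the active conda environment, since an activated environment puts its own bin directory on PATH. This is a rough sketch under that assumption: it relies on the `CONDA_PREFIX` variable that conda sets on activation, and the helper names `inside_prefix` and `tools_from_conda` are invented for illustration.

```python
import os
import shutil


def inside_prefix(path, prefix):
    """True if path lies under the directory prefix."""
    path = os.path.abspath(path)
    prefix = os.path.abspath(prefix)
    return os.path.commonpath([path, prefix]) == prefix


def tools_from_conda(names, environ=os.environ):
    """List (name, path) pairs for tools that resolve inside the active conda env."""
    prefix = environ.get("CONDA_PREFIX")
    if not prefix:
        return []  # no conda environment is active
    found = []
    for name in names:
        path = shutil.which(name, path=environ.get("PATH"))
        if path and inside_prefix(path, prefix):
            found.append((name, path))
    return found


if __name__ == "__main__":
    # Tool names taken from the build logs earlier in the thread.
    print(tools_from_conda(["mpicc", "gmake"]))
```

An empty list means none of the named tools come from conda; it does not rule out other interference, such as libraries picked up from the environment.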

jkshuman commented 4 years ago

@glemieux this is interesting. I just did an overhaul of my conda environments, though I do not recall which conda environment was active (if any) when I ran these test cases.

jkshuman commented 4 years ago

Tested cime branch cime.5.8.30 on Izumi with fates_main_api per @ekluzek's recommendation, and the simulation was successful. Thanks @ekluzek

ctsm commit fde33f56e9d65e7cebc79a7a2319d8b1e5959296 (HEAD -> fates_main_api, escomp_ctsm_repo/fates_main_api)
cime5.8.30
fates_main commit 61a751c37181f18162e23851fff495db62fc807a (HEAD -> master, tag: sci.1.41.0_api.13.0.1, origin/master, origin/HEAD)

path to output: /scratch/cluster/jkshuman/archive/t4_izumi_JKS_C3_main_4x5_fde33f56_6bfea0f8/lnd/hist

glemieux commented 4 years ago

@ekluzek noted in the ctsm software meeting today that cime5.8.24 is the minimum necessary to alleviate this issue. For fates_main_api this should be taken care of by PR #1137, as it brings the branch up to cime5.8.28.
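A small sketch of checking a cime tag against that minimum. It compares tag numbers only, so it is a rough guide: earlier in this thread the untagged head of maint-5.6 also built on izumi, which a tag comparison cannot capture. The function names are made up for illustration, and the regular expression accepts both spellings seen above (cime5.8.30 and cime.5.8.30).

```python
import re


def cime_version(tag):
    """Parse a tag like 'cime5.8.24' into a tuple of ints, e.g. (5, 8, 24)."""
    match = re.fullmatch(r"cime\.?(\d+)\.(\d+)\.(\d+)", tag)
    if match is None:
        raise ValueError(f"unrecognised cime tag: {tag!r}")
    return tuple(int(part) for part in match.groups())


# Minimum version quoted in the comment above.
MINIMUM = cime_version("cime5.8.24")


def meets_minimum(tag):
    """True if the tag's version number is at least the quoted minimum."""
    return cime_version(tag) >= MINIMUM


if __name__ == "__main__":
    for tag in ("cime5.6.28", "cime5.6.33", "cime5.8.28", "cime.5.8.30"):
        print(tag, meets_minimum(tag))
```

Tuples of integers are compared rather than strings so that, for example, 5.8.100 would sort after 5.8.24.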