ESMCI / cime

Common Infrastructure for Modeling the Earth
http://esmci.github.io/cime

DAE tests sometimes fail (cime6.0.175) #4594

Closed samsrabin closed 6 months ago

samsrabin commented 6 months ago

Unfortunately this doesn't happen every time. The end of the logfile says something like this:

   Running /glade/u/home/samrabin/ctsm_hillslope_hydrology_derecho/cime/scripts/data_assimilation/da_no_data_mod.sh
/bin/sh: module: line 1: syntax error: unexpected end of file
/bin/sh: error importing function definition for `module'
check for resubmit
dout_s False
mach derecho
resubmit_num 0
ERROR: ERROR: Unrecognized line ('/bin/bash: module: line 1: syntax error: unexpected end of file
') found in /glade/derecho/scratch/samrabin/tests_0216-105810de/DAE_C2_D_Lh12.f10_f10_mg37.I2000Clm50BgcCrop.derecho_intel.clm-DA_multidrv.GC.0216-105810de/run/case2run/da.log.3093779.desched1.240216-123112.gz

Resubmitting usually fixes it, but sometimes it takes a few tries.

See, e.g., this log file:


/glade/derecho/scratch/samrabin/tests_0216-105810de/DAE_C2_D_Lh12.f10_f10_mg37.I2000Clm50BgcCrop.derecho_intel.clm-DA_multidrv.GC.0216-105810de/test.DAE_C2_D_Lh12.f10_f10_mg37.I2000Clm50BgcCrop.derecho_intel.clm-DA_multidrv.GC.0216-105810de.o3093779

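The "Unrecognized line" error suggests the test checks the DA log strictly, so any line it does not expect is treated as a failure. The following is a minimal sketch of that kind of check, not CIME's actual code: the function `check_da_log` and the patterns in `EXPECTED_PATTERNS` are made up for illustration.

```python
import re

# Hypothetical allow-list of line prefixes; the real DAE test's patterns may differ.
EXPECTED_PATTERNS = [
    re.compile(r"^caseroot:"),
    re.compile(r"^cycle:"),
]


def check_da_log(lines):
    """Return the number of recognized lines; raise on the first unexpected one."""
    recognized = 0
    for line in lines:
        if not line.strip():
            continue  # blank lines are ignored
        if not any(pattern.search(line) for pattern in EXPECTED_PATTERNS):
            raise ValueError(f"Unrecognized line ({line!r}) found in DA log")
        recognized += 1
    return recognized


clean_log = ["caseroot: /tmp/case", "cycle: 0", ""]
noisy_log = clean_log + [
    "/bin/bash: module: line 1: syntax error: unexpected end of file",
]

print(check_da_log(clean_log))
try:
    check_da_log(noisy_log)
except ValueError as err:
    print(err)
```

Under a check like this, a single stray shell message in an otherwise valid log is enough to fail the test, even if the script itself ran correctly.
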
jedwards4b commented 6 months ago

Reproduced on cime master. I asked USG about the error message:

I am sporadically getting an error message when running bash scripts on Derecho. The message doesn't seem to have any ill effects, but it causes a test that looks for the keyword "error" in my output to fail. I think it is coming from the module load command, but I'm having trouble figuring out what triggers it. The message is:

/bin/bash: module: line 1: syntax error: unexpected end of file
/bin/bash: error importing function definition for `module'

Brian Vanderwende:

Yeah, this is a known bug we're working on with PBS, in which jobs submitted with -V sometimes don't properly forward bash shell function definitions. Unfortunately, there isn't any workaround aside from (a) not using -V or (b) manually redefining the module function at the start of your job. Using shell init flags to change the behavior won't help, because PBS imposes the (broken) imported shell definitions after shell init.
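
The same pair of messages can be reproduced outside PBS. Bash hands exported functions to child shells through environment variables, and a child shell that receives an incomplete definition fails to parse it at startup. The following is a minimal sketch, assuming a bash on `PATH` that uses the upstream `BASH_FUNC_<name>%%` naming for exported functions. The truncated body in `BROKEN_DEFINITION` is invented for illustration and is not the real `module` definition, and `run_bash` is a hypothetical helper.

```python
import os
import subprocess

# Hypothetical, deliberately truncated function body (no closing brace),
# standing in for a definition that was forwarded incorrectly.
BROKEN_DEFINITION = "() {  echo loading"


def run_bash(extra_env):
    """Run a trivial bash command with extra environment variables.

    Returns (returncode, stdout, stderr).
    """
    env = dict(os.environ)
    env["LC_ALL"] = "C"  # keep bash's messages untranslated
    env.update(extra_env)
    result = subprocess.run(
        ["bash", "-c", "echo ok"],
        env=env,
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout, result.stderr


# With the broken exported function, bash complains on startup but
# still runs the command.
rc, out, err = run_bash({"BASH_FUNC_module%%": BROKEN_DEFINITION})
print(rc, out.strip())
print(err)
```

The command itself still runs, which is consistent with the message having no ill effects beyond the extra output it writes to stderr.
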

samsrabin commented 6 months ago

Thanks for looking into this, Jim. It's probably overkill, but could da_no_data_mod.sh be rewritten in Python to avoid this issue?

jedwards4b commented 6 months ago

If you would like to try rewriting that script in Python, you are welcome to. I just tried removing the -V flag from the PBS settings, and that seems to work (https://github.com/jedwards4b/ccs_config_cesm/tree/pbs_V_removed), but it may have other side effects and will need more testing.

samsrabin commented 6 months ago

Haha, thought you might say that! I say let's hope removing the -V option works. Are you planning to do the testing yourself? If not, I can give it a shot with the standard CTSM test suite (aux_clm) and let you know how it goes.

ekluzek commented 6 months ago

So the "-V" sends all of the environment variables from the submitting shell to the batch job. It looks like the da_no_data_mod.sh script doesn't use any env variables (other than one it extracts with xmlquery, which is fine). I would think that the only env variables needed would be in the env*.xml files and handled similarly, or that the CIME case logic would get whatever it needs from the case before using it. So it might actually be good to run this way, to make sure things are set where they need to be rather than assumed from the batch system.

So it seems likely that removing "-V" should work. But the change is universal for all PBS systems, which means it could potentially break on a different system while working on Derecho. So yes, this is something that should go through quite a bit of testing...
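
The difference between forwarding the whole environment and starting from a reduced one can be illustrated with plain subprocesses. This is only an analogy to submitting with and without -V, not a model of what PBS actually does: the variable `MY_LOGIN_ONLY_SETTING`, the helper `child_value`, and the list of variables kept in the reduced environment are all assumptions made for this sketch.

```python
import os
import subprocess
import sys


def child_value(name, env):
    """Start a child Python process with `env` and report what it sees for `name`."""
    code = f"import os; print(os.environ.get({name!r}, '<unset>'))"
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Hypothetical variable that exists only in the submitting shell.
os.environ["MY_LOGIN_ONLY_SETTING"] = "42"

# Analogous to -V: the child inherits everything.
forwarded = dict(os.environ)

# Analogous to no -V: the child gets only a small, fixed set of variables.
keep = ("PATH", "HOME", "LD_LIBRARY_PATH", "SYSTEMROOT")
minimal = {key: os.environ[key] for key in keep if key in os.environ}

print(child_value("MY_LOGIN_ONLY_SETTING", forwarded))
print(child_value("MY_LOGIN_ONLY_SETTING", minimal))
```

A script that silently depends on a variable like this works in the first case and breaks in the second, which is why running without -V is a useful check that everything needed is set explicitly by the case.
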

jedwards4b commented 6 months ago

@ekluzek If you look at the PR, you will see that I only made this change for NCAR systems.