Sometimes a SystemTest needs to call a Python script via subprocess. This is always risky because it introduces the possibility for the user environment to cause errors. However, it's sometimes needed because—at least it was when I was first developing the RXCROPMATURITY test—because it's the only way to have a conda environment active during the RUN step.
During that test's development, it became obvious that my first implementation wasn't working for everyone, unsurprisingly due to user environment differences. I thus implemented a helper function that first tries to call the requested script with conda run -n ctsm_pylib python script.py .... If that fails, it then tries conda activate ctsm_pylib && python script.py ....
The issue
Unfortunately, it doesn't distinguish "the first try failed because of user environment issues" from "the first try failed because of an error in script.py." This can lead to confusion when a dev (quite reasonably!) looks at the test.RXCROPMATURITY.*.o* or script.py.log2 files, which only show conda-related messages. Only looking at the script.py.log file (i.e., the log from the first attempt) will show the error in script.py itself.
Suggested fix
I think it should be possible for the run_python_script() helper function to check the log file from the first attempt and only try the fallback if no Python error is detected there. If the fallback fails because of a conda issue, the user should be instructed to check the log from the first attempt.
It might even be possible to completely remove the fallback at this point? It was only added for some devs with weird environments, and those issues might not be around anymore.
Example
This happened in the test at /glade/derecho/scratch/slevis/tests_1113-122038de/RXCROPMATURITYSKIPGEN_Ld1097.f10_f10_mg37.IHistClm60BgcCrop.derecho_intel.clm-cropMonthOutput.GC.1113-122038de_int.
From test.RXCROPMATURITYSKIPGEN_Ld1097.f10_f10_mg37.IHistClm60BgcCrop.derecho_intel.clm-cropMonthOutput.GC.1113-122038de_int.o6616271, which at least suggests a possible error in the called script (although it's buried in a LOT of text):
ERROR while getting the conda environment and/or
running the check_rxboth_run.py tool:
(1) If your ctsm_pylib environment is out of date or you
have not created the ctsm_pylib environment, yet, you may
get past this error by running ./py_env_create
in your ctsm directory and trying this test again.
(2) If conda is not available, install and load conda,
run ./py_env_create, and then try this test again.
(3) If (1) and (2) are not the issue, then you may be
getting an error within check_rxboth_run.py itself.
From check_rxboth_run.py.log2:
CondaError: Run 'conda init' before 'conda activate'
Only check_rxboth_run.py.log shows the actual problem:
Traceback (most recent call last):
File "/glade/work/slevis/conda-envs/ctsm_pylib/lib/python3.7/site-packages/xarray/core/dataset.py", line 1305, in _construct_dataarray
variable = self._variables[name]
KeyError: 'time_bounds'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/glade/work/slevis/git/hXi_hXa_tapes_iss1059/python/ctsm/crop_calendars/check_rxboth_run.py", line 172, in <module>
main(sys.argv[1:])
File "/glade/work/slevis/git/hXi_hXa_tapes_iss1059/python/ctsm/crop_calendars/check_rxboth_run.py", line 95, in main
throw_errors=False,
File "/glade/work/slevis/git/hXi_hXa_tapes_iss1059/python/ctsm/crop_calendars/../../../python/ctsm/crop_calendars/cropcal_module.py", line 447, in import_output
{"cftime": this_ds["time_bounds"].isel({"hist_interval": 0})}
File "/glade/work/slevis/conda-envs/ctsm_pylib/lib/python3.7/site-packages/xarray/core/dataset.py", line 1409, in __getitem__
return self._construct_dataarray(key)
File "/glade/work/slevis/conda-envs/ctsm_pylib/lib/python3.7/site-packages/xarray/core/dataset.py", line 1308, in _construct_dataarray
self._variables, name, self._level_coords, self.dims
File "/glade/work/slevis/conda-envs/ctsm_pylib/lib/python3.7/site-packages/xarray/core/dataset.py", line 171, in _get_virtual_variable
ref_var = variables[ref_name]
KeyError: 'time_bounds'
Background
Sometimes a SystemTest needs to call a Python script via
subprocess
. This is always risky because it introduces the possibility for the user environment to cause errors. However, it's sometimes needed because—at least it was when I was first developing the RXCROPMATURITY test—because it's the only way to have a conda environment active during the RUN step.During that test's development, it became obvious that my first implementation wasn't working for everyone, unsurprisingly due to user environment differences. I thus implemented a helper function that first tries to call the requested script with
conda run -n ctsm_pylib python script.py ...
. If that fails, it then triesconda activate ctsm_pylib && python script.py ...
.The issue
Unfortunately, it doesn't distinguish "the first try failed because of user environment issues" from "the first try failed because of an error in
script.py
." This can lead to confusion when a dev (quite reasonably!) looks at thetest.RXCROPMATURITY.*.o*
orscript.py.log2
files, which only show conda-related messages. Only looking at thescript.py.log
file (i.e., the log from the first attempt) will show the error inscript.py
itself.Suggested fix
I think it should be possible for the
run_python_script()
helper function to check the log file from the first attempt and only try the fallback if no Python error is detected there. If the fallback fails because of a conda issue, the user should be instructed to check the log from the first attempt.It might even be possible to completely remove the fallback at this point? It was only added for some devs with weird environments, and those issues might not be around anymore.
Example
This happened in the test at
/glade/derecho/scratch/slevis/tests_1113-122038de/RXCROPMATURITYSKIPGEN_Ld1097.f10_f10_mg37.IHistClm60BgcCrop.derecho_intel.clm-cropMonthOutput.GC.1113-122038de_int
.From
test.RXCROPMATURITYSKIPGEN_Ld1097.f10_f10_mg37.IHistClm60BgcCrop.derecho_intel.clm-cropMonthOutput.GC.1113-122038de_int.o6616271
, which at least suggests a possible error in the called script (although it's buried in a LOT of text):From
check_rxboth_run.py.log2
:Only
check_rxboth_run.py.log
shows the actual problem: