Improve (or remove??) fallback call in run_python_script()

samsrabin commented 4 days ago

Background

Sometimes a SystemTest needs to call a Python script via subprocess. This is always risky because it introduces the possibility for the user environment to cause errors. However, it's sometimes needed because—at least it was when I was first developing the RXCROPMATURITY test—because it's the only way to have a conda environment active during the RUN step.

During that test's development, it became obvious that my first implementation wasn't working for everyone, unsurprisingly due to user environment differences. I thus implemented a helper function that first tries to call the requested script with conda run -n ctsm_pylib python script.py .... If that fails, it then tries conda activate ctsm_pylib && python script.py ....

The issue

Unfortunately, it doesn't distinguish "the first try failed because of user environment issues" from "the first try failed because of an error in script.py." This can lead to confusion when a dev (quite reasonably!) looks at the test.RXCROPMATURITY.*.o* or script.py.log2 files, which only show conda-related messages. Only looking at the script.py.log file (i.e., the log from the first attempt) will show the error in script.py itself.

Suggested fix

I think it should be possible for the run_python_script() helper function to check the log file from the first attempt and only try the fallback if no Python error is detected there. If the fallback fails because of a conda issue, the user should be instructed to check the log from the first attempt.

It might even be possible to completely remove the fallback at this point? It was only added for some devs with weird environments, and those issues might not be around anymore.

Example

This happened in the test at /glade/derecho/scratch/slevis/tests_1113-122038de/RXCROPMATURITYSKIPGEN_Ld1097.f10_f10_mg37.IHistClm60BgcCrop.derecho_intel.clm-cropMonthOutput.GC.1113-122038de_int.

From test.RXCROPMATURITYSKIPGEN_Ld1097.f10_f10_mg37.IHistClm60BgcCrop.derecho_intel.clm-cropMonthOutput.GC.1113-122038de_int.o6616271, which at least suggests a possible error in the called script (although it's buried in a LOT of text):

ERROR while getting the conda environment and/or
running the check_rxboth_run.py tool:
(1) If your ctsm_pylib environment is out of date or you
have not created the ctsm_pylib environment, yet, you may
get past this error by running ./py_env_create
in your ctsm directory and trying this test again.
(2) If conda is not available, install and load conda,
run ./py_env_create, and then try this test again.
(3) If (1) and (2) are not the issue, then you may be
getting an error within check_rxboth_run.py itself.

From check_rxboth_run.py.log2:

CondaError: Run 'conda init' before 'conda activate'

Only check_rxboth_run.py.log shows the actual problem:

Traceback (most recent call last):
  File "/glade/work/slevis/conda-envs/ctsm_pylib/lib/python3.7/site-packages/xarray/core/dataset.py", line 1305, in _construct_dataarray
    variable = self._variables[name]
KeyError: 'time_bounds'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/glade/work/slevis/git/hXi_hXa_tapes_iss1059/python/ctsm/crop_calendars/check_rxboth_run.py", line 172, in <module>
    main(sys.argv[1:])
  File "/glade/work/slevis/git/hXi_hXa_tapes_iss1059/python/ctsm/crop_calendars/check_rxboth_run.py", line 95, in main
    throw_errors=False,
  File "/glade/work/slevis/git/hXi_hXa_tapes_iss1059/python/ctsm/crop_calendars/../../../python/ctsm/crop_calendars/cropcal_module.py", line 447, in import_output
    {"cftime": this_ds["time_bounds"].isel({"hist_interval": 0})}
  File "/glade/work/slevis/conda-envs/ctsm_pylib/lib/python3.7/site-packages/xarray/core/dataset.py", line 1409, in __getitem__
    return self._construct_dataarray(key)
  File "/glade/work/slevis/conda-envs/ctsm_pylib/lib/python3.7/site-packages/xarray/core/dataset.py", line 1308, in _construct_dataarray
    self._variables, name, self._level_coords, self.dims
  File "/glade/work/slevis/conda-envs/ctsm_pylib/lib/python3.7/site-packages/xarray/core/dataset.py", line 171, in _get_virtual_variable
    ref_var = variables[ref_name]
KeyError: 'time_bounds'

samsrabin commented 4 days ago

Added "next" label so that this gets brought up in the SE meeting, making sure everyone knows to look in BOTH log files.

wwieder commented 4 days ago

@adrifoster suggested using popen to maybe fix this?

ESCOMP / CTSM