E3SM-Project / polaris

Testing and analysis for OMEGA, MPAS-Ocean, MALI and MPAS-Seaice

Add dependency in restart test case #95

Closed · xylar closed this 1 year ago

xylar commented 1 year ago

Because we don't know the filename of the restart file at setup time, we need to instead make the full_run step an explicit dependency of the restart_run step.
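
As an illustration of that change, here is a minimal, hypothetical sketch of what wiring the dependency could look like at setup time. The Step stand-in class and the add_dependency() method are assumptions made for the sketch (not the actual polaris API); the step_after_run.pickle filename comes from the error message quoted later in this thread.

import os


class Step:
    # minimal stand-in for a polaris step, for illustration only
    def __init__(self, name, work_dir):
        self.name = name
        self.work_dir = work_dir
        self.inputs = []

    def add_dependency(self, other):
        # assumed behavior: depending on another step means requiring the
        # pickle file that step writes when it finishes running, rather than
        # a specific output filename known only at run time
        self.inputs.append(
            os.path.join(other.work_dir, 'step_after_run.pickle'))


full_run = Step('full_run', 'full_run')
restart_run = Step('restart_run', 'restart_run')

# the restart filename is unknown at setup time, so restart_run depends on
# the completed full_run step instead of listing a restart file as an input
restart_run.add_dependency(full_run)
print(restart_run.inputs)  # ['full_run/step_after_run.pickle']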

xylar commented 1 year ago

Testing

The restart_test and the rest of the PR suite passed on Chrysalis and are BFB with a baseline using main.

xylar commented 1 year ago

This will need to be rebased and conflicts fixed after #96 goes in.

xylar commented 1 year ago

@altheaden, that's odd. The error you see isn't what I expected or what I see when I try the same. I see:

$ cd init/
$ polaris serial
...
$ cd ../restart_run
$ polaris serial
polaris calling: polaris.run.serial._run_test()
  in /home/xylar/code/e3sm/polaris/fix-restart-test-inputs-outputs/polaris/run/serial.py

Traceback (most recent call last):
  File "/home/xylar/mambaforge/envs/polaris_test/bin/polaris", line 33, in <module>
    sys.exit(load_entry_point('polaris', 'console_scripts', 'polaris')())
  File "/home/xylar/code/e3sm/polaris/fix-restart-test-inputs-outputs/polaris/__main__.py", line 62, in main
    commands[args.command]()
  File "/home/xylar/code/e3sm/polaris/fix-restart-test-inputs-outputs/polaris/run/serial.py", line 176, in main
    run_single_step(args.step_is_subprocess)
  File "/home/xylar/code/e3sm/polaris/fix-restart-test-inputs-outputs/polaris/run/serial.py", line 134, in run_single_step
    _run_test(test_case, available_resources)
  File "/home/xylar/code/e3sm/polaris/fix-restart-test-inputs-outputs/polaris/run/serial.py", line 409, in _run_test
    _run_step(test_case, step, test_case.new_step_log_file,
  File "/home/xylar/code/e3sm/polaris/fix-restart-test-inputs-outputs/polaris/run/serial.py", line 431, in _run_step
    raise OSError(
OSError: input file(s) missing in step restart_run of ocean/baroclinic_channel/10km/restart: ['/home/xylar/data/polaris_0.1/test_20230726/fix-bcn-restart/ocean/baroclinic_channel/10km/restart/full_run/step_after_run.pickle']

That's what I was expecting to see -- it's complaining about an input rather than an output file.
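
For context, that message comes from a pre-run check of the step's declared inputs. A rough paraphrase of the check, based only on the traceback above (not the exact code in polaris/run/serial.py):

import os


def check_inputs(step_name, test_case_path, inputs):
    # before a step runs, collect any declared input file that is missing on
    # disk; a dependency that has not run yet shows up here as its missing
    # step_after_run.pickle file
    missing = [path for path in inputs if not os.path.exists(path)]
    if missing:
        raise OSError(f'input file(s) missing in step {step_name} of '
                      f'{test_case_path}: {missing}')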

xylar commented 1 year ago

I'm going to go ahead and merge, but it would be good to know what workflow produced the results you saw.

altheaden commented 1 year ago

I can recreate it today and see what the results are. As far as I can tell, I did the same process that you did, but let me see if my results are different this time around.

altheaden commented 1 year ago

@xylar Here is a longer version of the error message I get (not sure how much is useful for you to see), still ending in the same error. Not sure what I'm doing differently.

(polaris-test-2) [ac.althea@chr-0245 fix-restart-test-inputs-outputs]$ cd ocean/baroclinic_channel/10km/restart/init
(polaris-test-2) [ac.althea@chr-0245 init]$ polaris serial
...
(polaris-test-2) [ac.althea@chr-0245 init]$ cd ../restart_run/
(polaris-test-2) [ac.althea@chr-0245 restart_run]$ polaris serial
...
Bypassing step's run() method and running with command line args

polaris calling: polaris.parallel.run_command()
  in /gpfs/fs1/home/ac.althea/code/polaris/fix-restart-test-inputs-outputs/polaris/parallel.py

Running: srun -c 1 -N 1 -n 4 ./ocean_model -n namelist.ocean -s streams.ocean
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 0.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
with errorcode 0.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 2 in communicator MPI_COMM_WORLD
with errorcode 0.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD
with errorcode 0.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------

Traceback (most recent call last):
  File "/home/ac.althea/miniconda3/envs/polaris-test-2/bin/polaris", line 33, in <module>
    sys.exit(load_entry_point('polaris', 'console_scripts', 'polaris')())
  File "/gpfs/fs1/home/ac.althea/code/polaris/fix-restart-test-inputs-outputs/polaris/__main__.py", line 62, in main
    commands[args.command]()
  File "/gpfs/fs1/home/ac.althea/code/polaris/fix-restart-test-inputs-outputs/polaris/run/serial.py", line 176, in main
    run_single_step(args.step_is_subprocess)
  File "/gpfs/fs1/home/ac.althea/code/polaris/fix-restart-test-inputs-outputs/polaris/run/serial.py", line 134, in run_single_step
    _run_test(test_case, available_resources)
  File "/gpfs/fs1/home/ac.althea/code/polaris/fix-restart-test-inputs-outputs/polaris/run/serial.py", line 409, in _run_test
    _run_step(test_case, step, test_case.new_step_log_file,
  File "/gpfs/fs1/home/ac.althea/code/polaris/fix-restart-test-inputs-outputs/polaris/run/serial.py", line 499, in _run_step
    raise OSError(
OSError: output file(s) missing in step restart_run of ocean/baroclinic_channel/10km/restart: ['/home/ac.althea/ac.althea/polaris_tests/baroclinic/fix-restart-test-inputs-outputs/ocean/baroclinic_channel/10km/restart/restart_run/output.nc']

xylar commented 1 year ago

@altheaden, is this in a directory where you already ran the command successfully once? Even if so, it's weird that it doesn't just run successfully and instead has errors. We would probably need to look at log.ocean.0000.err to see what the issue was that led to the MPI_ABORT.

But it seems like you're seeing a rather different and more unexpected behavior than I was seeing. Maybe let's let it be for now. If we see this again, we can investigate further.

altheaden commented 1 year ago

@xylar I actually just made sure to update the submodules and re-make before setting up the test again. Every time, I have been setting up a new directory and just doing the workflow I showed (cd init, polaris serial, cd restart_run, polaris serial). I just did it again and got the same results. Then, I went and manually ran the full run step before running the restart run step and they were both successful.

altheaden commented 1 year ago

I just checked the error files from my restart_run test, and they all just say that the restart file in the restarts directory doesn't exist.

altheaden commented 1 year ago

(polaris-test-2) [ac.althea@chr-0245 restart_run]$ cat *.err
----------------------------------------------------------------------
Beginning MPAS-ocean Error Log File for task       0 of       4
    Opened at 2023/07/26 13:41:20
----------------------------------------------------------------------

ERROR: Stream 'restart' attempted to read non-existent file '../restarts/rst.0001-01-01_00.05.00.nc'
ERROR: Error reading initial state in init
CRITICAL ERROR: Core init failed for core ocean
Logging complete.  Closing file at 2023/07/26 13:41:20
----------------------------------------------------------------------
Beginning MPAS-ocean Error Log File for task       1 of       4
    Opened at 2023/07/26 13:41:20
----------------------------------------------------------------------

ERROR: Stream 'restart' attempted to read non-existent file '../restarts/rst.0001-01-01_00.05.00.nc'
CRITICAL ERROR: Core init failed for core ocean
Logging complete.  Closing file at 2023/07/26 13:41:20
----------------------------------------------------------------------
Beginning MPAS-ocean Error Log File for task       2 of       4
    Opened at 2023/07/26 13:41:20
----------------------------------------------------------------------

ERROR: Stream 'restart' attempted to read non-existent file '../restarts/rst.0001-01-01_00.05.00.nc'
CRITICAL ERROR: Core init failed for core ocean
Logging complete.  Closing file at 2023/07/26 13:41:20
----------------------------------------------------------------------
Beginning MPAS-ocean Error Log File for task       3 of       4
    Opened at 2023/07/26 13:41:20
----------------------------------------------------------------------

ERROR: Stream 'restart' attempted to read non-existent file '../restarts/rst.0001-01-01_00.05.00.nc'
CRITICAL ERROR: Core init failed for core ocean
Logging complete.  Closing file at 2023/07/26 13:41:20

xylar commented 1 year ago

@altheaden, those all look like errors I would have expected to see before this branch. Any chance you were accidentally testing from a different branch (e.g. an earlier version of main) rather than my fix-restart-test-inputs-outputs? That branch is now gone, but you could test with the latest main and it should behave like my test.

But also, like I said, it's not critical to figure this out if you'd rather let it go.

altheaden commented 1 year ago

@xylar I just did as you asked and now I'm getting the same error you were getting: a missing input file. No idea why it was different for me before...