LLNL / maestrowf

A tool to easily orchestrate general computational workflows both locally and on supercomputers
https://maestrowf.readthedocs.io
MIT License
134 stars, 43 forks

jobs are "INITIALIZED" but not starting #435

Closed BenWibking closed 9 months ago

BenWibking commented 9 months ago

I have a test study running on my laptop that appears stuck in this state:

===================================================================================================================================================================================
Step Name             Job ID    Workspace             State        Run Time        Elapsed Time    Start Time           Submit Time          End Time               Number Restarts
--------------------  --------  --------------------  -----------  --------------  --------------  -------------------  -------------------  -------------------  -----------------
generate-profile_0.1  26102     generate-profile/0.1  FINISHED     0d:00h:00m:02s  0d:00h:00m:02s  2024-01-29 13:54:55  2024-01-29 13:54:55  2024-01-29 13:54:57                  0
generate-profile_0.3  26108     generate-profile/0.3  FINISHED     0d:00h:00m:02s  0d:00h:00m:02s  2024-01-29 13:54:57  2024-01-29 13:54:57  2024-01-29 13:54:59                  0
generate-profile_1.0  26111     generate-profile/1.0  FINISHED     0d:00h:00m:02s  0d:00h:00m:02s  2024-01-29 13:54:59  2024-01-29 13:54:59  2024-01-29 13:55:01                  0
generate-infile_0.1   26303     generate-infile/0.1   FINISHED     0d:00h:00m:02s  0d:00h:00m:02s  2024-01-29 13:56:01  2024-01-29 13:56:01  2024-01-29 13:56:03                  0
generate-infile_0.3   26322     generate-infile/0.3   FINISHED     0d:00h:00m:01s  0d:00h:00m:01s  2024-01-29 13:56:03  2024-01-29 13:56:03  2024-01-29 13:56:04                  0
generate-infile_1.0   26340     generate-infile/1.0   FINISHED     0d:00h:00m:02s  0d:00h:00m:02s  2024-01-29 13:56:04  2024-01-29 13:56:04  2024-01-29 13:56:06                  0
run-sim_0.1           --        run-sim/0.1           INITIALIZED  --:--:--        --:--:--        --                   --                   --                                   0
run-sim_0.3           --        run-sim/0.3           INITIALIZED  --:--:--        --:--:--        --                   --                   --                                   0
run-sim_1.0           --        run-sim/1.0           INITIALIZED  --:--:--        --:--:--        --                   --                   --                                   0
===================================================================================================================================================================================

The subdirectories for run-sim_0.1, run-sim_0.3, and run-sim_1.0 don't have any files in them, except for the subdirectory for run-sim_0.1, which has a bash script that was generated from the workflow.

Is there any way to figure out what it's doing and why it appears to be stuck?

BenWibking commented 9 months ago

top shows that the simulation corresponding to run-sim_0.1 is running.

Is there some output buffering that would explain why I don't see any log files?

jwhite242 commented 9 months ago

Unfortunately, I think that's currently expected for the local adapter. It appears to run in a blocking manner, waiting for the subprocess (the step's bash script) to finish before it writes out the .out/.err log files. We do plan to unblock that with an executor backend so that it behaves like the HPC adapters, but that work is only in a dev branch at the moment.
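
To make the distinction concrete, here is a minimal Python sketch of the two behaviours. It is not maestrowf's adapter code; the function names `run_blocking` and `run_streaming` and the `<name>.out`/`<name>.err` file naming are assumptions made for illustration only. The first function captures output in memory and writes the log files only after the child process exits, which would match the symptom of an empty step directory while the simulation is still running. The second hands the log files to the child directly, so they exist and fill up while it runs.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_blocking(cmd, workdir, name):
    """Hypothetical helper: wait for cmd to exit, then write the log files."""
    workdir = Path(workdir)
    proc = subprocess.Popen(
        cmd, cwd=workdir,
        stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True,
    )
    # communicate() blocks until the child exits; nothing is on disk until then.
    out, err = proc.communicate()
    (workdir / f"{name}.out").write_text(out)
    (workdir / f"{name}.err").write_text(err)
    return proc.returncode


def run_streaming(cmd, workdir, name):
    """Hypothetical helper: let the child write to the log files as it runs."""
    workdir = Path(workdir)
    with open(workdir / f"{name}.out", "w") as out, \
            open(workdir / f"{name}.err", "w") as err:
        # The files are created before the child starts and can be read
        # (for example with `tail -f`) while the step is still running.
        proc = subprocess.Popen(cmd, cwd=workdir, stdout=out, stderr=err)
        return proc.wait()


if __name__ == "__main__":
    # Stand-in for a step script: a short Python one-liner.
    demo_cmd = [sys.executable, "-c", "print('step output')"]
    with tempfile.TemporaryDirectory() as tmp:
        run_blocking(demo_cmd, tmp, "blocking")
        run_streaming(demo_cmd, tmp, "streaming")
        print(sorted(p.name for p in Path(tmp).iterdir()))
```

Both helpers still wait for the step to finish before returning; the only difference shown here is when the log files become visible. Even in the streaming variant, the child program's own output buffering can delay when lines appear in the file.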

BenWibking commented 9 months ago

Thanks for the explanation and quick reply.