LLNL / maestrowf

A tool to easily orchestrate general computational workflows both locally and on supercomputers
https://maestrowf.readthedocs.io
MIT License

jobs stuck in PENDING state but they've actually completed or are currently running #441

Open BenWibking opened 6 months ago

BenWibking commented 6 months ago

I've encountered a confusing situation: the conductor is still running as a background process and my jobs have all completed successfully, but maestro status lists the jobs that were run via SLURM as still "PENDING".

Is there any way to figure out what has gone wrong, or otherwise reset the conductor process?

Running ps aux | grep $USER, I see:

login4.stampede3(626)$ ps aux | grep $USER
bwibking 1549018  0.0  0.0  20820 11936 ?        Ss   Apr30   0:09 /usr/lib/systemd/systemd --user
bwibking 1549019  0.0  0.0 202568  6848 ?        S    Apr30   0:00 (sd-pam)
bwibking 1725176  0.0  0.0   8436  3692 ?        S    Apr30   0:00 /bin/sh -c nohup conductor -t 60 -d 2 /scratch/02661/bwibking/precipitator-paper/outputs/medres_compressive_20240430-160408 > /scratch/02661/bwibking/precipitator-paper/outputs/medres_compressive_20240430-160408/medres_compressive.txt 2>&1
bwibking 1725177  0.0  0.0 325552 67876 ?        S    Apr30   0:32 /scratch/projects/compilers/intel24.0/oneapi/intelpython/python3.9/bin/python3.9 /home1/02661/bwibking/.local/bin/conductor -t 60 -d 2 /scratch/02661/bwibking/precipitator-paper/outputs/medres_compressive_20240430-160408
root     3152422  0.0  0.0  39932 11976 ?        Ss   13:31   0:00 sshd: bwibking [priv]
bwibking 3152677  0.0  0.0  40240  7472 ?        S    13:31   0:00 sshd: bwibking@pts/122
bwibking 3152678  0.0  0.0  17920  6116 pts/122  Ss   13:31   0:00 -bash
root     3163591  0.0  0.0  39932 12044 ?        Ss   13:43   0:00 sshd: bwibking [priv]
bwibking 3163629  0.0  0.0  40076  7236 ?        S    13:43   0:00 sshd: bwibking@notty
root     3168472  0.0  0.0  39932 12020 ?        Ss   13:48   0:00 sshd: bwibking [priv]
bwibking 3168505  0.0  0.0  40080  7208 ?        S    13:48   0:00 sshd: bwibking@notty
bwibking 3174162  0.0  0.0  20408  4220 pts/122  R+   13:54   0:00 ps aux
bwibking 3174163  0.0  0.0   7592  2836 pts/122  S+   13:54   0:00 grep --color=auto bwibking

Partial output from maestro status:

run-sim_amp.0.0002222222222222222.f_sol.0.0.tc_tff.1.333521432163324                   3  run-sim/amp.0.0002222222222222222.f_sol.0.0.tc_tff.1.333521432163324            PENDING   --:--:--        --:--:--        --                   2024-04-30 16:06:52  --                                   0
run-sim_amp.0.0003888888888888888.f_sol.0.0.tc_tff.4.216965034285822                   3  run-sim/amp.0.0003888888888888888.f_sol.0.0.tc_tff.4.216965034285822            PENDING   --:--:--        --:--:--        --                   2024-04-30 16:06:54  --                                   0
run-sim_amp.0.00011111111111111112.f_sol.0.0.tc_tff.2.371373705661655                  3  run-sim/amp.0.00011111111111111112.f_sol.0.0.tc_tff.2.371373705661655           PENDING   --:--:--        --:--:--        --                   2024-04-30 16:06:55  --                                   0
run-sim_amp.0.0002777777777777778.f_sol.0.0.tc_tff.7.498942093324558                   3  run-sim/amp.0.0002777777777777778.f_sol.0.0.tc_tff.7.498942093324558            PENDING   --:--:--        --:--:--        --                   2024-04-30 16:06:57  --                                   0
run-sim_amp.0.00044444444444444447.f_sol.0.0.tc_tff.1.1547819846894583                 3  run-sim/amp.0.00044444444444444447.f_sol.0.0.tc_tff.1.1547819846894583          PENDING   --:--:--        --:--:--        --                   2024-04-30 16:06:58  --                                   0
run-sim_amp.1.8518518518518515e-05.f_sol.0.0.tc_tff.3.651741272548377                  3  run-sim/amp.1.8518518518518515e-05.f_sol.0.0.tc_tff.3.651741272548377           PENDING   --:--:--        --:--:--        --                   2024-04-30 16:07:00  --                                   0
run-sim_amp.0.00018518518518518518.f_sol.0.0.tc_tff.2.0535250264571463                 3  run-sim/amp.0.00018518518518518518.f_sol.0.0.tc_tff.2.0535250264571463          PENDING   --:--:--        --:--:--        --                   2024-04-30 16:07:02  --                                   0
run-sim_amp.0.0003518518518518519.f_sol.0.0.tc_tff.6.493816315762113                   3  run-sim/amp.0.0003518518518518519.f_sol.0.0.tc_tff.6.493816315762113            PENDING   --:--:--        --:--:--        --                   2024-04-30 16:07:04  --                                   0
run-sim_amp.7.407407407407406e-05.f_sol.0.0.tc_tff.1.539926526059492                   3  run-sim/amp.7.407407407407406e-05.f_sol.0.0.tc_tff.1.539926526059492            PENDING   --:--:--        --:--:--        --                   2024-04-30 16:07:06  --                                   0
run-sim_amp.0.00024074074074074072.f_sol.0.0.tc_tff.4.869675251658631                  3  run-sim/amp.0.00024074074074074072.f_sol.0.0.tc_tff.4.869675251658631           PENDING   --:--:--        --:--:--        --                   2024-04-30 16:07:08  --                                   0
jwhite242 commented 6 months ago

Few questions for you:

What version of maestro is this with?

Also, does the sacct command exist on your system?

We'd seen similar issues before with Slurm, and at least one fix for it went into 1.1.10: Maestro previously relied on squeue by default to check job status, which proved unreliable because Slurm can flush jobs from that queue before Maestro sees them; Maestro then loses track of the jobs and they can never get out of PENDING (or whatever their last known state was). A fallback to sacct was added in 1.1.10 for when jobs can no longer be found in squeue, to try and prevent this.

That being said, at least one user has reported having to get their cluster admins to turn on sacct, as it wasn't available for them. As far as I know (but definitely correct me if there is yet another way), sacct is the only way to really get at a job's status once it's no longer in the scheduling queues (which is what squeue looks at). There are probably some other corner cases where sacct can lose track of things if the accounting db gets reset/flushed at some point to reset the jobid numbers, but so far I haven't heard of that happening to a user.
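
Conceptually the fallback looks something like this; a simplified sketch in plain subprocess calls, not the actual adapter code, and the helper name is just for illustration:

import subprocess

def slurm_job_state(jobid):
    """Return the Slurm state for jobid, falling back to sacct once the
    job has been flushed from the scheduling queue."""
    # squeue only knows about jobs still in the scheduling queue
    result = subprocess.run(["squeue", "-h", "-j", jobid, "-o", "%T"],
                            capture_output=True, text=True)
    state = result.stdout.strip()
    if state:
        return state  # e.g. PENDING, RUNNING

    # No longer visible to squeue: ask the accounting database instead
    result = subprocess.run(["sacct", "-n", "-X", "-P", "-j", jobid, "-o", "State"],
                            capture_output=True, text=True)
    lines = result.stdout.strip().splitlines()
    return lines[0] if lines else "UNKNOWN"  # e.g. COMPLETED, FAILED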

BenWibking commented 6 months ago

This is version 1.1.10:

login4.stampede3(1012)$ maestro --version
maestro 1.1.10

sacct exists and shows the jobs that Maestro submitted:

login4.stampede3(1013)$ sacct
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
503983       run-sim_a+        spr tg-ast090+        112  COMPLETED      0:0
503983.batch      batch            tg-ast090+        112  COMPLETED      0:0
504003       run-sim_a+        spr tg-ast090+        112  COMPLETED      0:0
504003.batch      batch            tg-ast090+        112  COMPLETED      0:0
504004       run-sim_a+        spr tg-ast090+        112  COMPLETED      0:0
504004.batch      batch            tg-ast090+        112  COMPLETED      0:0
504005       run-sim_a+        spr tg-ast090+        112  COMPLETED      0:0
504005.batch      batch            tg-ast090+        112  COMPLETED      0:0
504006       run-sim_a+        spr tg-ast090+        112  COMPLETED      0:0
504006.batch      batch            tg-ast090+        112  COMPLETED      0:0
504007       run-sim_a+        spr tg-ast090+        112  COMPLETED      0:0
504007.batch      batch            tg-ast090+        112  COMPLETED      0:0
504008       run-sim_a+        spr tg-ast090+        112  COMPLETED      0:0
504008.batch      batch            tg-ast090+        112  COMPLETED      0:0
504009       run-sim_a+        spr tg-ast090+        112  COMPLETED      0:0
504009.batch      batch            tg-ast090+        112  COMPLETED      0:0
504010       run-sim_a+        spr tg-ast090+        112  COMPLETED      0:0
504010.batch      batch            tg-ast090+        112  COMPLETED      0:0
504011       run-sim_a+        spr tg-ast090+        112  COMPLETED      0:0
504011.batch      batch            tg-ast090+        112  COMPLETED      0:0
504012       run-sim_a+        spr tg-ast090+        112  COMPLETED      0:0
504012.batch      batch            tg-ast090+        112  COMPLETED      0:0
504013       run-sim_a+        spr tg-ast090+        112  COMPLETED      0:0
504013.batch      batch            tg-ast090+        112  COMPLETED      0:0
504014       run-sim_a+        spr tg-ast090+        112  COMPLETED      0:0
504014.batch      batch            tg-ast090+        112  COMPLETED      0:0
FrankD412 commented 6 months ago

@BenWibking -- do you happen to have the log for the study that saw the issue?

BenWibking commented 6 months ago

@BenWibking -- do you happen to have the log for the study that saw the issue?

Here is the log: medres_compressive.log.zip

jwhite242 commented 6 months ago

Well, that log doesn't show anything that explains what's going on. In future runs, can you try submitting with the debug log level turned on? See https://maestrowf.readthedocs.io/en/latest/Maestro/cli.html#maestro, or: maestro -d 1 run .... Note that it goes before the 'run' command, since it's a global option.

Seems our default log levels aren't quite enough to capture what's going on here (and maybe it's just time to add an always-on debug log to make sure we capture it all the time).

BenWibking commented 6 months ago

Ok, I've run a new study with -d 1 and I see additional DEBUG entries in the log. I'll upload this log if I see the same problem.

BenWibking commented 6 months ago

It indeed happened again with this run. Many of the jobs in the PENDING state should actually be in the RUNNING state, since SLURM shows that they are in fact running.

Here is the study log, the output of squeue --me, and the output of maestro status my_study --layout legacy | grep run-sim: maestro_slurm_logs.zip

There also seems to be some sort of false dependency here, maybe due to mixing sets and lists somewhere in the code?

2024-05-06 19:03:04,753 - maestrowf.datastructures.core.executiongraph:execute_ready_steps:870 - DEBUG - Checking run-sim_amp.0.00030452674897119345.f_sol.0.0.tc_tff.1.7465760476621182 -- []
2024-05-06 19:03:04,753 - maestrowf.datastructures.core.executiongraph:execute_ready_steps:874 - DEBUG - 'run-sim_amp.0.00030452674897119345.f_sol.0.0.tc_tff.1.7465760476621182' found to be initialized. Checking dependencies. 
2024-05-06 19:03:04,753 - maestrowf.datastructures.core.executiongraph:execute_ready_steps:877 - DEBUG - Unfulfilled dependencies: set()
2024-05-06 19:03:04,753 - maestrowf.datastructures.core.executiongraph:execute_ready_steps:886 - DEBUG - Completed dependencies: []
Remaining dependencies: set()
2024-05-06 19:03:04,753 - maestrowf.datastructures.core.executiongraph:execute_ready_steps:897 - DEBUG - Already staged. Passing.
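
Purely as an illustration of the kind of set/list mix-up I mean (a toy example, not Maestro's actual ExecutionGraph code):

class StepRecord:
    """Toy stand-in for a step's dependency bookkeeping."""
    def __init__(self, dependencies):
        self.unfulfilled = set(dependencies)  # dependencies tracked as a set
        self.completed = []                   # completions tracked as a list

    def mark_dependency_complete(self, name):
        self.unfulfilled.discard(name)
        # If a finished parent is only ever removed from the set and never
        # appended here, a later status check prints
        # "Unfulfilled dependencies: set()" next to "Completed dependencies: []",
        # which is what the log above looks like.
        # self.completed.append(name)

step = StepRecord(["generate-infile"])
step.mark_dependency_complete("generate-infile")
print("Unfulfilled dependencies:", step.unfulfilled)  # set()
print("Completed dependencies:", step.completed)      # []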
jwhite242 commented 6 months ago

Well, I've still got some digging to do in here, as there is some suspicious behavior: I've found a few of those steps that start out with some unfulfilled dependencies that never get moved into completed when they're removed from unfulfilled. Not sure yet whether it's a bug in the dependency handling, just some bad log printouts (found a few issues there too), or something else. But it does look like it's setting up the edges/dependencies correctly at the start, e.g. this one corresponding to your snippet (as well as the parent generate profile that's connected to the generate-infile):

2024-05-06 16:43:05,949 - maestrowf.datastructures.core.study:_stage:736 - INFO - Adding edge (generate-infile_amp.0.00030452674897119345.f_sol.0.0.tc_tff.1.7465760476621182, run-sim_amp.0.00030452674897119345.f_sol.0.0.tc_tff.1.7465760476621182)...
2024-05-06 16:43:05,949 - root:add_edge:102 - DEBUG - Edge (generate-infile_amp.0.00030452674897119345.f_sol.0.0.tc_tff.1.7465760476621182, run-sim_amp.0.00030452674897119345.f_sol.0.0.tc_tff.1.7465760476621182) added.

Thanks for the debug log on this one though; much more helpful info so far!

BenWibking commented 6 months ago

Yeah, all of the pre-processing dependencies seem to work fine. But once it gets to the SLURM jobs, things seem to stop behaving as expected.

jwhite242 commented 6 months ago

Well, still digging into what actually went wrong here...

Still looking into how the jobid is getting overwritten with an invalid value, after which Maestro promptly loses the ability to check any of those jobs again. Does that number 3 ever show up in the jobid column of the status?

BenWibking commented 6 months ago

I checked, and all of the pending SLURM jobs have a jobid value of 3 reported by maestro status --layout legacy, e.g.:

run-sim_amp.0.0002222222222222222.f_sol.0.0.tc_tff.1.333521432163324            3         run-sim/amp.0.0002222222222222222.f_sol.0.0.tc_tff.1.333521432163324            PENDING      --:--:--        --:--:--        --                   2024-05-07 16:56:59  --                                   0
run-sim_amp.0.0003888888888888888.f_sol.0.0.tc_tff.4.216965034285822            3         run-sim/amp.0.0003888888888888888.f_sol.0.0.tc_tff.4.216965034285822            PENDING      --:--:--        --:--:--        --                   2024-05-07 16:57:00  --                                   0
run-sim_amp.0.00011111111111111112.f_sol.0.0.tc_tff.2.371373705661655           3         run-sim/amp.0.00011111111111111112.f_sol.0.0.tc_tff.2.371373705661655           PENDING      --:--:--        --:--:--        --                   2024-05-07 16:57:02  --                                   0

I am wondering if this might be due to having some strange custom formatting turned on for SLURM output on this cluster. How does Maestro get the jobids from SLURM?

BenWibking commented 6 months ago

Ok, that's the issue. I see that maestro extracts the jobid by parsing the output of sbatch with this regex: https://github.com/LLNL/maestrowf/blob/b9ea2f01dbcfc44000cf730019d8db3189209446/maestrowf/interfaces/script/slurmscriptadapter.py#L241

On this machine, submitting a job with sbatch produces output like this:

$ sbatch run-sim_amp.0.0004259259259259259.f_sol.0.0.tc_tff.8.058421877614817.slurm.sh

-----------------------------------------------------------------
          Welcome to the Stampede3 Supercomputer
-----------------------------------------------------------------

No reservation for this job
--> Verifying valid submit host (login4)...OK
--> Verifying valid jobname...OK
--> Verifying valid ssh keys...OK
--> Verifying access to desired queue (spr)...OK
--> Checking available allocation (TG-AST090040)...OK
--> Quotas are not currently enabled for filesystem /home1/02661/bwibking...OK
--> Verifying that quota for filesystem /work2/02661/bwibking/stampede3 is at  0.00% allocated...OK
Submitted batch job 519605

The above regex then matches the 3 in the "Welcome to the Stampede3 Supercomputer" banner line.
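
A quick demonstration of the failure mode; the bare digit pattern below is just a stand-in for the actual regex, but any unanchored digit search behaves the same way on this banner:

import re

sbatch_output = """
-----------------------------------------------------------------
          Welcome to the Stampede3 Supercomputer
-----------------------------------------------------------------
No reservation for this job
Submitted batch job 519605
"""

match = re.search(r"\d+", sbatch_output)
print(match.group())  # prints "3", taken from "Stampede3", not the job id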

jwhite242 commented 6 months ago

Well, that's definitely a new one; I was going to say, we do fix the squeue and sacct formats to guard against users' customizations breaking this, since that's happened before. Good to know the sbatch output can actually be customized like that! Think this will be a reasonably easy fix, pending some digging in Slurm's docs to see whether we can reliably expect that 'Submitted batch job' line to be present and whether it's always at the end.

BenWibking commented 6 months ago

I tried adding --parsable to the sbatch flags but I still got some extraneous output:

$ sbatch --parsable run-sim_amp.0.0004259259259259259.f_sol.0.0.tc_tff.8.058421877614817.slurm.sh

-----------------------------------------------------------------
          Welcome to the Stampede3 Supercomputer
-----------------------------------------------------------------

No reservation for this job
--> Verifying valid submit host (login4)...OK
--> Verifying valid jobname...OK
--> Verifying valid ssh keys...OK
--> Verifying access to desired queue (spr)...OK
--> Checking available allocation (TG-AST090040)...OK
--> Quotas are not currently enabled for filesystem /home1/02661/bwibking...OK
--> Verifying that quota for filesystem /work2/02661/bwibking/stampede3 is at  0.00% allocated...OK
519781

SLURM may have been modified on this system. This site has a tendency to have unexpected customizations.
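
For what it's worth, even with --parsable the id would have to be picked out heuristically on this machine, e.g. something like taking the last line that looks like a job id (a hypothetical sketch, not what Maestro does):

def parse_parsable_jobid(output):
    """Take the last line of sbatch --parsable output that looks like a job id
    ("<jobid>" or "<jobid>;<cluster>"), skipping any site banner lines."""
    for line in reversed(output.strip().splitlines()):
        candidate = line.strip().split(";")[0]
        if candidate.isdigit():
            return candidate
    return None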

FrankD412 commented 6 months ago

I think the Submitted batch job <ID> format is standard, if I'm not mistaken. It could be a simple regex for the 'Submitted batch job' message and then parsing that line. That seems like the easiest fix.

In hindsight, it was probably not a good idea to just regex for a digit... here is where the regex happens.
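
Something along these lines, just as a sketch of the idea rather than what the fix has to look like:

import re

def parse_sbatch_jobid(output):
    """Return the job id from standard sbatch output, or None if not found."""
    match = re.search(r"Submitted batch job\s+(\d+)", output)
    return match.group(1) if match else None

# Works even when a site banner full of digits precedes the submission line.
print(parse_sbatch_jobid("Welcome to the Stampede3 Supercomputer\n"
                         "Submitted batch job 519605\n"))  # prints "519605"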

BenWibking commented 6 months ago

I've resubmitted the study above that failed, this time with PR #443 applied, and it now assigns valid jobids to the SLURM jobs and correctly lists them as running:

run-sim_amp.0.0002222222222222222.f_sol.0.0.tc_tff.1.333521432163324            520134    run-sim/amp.0.0002222222222222222.f_sol.0.0.tc_tff.1.333521432163324            RUNNING      --:--:--        0d:00h:09m:57s  2024-05-08 14:44:34  2024-05-08 14:42:45  --                                   0
run-sim_amp.0.0003888888888888888.f_sol.0.0.tc_tff.4.216965034285822            520135    run-sim/amp.0.0003888888888888888.f_sol.0.0.tc_tff.4.216965034285822            RUNNING      --:--:--        0d:00h:09m:55s  2024-05-08 14:44:34  2024-05-08 14:42:47  --                                   0
run-sim_amp.0.00011111111111111112.f_sol.0.0.tc_tff.2.371373705661655           520136    run-sim/amp.0.00011111111111111112.f_sol.0.0.tc_tff.2.371373705661655           RUNNING      --:--:--        0d:00h:09m:54s  2024-05-08 14:44:34  2024-05-08 14:42:48  --                                   0
jwhite242 commented 6 months ago

Think this might be what's adding all that extra output on your cluster: https://slurm.schedmd.com/job_submit_plugins.html, with some custom validation/logging messages always being spit out (I'll wager srun and salloc do the same?). Haven't found anything yet about whether these log messages are always dumped before the job id line, but that seems likely given most of this happens before the actual submission. Almost think it might be safer to leave the --parsable option off and have that regex account for the 'Submitted batch job' prefix on the line, so we can be sure the number we detect is actually the one we want?

But glad to hear your fix has got you running again in the meantime; I'll be disappearing for a day or so here and won't be able to review/approve that fix till tomorrow/Friday at the earliest. And thanks again, both for the PR with the fix and for tracking down the source of the issue; really appreciate that!

BenWibking commented 6 months ago

Think this might be what's adding all that extra output on your cluster: https://slurm.schedmd.com/job_submit_plugins.html, with some custom validation/logging messages always being spit out (I'll wager srun and salloc do the same?). Haven't found anything yet about whether these log messages are always dumped before the job id line, but that seems likely given most of this happens before the actual submission. Almost think it might be safer to leave the --parsable option off and have that regex account for the 'Submitted batch job' prefix on the line, so we can be sure the number we detect is actually the one we want?

Leaving --parsable off for that reason makes sense to me. I think this is the relevant sbatch source code: https://github.com/SchedMD/slurm/blob/f34478ab60c1ee530ef6d20352ef99662d87e131/src/sbatch/sbatch.c#L332

But glad to hear your fix has got you running again in the meantime; I'll be disappearing for a day or so here and won't be able to review/approve that fix till tomorrow/Friday at the earliest. And thanks again, both for the PR with the fix and for tracking down the source of the issue; really appreciate that!

đź‘Ť