BenWibking opened this issue 6 months ago
A few questions for you:
What version of Maestro is this?
Also, does the sacct command exist on your system?
We've seen similar issues with Slurm before, and at least one fix for it went into 1.1.10: Maestro's previous default of using squeue to check job status proved unreliable, as Slurm can flush jobs from that queue before Maestro sees them; Maestro then loses track of the jobs and they can never get out of PENDING (or whatever their last known state was). A fallback to sacct was added in 1.1.10 for when jobs can no longer be found in squeue, to try to prevent this.
That being said, at least one user reported having to get their cluster admins to turn on sacct, as it wasn't available for them. As far as I know (but definitely correct me if there is yet another way), sacct is the only way to really get at a job's status after it's no longer in the scheduling queues (which is what squeue looks at). There are probably some corner cases where sacct can also lose track of things, e.g. if the accounting database gets reset/flushed at some point to reset the job id numbers, but so far I haven't heard of that happening to a user.
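For illustration, here is a minimal sketch of the squeue-then-sacct fallback described above. This is not Maestro's actual implementation; the function name and the particular flags are just one reasonable way to query the two commands:

```python
import subprocess

def check_job_state(jobid):
    """Sketch of a squeue -> sacct fallback: ask squeue first, and if the job
    has already been flushed from the scheduling queue, fall back to the
    accounting database via sacct (which must be enabled on the cluster)."""
    squeue = subprocess.run(
        ["squeue", "-j", str(jobid), "-h", "-o", "%T"],
        capture_output=True, text=True,
    )
    state = squeue.stdout.strip()
    if squeue.returncode == 0 and state:
        return state  # e.g. PENDING, RUNNING

    # Job is gone from squeue; sacct can still report its final state.
    sacct = subprocess.run(
        ["sacct", "-j", str(jobid), "-X", "-n", "-P", "--format=State"],
        capture_output=True, text=True,
    )
    return sacct.stdout.strip() or "UNKNOWN"  # e.g. COMPLETED
```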
This is version 1.1.10:
login4.stampede3(1012)$ maestro --version
maestro 1.1.10
sacct exists and shows the jobs that Maestro submitted:
login4.stampede3(1013)$ sacct
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
503983 run-sim_a+ spr tg-ast090+ 112 COMPLETED 0:0
503983.batch batch tg-ast090+ 112 COMPLETED 0:0
504003 run-sim_a+ spr tg-ast090+ 112 COMPLETED 0:0
504003.batch batch tg-ast090+ 112 COMPLETED 0:0
504004 run-sim_a+ spr tg-ast090+ 112 COMPLETED 0:0
504004.batch batch tg-ast090+ 112 COMPLETED 0:0
504005 run-sim_a+ spr tg-ast090+ 112 COMPLETED 0:0
504005.batch batch tg-ast090+ 112 COMPLETED 0:0
504006 run-sim_a+ spr tg-ast090+ 112 COMPLETED 0:0
504006.batch batch tg-ast090+ 112 COMPLETED 0:0
504007 run-sim_a+ spr tg-ast090+ 112 COMPLETED 0:0
504007.batch batch tg-ast090+ 112 COMPLETED 0:0
504008 run-sim_a+ spr tg-ast090+ 112 COMPLETED 0:0
504008.batch batch tg-ast090+ 112 COMPLETED 0:0
504009 run-sim_a+ spr tg-ast090+ 112 COMPLETED 0:0
504009.batch batch tg-ast090+ 112 COMPLETED 0:0
504010 run-sim_a+ spr tg-ast090+ 112 COMPLETED 0:0
504010.batch batch tg-ast090+ 112 COMPLETED 0:0
504011 run-sim_a+ spr tg-ast090+ 112 COMPLETED 0:0
504011.batch batch tg-ast090+ 112 COMPLETED 0:0
504012 run-sim_a+ spr tg-ast090+ 112 COMPLETED 0:0
504012.batch batch tg-ast090+ 112 COMPLETED 0:0
504013 run-sim_a+ spr tg-ast090+ 112 COMPLETED 0:0
504013.batch batch tg-ast090+ 112 COMPLETED 0:0
504014 run-sim_a+ spr tg-ast090+ 112 COMPLETED 0:0
504014.batch batch tg-ast090+ 112 COMPLETED 0:0
@BenWibking -- do you happen to have the log to the study that saw the issue?
Here is the log: medres_compressive.log.zip
Well, that log doesn't show anything that explains what's going on. In future runs, can you try submitting with the debug log level turned on? See https://maestrowf.readthedocs.io/en/latest/Maestro/cli.html#maestro, or: maestro -d 1 run ...
Note that it goes before the 'run' command, as it's a global option.
Seems our default log levels aren't quite enough to capture what's going on here (and maybe it's just time to add an always-on debug log to make sure we capture it all the time).
Ok, I've run a new study with -d 1 and I see additional DEBUG entries in the log. I'll upload this log if I see the same problem.
It indeed happened again with this run. Many of the jobs in the PENDING state should actually be in the RUNNING state, since SLURM shows that they are in fact running.
Here is the study log, the output of squeue --me, and the output of maestro status my_study --layout legacy | grep run-sim:
maestro_slurm_logs.zip
There also seems to be some sort of false dependency here, maybe due to mixing sets and lists somewhere in the code?
2024-05-06 19:03:04,753 - maestrowf.datastructures.core.executiongraph:execute_ready_steps:870 - DEBUG - Checking run-sim_amp.0.00030452674897119345.f_sol.0.0.tc_tff.1.7465760476621182 -- []
2024-05-06 19:03:04,753 - maestrowf.datastructures.core.executiongraph:execute_ready_steps:874 - DEBUG - 'run-sim_amp.0.00030452674897119345.f_sol.0.0.tc_tff.1.7465760476621182' found to be initialized. Checking dependencies.
2024-05-06 19:03:04,753 - maestrowf.datastructures.core.executiongraph:execute_ready_steps:877 - DEBUG - Unfulfilled dependencies: set()
2024-05-06 19:03:04,753 - maestrowf.datastructures.core.executiongraph:execute_ready_steps:886 - DEBUG - Completed dependencies: []
Remaining dependencies: set()
2024-05-06 19:03:04,753 - maestrowf.datastructures.core.executiongraph:execute_ready_steps:897 - DEBUG - Already staged. Passing.
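As a purely generic illustration of that "mixing sets and lists" suspicion (this is not Maestro's code; the names here are made up), the kind of drift in question arises when unfulfilled dependencies are tracked as a set and completed ones as a list, and the two views aren't updated together:

```python
# Hypothetical sketch only: illustrates the set/list bookkeeping drift suspected above.
unfulfilled = {"generate-infile_amp.0.0003..."}  # pending dependencies, kept as a set
completed = []                                   # finished dependencies, kept as a list

def mark_dependency_done(dep):
    unfulfilled.discard(dep)
    completed.append(dep)  # forgetting this mirror update leaves the two views inconsistent

mark_dependency_done("generate-infile_amp.0.0003...")
# After every update, the two collections should agree on what remains:
assert not unfulfilled and "generate-infile_amp.0.0003..." in completed
```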
Well, I've still got some digging to do in here, as there is some suspicious behavior: I've found a few of those steps that start out with some unfulfilled dependencies that never get moved into completed when they're removed from unfulfilled. Not sure yet whether it's a bug in the dependency handling, just some bad log printouts (found a few issues there too), or something else. But it does look like it's setting up the edges/dependencies correctly at the start, e.g. this corresponds to your snippet (as well as the parent generate-profile step that's connected to generate-infile):
2024-05-06 16:43:05,949 - maestrowf.datastructures.core.study:_stage:736 - INFO - Adding edge (generate-infile_amp.0.00030452674897119345.f_sol.0.0.tc_tff.1.7465760476621182, run-sim_amp.0.00030452674897119345.f_sol.0.0.tc_tff.1.7465760476621182)...
2024-05-06 16:43:05,949 - root:add_edge:102 - DEBUG - Edge (generate-infile_amp.0.00030452674897119345.f_sol.0.0.tc_tff.1.7465760476621182, run-sim_amp.0.00030452674897119345.f_sol.0.0.tc_tff.1.7465760476621182) added.
Thanks for the debug log on this one though; much more helpful info so far!
Yeah, all of the pre-processing dependencies seem to work fine. But once it gets to the SLURM jobs, things seem to stop behaving as expected.
Well, still digging into what actually went wrong here...
Still looking at how the jobid is getting overwritten with an invalid value, after which Maestro promptly loses the ability to check any of those jobs again. Does that number 3 ever show up in the jobid column of the status?
I checked, and it shows that all of the pending SLURM jobs have jobid values of 3 reported when doing maestro status --layout legacy, e.g.:
run-sim_amp.0.0002222222222222222.f_sol.0.0.tc_tff.1.333521432163324 3 run-sim/amp.0.0002222222222222222.f_sol.0.0.tc_tff.1.333521432163324 PENDING --:--:-- --:--:-- -- 2024-05-07 16:56:59 -- 0
run-sim_amp.0.0003888888888888888.f_sol.0.0.tc_tff.4.216965034285822 3 run-sim/amp.0.0003888888888888888.f_sol.0.0.tc_tff.4.216965034285822 PENDING --:--:-- --:--:-- -- 2024-05-07 16:57:00 -- 0
run-sim_amp.0.00011111111111111112.f_sol.0.0.tc_tff.2.371373705661655 3 run-sim/amp.0.00011111111111111112.f_sol.0.0.tc_tff.2.371373705661655 PENDING --:--:-- --:--:-- -- 2024-05-07 16:57:02 -- 0
I am wondering if this might be due to having some strange custom formatting turned on for SLURM output on this cluster. How does Maestro get the jobids from SLURM?
Ok, that's the issue. I see that Maestro extracts the jobid by parsing the output of sbatch with this regex: https://github.com/LLNL/maestrowf/blob/b9ea2f01dbcfc44000cf730019d8db3189209446/maestrowf/interfaces/script/slurmscriptadapter.py#L241
On this machine, submitting a job with sbatch produces output like this:
$ sbatch run-sim_amp.0.0004259259259259259.f_sol.0.0.tc_tff.8.058421877614817.slurm.sh
-----------------------------------------------------------------
Welcome to the Stampede3 Supercomputer
-----------------------------------------------------------------
No reservation for this job
--> Verifying valid submit host (login4)...OK
--> Verifying valid jobname...OK
--> Verifying valid ssh keys...OK
--> Verifying access to desired queue (spr)...OK
--> Checking available allocation (TG-AST090040)...OK
--> Quotas are not currently enabled for filesystem /home1/02661/bwibking...OK
--> Verifying that quota for filesystem /work2/02661/bwibking/stampede3 is at 0.00% allocated...OK
Submitted batch job 519605
The above regex then matches the 3 in "Welcome to the Stampede3 Supercomputer".
Well, that's definitely a new one; I was going to say, we do fix the squeue and sacct formats to guard against user customizations breaking things, since that's happened before. Good to know the sbatch output can actually be customized like that! I think this will be a reasonably easy fix, pending some digging in Slurm's docs to see whether we can reliably expect that 'Submitted batch job <id>' line to be there.
I tried adding --parsable to the sbatch flags, but I still got some extraneous output:
$ sbatch --parsable run-sim_amp.0.0004259259259259259.f_sol.0.0.tc_tff.8.058421877614817.slurm.sh
-----------------------------------------------------------------
Welcome to the Stampede3 Supercomputer
-----------------------------------------------------------------
No reservation for this job
--> Verifying valid submit host (login4)...OK
--> Verifying valid jobname...OK
--> Verifying valid ssh keys...OK
--> Verifying access to desired queue (spr)...OK
--> Checking available allocation (TG-AST090040)...OK
--> Quotas are not currently enabled for filesystem /home1/02661/bwibking...OK
--> Verifying that quota for filesystem /work2/02661/bwibking/stampede3 is at 0.00% allocated...OK
519781
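If the --parsable route were pursued anyway, the jobid would have to be picked out from amid the banner text, e.g. by taking the last line that is purely a jobid (or jobid;cluster). A hedged sketch, not actual Maestro code:

```python
import re

# --parsable prints "jobid" (or "jobid;cluster"), but this site's banner still precedes it,
# so scan for the last line matching that shape rather than grabbing the first digit anywhere.
PARSABLE_LINE = re.compile(r"^(\d+)(?:;\S+)?$")

def jobid_from_parsable(sbatch_output):
    for line in reversed(sbatch_output.strip().splitlines()):
        match = PARSABLE_LINE.match(line.strip())
        if match:
            return int(match.group(1))
    return None

# For the Stampede3 output above, this returns 519781 rather than the 3 in "Stampede3".
```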
SLURM may have been modified on this system; this site tends to have unexpected customizations.
I think the Submitted batch job <ID> format is standard, if I'm not mistaken. We could use a simple regex for the initial submitted message and then parse that line. That seems like the easiest fix.
In hindsight, it was probably not a good idea to just regex for a digit... here is where the regex happens.
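A minimal sketch of that approach (illustrative only, not necessarily the exact regex used in the fix): anchor on the standard 'Submitted batch job' line instead of on the first integer anywhere in sbatch's output.

```python
import re

# Match the job id only on the standard submission line, so banner text such as
# "Welcome to the Stampede3 Supercomputer" can never be mistaken for a job id.
SUBMITTED_RE = re.compile(r"^Submitted batch job\s+(\d+)", re.MULTILINE)

def parse_sbatch_jobid(sbatch_output):
    match = SUBMITTED_RE.search(sbatch_output)
    return int(match.group(1)) if match else None

# For the Stampede3 transcript above, this yields 519605 rather than 3.
```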
I've resubmitted the study above that failed, this time with PR #443 applied, and it assigns valid jobids to the SLURM jobs and correctly lists them as running:
run-sim_amp.0.0002222222222222222.f_sol.0.0.tc_tff.1.333521432163324 520134 run-sim/amp.0.0002222222222222222.f_sol.0.0.tc_tff.1.333521432163324 RUNNING --:--:-- 0d:00h:09m:57s 2024-05-08 14:44:34 2024-05-08 14:42:45 -- 0
run-sim_amp.0.0003888888888888888.f_sol.0.0.tc_tff.4.216965034285822 520135 run-sim/amp.0.0003888888888888888.f_sol.0.0.tc_tff.4.216965034285822 RUNNING --:--:-- 0d:00h:09m:55s 2024-05-08 14:44:34 2024-05-08 14:42:47 -- 0
run-sim_amp.0.00011111111111111112.f_sol.0.0.tc_tff.2.371373705661655 520136 run-sim/amp.0.00011111111111111112.f_sol.0.0.tc_tff.2.371373705661655 RUNNING --:--:-- 0d:00h:09m:54s 2024-05-08 14:44:34 2024-05-08 14:42:48 -- 0
Think this might be what's adding all that extra output on your cluster: https://slurm.schedmd.com/job_submit_plugins.html, with some custom validation/logging messages always printed out (I'll wager srun and salloc do the same?). I haven't found anything yet about whether these log messages are always dumped out before the job id line, but that seems likely given most of this happens before the actual submission. I almost think it might be safer to leave the --parsable option off and have that regex account for the 'Submitted batch job' prefix on the line, so we can be sure the number we detect is actually the one we want.
But glad to hear your fix has got you running again in the meantime; I'll be disappearing for a day or so here and won't be able to review/approve that fix until tomorrow/Friday at the earliest. And thanks again, both for the PR to fix it and for tracking down the source of the issue; really appreciate that!
Leaving --parsable off for that reason makes sense to me. I think this is the relevant sbatch source code: https://github.com/SchedMD/slurm/blob/f34478ab60c1ee530ef6d20352ef99662d87e131/src/sbatch/sbatch.c#L332
👍
I've encountered a confusing situation: the conductor is still running as a background process and my jobs have all completed successfully, but running maestro status lists the jobs that were run via SLURM as still "PENDING". Is there any way to figure out what has gone wrong, or otherwise reset the conductor process?
Running ps aux | grep $USER, I see:
Partial output from maestro status: