Open doutriaux1 opened 4 years ago
I believe what you're seeing is the result of having a local step running in parallel with the scheduled step. Since the local step is long-running, the other scripts aren't generated until that first local step is done -- so everything (script generation included) is stuck behind it. That's why you see the processes appear but not the full set, and why you don't see the other scripts.
This behavior will be changing in the near future. The local execution will be parallelized and behave more like a scheduled step where processes will be started and the workflow will be allowed to continue. It does look like the conductor itself is still running, so Maestro hasn't crashed.
That said, the status that gets dumped is written after the first set of steps -- which means the local execution of a slow step holds that up too. I think better user feedback here might be to dump the status first, so that it doesn't give the impression that the job failed.
@doutriaux1 -- What do you think?
@FrankD412 I agree the message is confusing and I ended up restarting maestro many times before realizing I was filling my node with long running jobs. So I agree that giving a better message to the user would be better. I'm willing to beta test if you want.
@doutriaux1 -- Sounds good. I'm working on a prototype and will let you know when you can give it a shot.
@FrankD412 as an FYI, I seem to be getting similar behavior when using:
type: local_parallel
But I'm not 100% sure this is an officially supported feature.
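For reference, this is roughly where I'm declaring it -- a sketch following the usual batch block layout; the local_parallel name itself is the part I'm unsure about:

```yaml
batch:
  type: local_parallel  # possibly not in the official release?
```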
@doutriaux1 -- that's not a supported adapter in the current version. Are you referring to my fork?
@FrankD412 I've seen it in another study and was trying it out because I thought it was already back in the official repo. That explains why it confuses the status.
@doutriaux1 -- Oh got it. Yeah, that's a different prototype for running a conductor locally in an allocation. I definitely aliased the name in my fork. Sorry about the confusion.
I have a study that starts a bunch of Python jobs. The
maestro run
part went fine. Then I issued a maestro status
and it told me something probably went wrong. Eventually I realized some of my jobs were actually running, just slowly. Also, the study has 2 sets of jobs that it can start independently at the beginning, but somehow the second set is not even generating the command-line scripts, apparently waiting for the first set of jobs to finish.
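A stripped-down sketch of the spec shape (step and script names here are made up; the real spec is attached below):

```yaml
study:
  - name: set-a
    description: First set of long-running local python jobs.
    run:
      cmd: |
        python job_a.py

  - name: set-b
    description: Second set; it has no depends on set-a, so I expected it to start right away.
    run:
      cmd: |
        python job_b.py
```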
Screen output
ps command that shows jobs are actually running
yaml file
generator py file