Andrew-S-Rosen opened this issue 1 year ago
This is... a known issue that we are working on.

Invoking too many qstats in a short period of time can overwhelm the queuing system, so we run one every *n* seconds¹ for all jobs tracked by an executor (actually, there's a single thread for all executor instances of the same kind in a process). The `wait()` call basically delays things until that happens. But if you want to be sure you get a non-`NEW` status, you have to wait at least `BatchSchedulerExecutorConfig.initial_queue_polling_delay` seconds.
If you can do it, using notifications is the way to go. If you really have to use `attach()`, you should wait until the job is not in a `NEW` state (either using `wait()` or callbacks) or, alternatively, long enough as above.
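As a rough sketch of why reading a job's status right after `attach()` is racy, here is a toy stand-in (none of this is the real PSI/J API; `ToyJob`, `ToyExecutor`, and the 0.1 s delay are all invented for illustration): a background poll is what eventually moves the job past `NEW`, and blocking until that first poll is exactly what the wait-for-non-`NEW` recommendation buys you.

```python
import threading
import time

class ToyJob:
    """Toy stand-in for a PSI/J job: just a state plus a condition variable."""
    def __init__(self):
        self.state = "NEW"
        self.cond = threading.Condition()

    def wait_non_new(self, timeout=None):
        # Block until the first background poll moves the job past NEW.
        with self.cond:
            self.cond.wait_for(lambda: self.state != "NEW", timeout=timeout)
            return self.state

class ToyExecutor:
    """Toy stand-in for a batch-scheduler executor."""
    def attach(self, job, native_id):
        # attach() returns immediately; a background "qstat" thread
        # updates the state after a short delay (standing in for the
        # initial queue polling delay).
        def poll():
            time.sleep(0.1)
            with job.cond:
                job.state = "QUEUED"
                job.cond.notify_all()
        threading.Thread(target=poll, daemon=True).start()

job = ToyJob()
ToyExecutor().attach(job, "12345")
status_right_after_attach = job.state            # very likely still "NEW"
status_after_wait = job.wait_non_new(timeout=5)  # "QUEUED" once polled
```

The same shape applies to the real library: either block on the first poll, or register a callback and let the status come to you.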
The current situation is not particularly nice, and there is more demand for `attach()` than we anticipated. So one proposal (as of a few days ago, and which we haven't yet implemented) is to decree that `attach()` should only return once a non-`NEW` state is obtained from the scheduler. The complexity is that we still want to be mindful of the scheduler and try to batch calls to the respective `qstat` if many `attach()` requests happen in a short period of time from the same process.

If it's really `attach()` calls from different processes that need to happen, the above behavior will still apply (i.e., `attach()` will only return after a current state is obtained), but it's unreasonably difficult to make separate processes communicate in a way that batches sufficiently close-in-time calls. So the only thing left to do is warn users in the documentation that things don't scale well.
¹ I lie a bit. There's an initial delay (2 s by default) and a separate polling interval (30 s by default).
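To put rough numbers on that footnote (the constant names here are hypothetical; only the 2 s / 30 s defaults come from the discussion above), the earliest time an attached job can get a non-`NEW` status looks like this:

```python
# Hypothetical names; the 2 s initial delay and 30 s interval mirror
# the defaults mentioned in the footnote above.
INITIAL_DELAY = 2.0   # seconds before the first qstat after executor start
POLL_INTERVAL = 30.0  # seconds between subsequent qstats

def time_of_poll(k):
    """Time (seconds after executor start) of the k-th status poll, k >= 1."""
    return INITIAL_DELAY + (k - 1) * POLL_INTERVAL

def first_update_after(attach_time):
    """Earliest poll at or after attach_time: the soonest an attached
    job's status can move past NEW."""
    k = 1
    while time_of_poll(k) < attach_time:
        k += 1
    return time_of_poll(k)

print(first_update_after(0.0))  # 2.0: covered by the initial delay
print(first_update_after(5.0))  # 32.0: just missed poll 1, wait for poll 2
```

The takeaway is that a fixed sleep only works if it exceeds the worst of these two figures.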
Thanks again, @hategan!! Very clear response. I greatly appreciate it!
> If you can do it, using notifications is the way to go. If you really have to use `attach()`, you should wait until the job is not in a `NEW` state (either using `wait()` or callbacks) or, alternatively, long enough as above.
Won't calling `.wait()` block until the job finishes? If my understanding is correct, that would be problematic for my use case because I'd like to be able to retrieve a `QUEUED` state, which seems like it'd be ignored while the blocking happens. Either way, `attach()` is indeed the only option for me since I'm doing things in an async way where I'm SSHing into the remote machine every polling interval to check the job state (at least until PSI/J has built-in support for this kind of thing 😉).
> So one proposal (as of a few days ago, and which we haven't yet implemented) is to decree that `attach()` should only return once a non-`NEW` state is obtained from the scheduler. The complexity is that we want to still be mindful of the scheduler and try to batch calls to the respective `qstat` if many `attach()` requests are happening in a short period of time from the same process.
That would certainly be welcome! Of course, I understand the challenge with not just bombarding the system with `qstat` calls.
Regardless of the ultimate plan here, this is super helpful because now I know that this delay period is intentional, so I can always be sure that if I `.sleep()` for longer than that, I should be okay!
@hategan: This is perhaps worthy of a separate discussion, but it seems that there is currently no way to get the status of a job (via `.attach`) after it is completed and no longer in the `squeue` system. That doesn't mean the job status can't be accessed because, at least with Slurm, you can instead use `sacct -j` to get the historical job data. So, perhaps in the future a fallback option could be added if the job isn't found via the primary search, or maybe it makes sense to always use `sacct` instead of `squeue`. Something to think about...
[...]
> If you can do it, using notifications is the way to go. If you really have to use `attach()`, you should wait until the job is not in a `NEW` state (either using `wait()` or callbacks) or, alternatively, long enough as above.

> Won't calling `.wait()` block until the job finishes?
In principle, `wait(target_states=[JobState.QUEUED])` should wait until the job is in a `QUEUED` state or any state that follows `QUEUED`. The spec mentions this correctly, but the API docs don't, so we need to update that.
> If my understanding is correct, that would be problematic for my use case because I'd like to be able to retrieve a `QUEUED` state, which seems like it'd be ignored while the blocking happens.
That was indeed the case until it became clear that we were forcing a race condition, so the behaviour above is now in effect. In other words, `wait(target_states=[JobState.QUEUED])` should wait until the job reaches anything after `NEW`. You could also list all desired target states explicitly, but that shouldn't be necessary.
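The "`QUEUED` or anything after it" semantics can be modeled with an ordered enum. This is a deliberate simplification (as noted later in the thread, the spec's real ordering is subtler, e.g. `FAILED` does not imply the job was ever `ACTIVE`), but it captures why a wait on `QUEUED` also returns for a job that has already finished:

```python
from enum import IntEnum

class State(IntEnum):
    # Simplified linear ordering for illustration only; not the real
    # psij JobState class.
    NEW = 0
    QUEUED = 1
    ACTIVE = 2
    COMPLETED = 3
    FAILED = 4
    CANCELED = 5

def wait_satisfied(current, target):
    """wait(target_states=[target]) should return once the job is in
    `target` or any state that follows it."""
    return current >= target

assert wait_satisfied(State.ACTIVE, State.QUEUED)   # past QUEUED: returns
assert wait_satisfied(State.FAILED, State.QUEUED)   # terminal: returns
assert not wait_satisfied(State.NEW, State.QUEUED)  # still NEW: keeps waiting
```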
> Either way, `attach()` is indeed the only option for me since I'm doing things in an async way where I'm SSHing into the remote machine every polling interval to check the job state (at least until PSI/J has built-in support for this kind of thing 😉).
Yes, remote PSI/J should address this.
[...]
Awesome, thanks for the tip! Didn't know about the `target_states` kwarg in `wait`!
Very much looking forward to remote PSI/J!!
> @hategan: This is perhaps worthy of a separate discussion, but it seems that there is currently no way to get the status of a job (via `.attach`) after it is completed and no longer in the `squeue` system. That doesn't mean the job status can't be accessed because, at least with Slurm, you can instead use `sacct -j` to get the historical job data. So, perhaps in the future a fallback option could be added if the job isn't found via the primary search, or maybe it makes sense to always use `sacct` instead of `squeue`. Something to think about...
In general, that's the problem with synchronous status: you can miss states, and some of the mechanisms that PSI/J uses to detect exit codes and other conditions are lost with `attach()`. Using `sacct` may work in some cases, but it depends, in my understanding, on how Slurm is configured. So it would be reaching a threshold of "it may provide some limited benefit in a situation that only exists because we're trying to do remote management without having a proper remote management solution, it only works for Slurm, and it comes at too high a cost in terms of complexity/maintenance for the benefit it provides".

But yes, something to think about...
> Using `sacct` may work in some cases, but it depends, in my understanding, on how Slurm is configured. So it would be reaching a threshold of "it may provide some limited benefit in a situation that only exists because we're trying to do remote management without having a proper remote management solution, it only works for Slurm, and it comes at too high a cost in terms of complexity/maintenance for the benefit it provides".
Agree wholeheartedly. And I believe your understanding is correct; the purge time for the queue history can be modified by the administrator, which means the utility of this approach could be incredibly variable. I agree it's probably not a sustainable solution unless `sacct` were just used instead of `squeue` altogether, but it's probably not worth that hassle.
Once remote PSI/J exists and is out in the world, though, I imagine there'll be much less need for `.attach` (at least for me!).
Thanks again!
So I started updating the specification based on an earlier discussion we had. The idea was to make the `attach()` call block until the status of the job is updated at least once, while doing some voodoo in the background to ensure that `attach()` doesn't block for too long.
But then I realized a few things:

1. A waiting `attach()` will essentially be the recommendation above (i.e., add `wait(QUEUED)` at the end of the implementation of `attach()`).
2. When `attach()` is used in a single process, it mostly makes sense to use it to recover from a crash, in which case one assumes that a number of jobs that were active before the crash are re-attached and the normal use (callbacks) resumes. Making `attach()` wait would hurt in that scenario because `job.status` is not used.
3. When `attach()` is used from multiple processes, such as in the scenario that triggered this issue, the waiting behavior is indeed useful, but the scenario represents transitional use until remote PSI/J is available.
4. The remaining case where a waiting `attach()` is useful is when `attach()` is being invoked from a single process to get repeated status updates for multiple jobs. In this case, expediting status updates leads to a circumvention of the default queue polling interval, and reasonable design would dictate that a new time parameter be introduced to regulate the intervals at which `attach()` calls trigger queue polls. However, psij-python provides an alternative (callbacks) that can be employed for this use case, which is friendlier to the queuing systems.

In short, a waiting `attach()` hurts the use case that has no choice but to use `attach()` (2), helps a temporary use case that also has no choice but to use it (3), and helps a use case where better choices exist already (4).
There is an alternative, which is to document the use of `wait(QUEUED)` following `attach()`. Under the assumption that the documentation of `attach()` is followed, this does not affect (2) and helps (3) and (4) about as much as a waiting `attach()` would.
It is also reasonable to assume that, with proper documentation, users almost always do the right thing, which would eliminate use case #4, but that does not significantly affect the conclusion.
So I'm inclined to address this with a documentation update for `attach()` in psij-python rather than a semantic update in the specification.
@andre-merzky, @arosen93: thoughts?
Thanks, @hategan!
I would be really hesitant to embrace a waiting attach. In the almost trivial usage of "attach to `n` jobs", it would (in a trivial implementation) immediately lead to approximately `n * pull_interval` wait times:
```python
for native_id in sys.argv[1:]:
    job = psij.Job()
    ex.attach(job, native_id)
```
When the attach for the first job starts, it will have to wait until the executor polls for the state update. Then it takes a fraction of a second to loop around and attach the second job. At that point the executor has just completed the state poll and will have to wait a full `pull_interval` (or whatever the constant is named) for the second attach to complete, and so on. A slow attach is thus not a viable option IMHO. If at all, it could be an optional behavior (`ex.attach(job, wait=psij.QUEUED)`).
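A back-of-the-envelope simulation of that worst case (the 30 s figure and the `total_wait` helper are made up for illustration) shows the cumulative blocking time growing as roughly `n * pull_interval`:

```python
def total_wait(n, interval=30.0, overhead=0.1):
    """Naive sequential waiting-attach: each attach() arrives just after
    the poll the previous one waited for, so each blocks for nearly a
    full polling interval before the next poll fires."""
    t = 0.0               # current time
    total = 0.0           # cumulative blocking time across all attaches
    next_poll = interval  # worst case: first poll a full interval away
    for _ in range(n):
        total += next_poll - t        # this attach blocks until the poll
        t = next_poll + overhead      # loop around to the next attach
        next_poll += interval         # polls fire every `interval` seconds
    return total

print(total_wait(5))  # ~149.6 s of cumulative blocking for just 5 jobs
```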
I agree with a documentation fix being the more appropriate solution.
@hategan: Thanks for the writeup! I also agree that the documentation fix is probably the most viable solution here.
That said, one small comment.
> There is an alternative, which is to document the use of `wait(QUEUED)` following `attach()`
Based on my reading of the code, it seems that adding `QUEUED` as the `target_state` is probably not sufficient if the `.attach` method is called in a separate Python process from where the original job was submitted. Depending on how often the status is checked, one might miss the `QUEUED` state entirely, in which case it'd be waiting indefinitely, no? Wouldn't something like the following be necessary instead, to ensure that all states after `QUEUED` would also be picked up and unblock?
```python
job.wait(
    target_states=[
        JobState.QUEUED,
        JobState.CANCELED,
        JobState.ACTIVE,
        JobState.FAILED,
        JobState.COMPLETED,
    ]
)
```
My assumption is based on this code block, where the `JobState` is strictly checked:
Of course, as we noted earlier, it's possible that the `wait` could still go on indefinitely if the job ID is no longer even accessible by `qstat`/`squeue`, but that's just an inherent limitation of this kind of approach (and the user can specify `timeout` if they need to).
Regardless, this is a minor detail. The key thing is simply to highlight that this kind of user-defined `wait` mechanism is possible.
> Based on my reading of the code, it seems that adding `QUEUED` as the `target_state` is probably not sufficient
You are correct. It turns out that psij-python does not correctly implement the specification, which basically says that `wait(QUEUED)` should wait until it can be asserted that the job is or was in a `QUEUED` state. I've filed an issue for this (#400).
I'm also seeing that the ordering of states is broken even in the specification, since `FAILED > ACTIVE`, but the description of `wait()` explicitly states that you can't assume the job was ever `ACTIVE` if it failed (it could have gone from `QUEUED` to `FAILED` directly). I've submitted an issue for this too (https://github.com/ExaWorks/job-api-spec/issues/166).
If the job is not in the queue any more and PSI/J cannot find certain files related to it, it is supposed to be marked as `COMPLETED`. If said files are found, it might be able to distinguish between `COMPLETED` and `FAILED` based on the exit code. To be clear, assuming the job actually terminates, `wait()` should NEVER block indefinitely.
Ah, okay! Glad we were able to put the pieces together and spot some things in need of an update. Thanks for your attention to detail there!
> If the job is not in the queue any more and PSI/J cannot find certain files related to it, it is supposed to be marked as `COMPLETED`.
I'm not sure that's the behavior I currently see (although maybe it's related to the above issues). If I query for a job after it's done and gone from the queue, I get a `logging` error that the files can't be found, but the state is still returned as `NEW` (when using `.attach`). I can provide more details and troubleshoot if that seems to be a new issue altogether.
Edit: I guess this is expected because there's no way for `.attach` to know whether it's `NEW` or `COMPLETED` at all. You can likely disregard the above paragraph.
> I'm not sure that's the behavior I currently see (although maybe it's related to the above issues). If I query for a job after it's done and gone from the queue, I get a `logging` error that the files can't be found but the state is still returned as `NEW` (when using `.attach`). I can provide more details and troubleshoot if that seems to be a new issue altogether.
Yes, please. I started #401 for this.
> Edit: I guess this is expected because there's no way for `.attach` to know if it's `NEW` or `COMPLETED` at all. You can likely disregard the above paragraph.
If you have a native id, it made it to the queue at some point. If it's not in the queue any more, it either finished or failed at some point. So the code is supposed to assume that a missing job is COMPLETED (or failed if that level of detail is otherwise available).
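That inference can be sketched as a tiny decision function (the names and the exit-code handling here are hypothetical; the real psij-python logic consults files left behind by the job, as mentioned above):

```python
def infer_state(native_id_known, in_queue, exit_code=None):
    """Infer an attached job's state from what the scheduler still knows.

    A known native id means the job reached the queue at some point, so a
    job that is absent from the queue must have finished one way or
    another; it should never be reported as NEW.
    """
    if not native_id_known:
        return "NEW"                 # nothing to go on yet
    if in_queue:
        return "QUEUED_OR_ACTIVE"    # squeue/qstat tells us which
    if exit_code is None:
        return "COMPLETED"           # gone from queue, no detail survives
    return "COMPLETED" if exit_code == 0 else "FAILED"

assert infer_state(True, in_queue=False) == "COMPLETED"
assert infer_state(True, in_queue=False, exit_code=1) == "FAILED"
```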
Super useful to know! Thanks! I'll report more info in #401 later this afternoon when I can do some more thorough tests with reproducible examples.
I tried submitting a Slurm job "manually" and getting the ID. Using this native ID, I used the following code to get the job state.

When doing this within the first ~2-3 seconds after the job was submitted, I get back `NEW` even though the job is marked `Q` in the queue (and it was definitely submitted, because otherwise I wouldn't have had the native ID). If I add a sleep timer of 4 seconds, it returns `QUEUED` every time as expected, but I'm worried that this might not be a general solution because a slow filesystem might change that.

Here is a complete demonstration that works on Perlmutter:

Is this the expected behavior, or is this something to be addressed? I, naturally, get the same behavior when using a separate Python process that uses PSI/J to submit the Slurm job.

Sidenote: This feature of retrieving the job state should also be added to the documentation somewhere.