ExaWorks / psij-python

Retrieving the status of a submitted batch job returns `NEW` unless the user waits a few seconds #399

Open Andrew-S-Rosen opened 1 year ago

Andrew-S-Rosen commented 1 year ago

I tried submitting a Slurm job "manually" and retrieving its ID. Using this native ID, I ran the following code to get the job state.

from psij import Job, JobExecutor

job_executor = JobExecutor.get_instance("slurm")

job = Job()

job_executor.attach(job, "123456") # placeholder native ID obtained from `sbatch`
print(job.status.state)

When doing this within the first ~2-3 seconds after the job was submitted, I get back NEW even though the job is marked Q in the queue (and it was definitely submitted, since otherwise I wouldn't have a native ID). If I add a sleep of 4 seconds, it returns QUEUED every time as expected, but I'm worried this isn't a general solution: on a slow filesystem, the required delay might be different.

Here is a complete demonstration that works on Perlmutter:

from psij import Job, JobAttributes, JobExecutor, JobSpec, ResourceSpecV1
import time

job_executor = JobExecutor.get_instance("slurm")
job = Job(
    JobSpec(
        name="test",
        executable="/bin/date",
        resources=ResourceSpecV1(node_count=1),
        attributes=JobAttributes(
            project_name="MyAccountName",
            custom_attributes={"slurm.constraint": "cpu", "slurm.qos": "debug"},
        ),
        launcher="single",
    )
)
job_executor.submit(job)
native_id = job.native_id
print(native_id)  # ---> prints the Slurm ID correctly

job_executor = JobExecutor.get_instance("slurm")
job = Job()
job_executor.attach(job, native_id)
print(job.status.state)  # ---> prints NEW
time.sleep(4)
print(job.status.state)  # ---> prints ACTIVE

Is this the expected behavior, or is this something to be addressed? Naturally, I get the same behavior when the job is submitted via PSI/J from a separate Python process.


Sidenote: This feature of retrieving the job state should also be added to the documentation somewhere.

hategan commented 1 year ago

This is... a known issue that we are working on.

Invoking too many qstats in a short period of time can overwhelm the queuing system. So we do one every n seconds [1] for all jobs tracked by an executor (actually, there's a single thread for all executor instances of the same kind in a process). The wait() call basically delays things until that happens. But if you want to be sure you get a non-NEW status, you have to wait at least BatchSchedulerExecutorConfig.initial_queue_polling_delay seconds.

If you can do it, using notifications is the way to go. If you really have to use attach(), you should wait until the job is not in a NEW state (either using wait() or callbacks) or, alternatively, long enough as above.
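
For example, here is a minimal sketch of the callback approach, assuming set_job_status_callback accepts a plain callable taking (job, status); check the API docs for the exact form:

from psij import Job, JobExecutor, JobState

job_executor = JobExecutor.get_instance("slurm")

def on_status(job, status):
    # Invoked by the status poller whenever the job's state changes.
    if status.state != JobState.NEW:
        print(f"job {job.native_id} is now {status.state}")

job = Job()
job.set_job_status_callback(on_status)  # register before attaching
job_executor.attach(job, "123456")  # placeholder native ID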

The current situation is not particularly nice, and there is more demand for attach() than we anticipated. So one proposal (as of a few days ago, and which we haven't yet implemented) is to decree that attach() should only return once a non-NEW state is obtained from the scheduler. The complexity is that we still want to be mindful of the scheduler and try to batch calls to the respective qstat if many attach() requests happen in a short period of time from the same process.

If it's really attach() calls from different processes that need to happen, the above behavior will still apply (i.e., attach() will only return after a current state is obtained), but it's unreasonably difficult to make separate processes communicate in a way that batches sufficiently close-in-time calls. So the only thing left to do is warn users in the documentation that things don't scale well.


[1] I lie a bit. There's an initial delay (2 s by default) and a separate polling interval (30 s by default).
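
For anyone who needs different values, both knobs should be settable when the executor is created. A sketch, assuming the SlurmExecutorConfig class, module path, and field names implied above; verify against the current API docs:

from psij import JobExecutor
from psij.executors.batch.slurm import SlurmExecutorConfig

# Shorter initial delay, unchanged polling interval (assumed field names).
config = SlurmExecutorConfig(
    initial_queue_polling_delay=1,
    queue_polling_interval=30,
)
job_executor = JobExecutor.get_instance("slurm", config=config)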

Andrew-S-Rosen commented 1 year ago

Thanks again, @hategan!! Very clear response. I greatly appreciate it!

If you can do it, using notifications is the way to go. If you really have to use attach(), you should wait until the job is not in a NEW state (either using wait() or callbacks) or, alternatively, long enough as above.

Won't calling .wait() block until the job finishes? If my understanding is correct, that would be problematic for my use case because I'd like to be able to retrieve a QUEUED state, which seems like it'd be ignored while the blocking happens. Either way, attach() is indeed the only option for me since I'm doing things in an async way where I'm SSHing into the remote machine every polling interval to check the job state (at least until PSI/J has built-in support for this kind of thing 😉).

So one proposal (as of a few days ago, and which we haven't yet implemented) is to decree that attach() should only return once a non-NEW state is obtained from the scheduler. The complexity is that we want to still be mindful of the scheduler and try to bulk calls to the respective qstat if many attach() requests are happening in a short period of time from the same process.

That would certainly be welcome! Of course, I understand the challenge though with not just bombarding the system with qstat calls.

Regardless of the ultimate plan here, this is super helpful because now I know that this delay period is intentional, so I can always be sure that if I .sleep() for longer than that, I should be okay!

Andrew-S-Rosen commented 1 year ago

@hategan: This is perhaps worthy of a separate discussion, but it seems there is currently no way to get the status of a job (via .attach) after it has completed and is no longer listed by squeue. That doesn't mean the job status can't be accessed, because, at least with Slurm, you can instead use sacct -j to get the historical job data. So perhaps in the future a fallback option could be added for when the job isn't found via the primary search, or maybe it makes sense to always use sacct instead of squeue. Something to think about...
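
For illustration, a sketch of the kind of fallback lookup meant here, using standard sacct flags (the job ID is a placeholder):

import subprocess

# Query Slurm's accounting database for the job's final state.
# Equivalent to: sacct -j 123456 -X --noheader --format=State
out = subprocess.run(
    ["sacct", "-j", "123456", "-X", "--noheader", "--format=State"],
    capture_output=True, text=True, check=True,
)
print(out.stdout.strip())  # e.g. COMPLETED, FAILED, TIMEOUT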

hategan commented 1 year ago

[...]

If you can do it, using notifications is the way to go. If you really have to use attach(), you should wait until the job is not in a NEW state (either using wait() or callbacks) or, alternatively, long enough as above.

Won't calling .wait() block until the job finishes?

In principle, wait(target_states=[JobState.QUEUED]) should wait until the job is in a QUEUED state or any state that follows QUEUED. The spec mentions this correctly, but the API docs don't, so we need to update that.

If my understanding is correct, that would be problematic for my use case because I'd like to be able to retrieve a QUEUED state, which seems like it'd be ignored while the blocking happens.

That was indeed the case earlier, until it became clear that we were forcing a race condition, so the behaviour above is now in effect. In other words, wait(target_states=[JobState.QUEUED]) should wait until anything after NEW. You could also list all desired target states explicitly, but that shouldn't be necessary.
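
So the recommended pattern becomes (a sketch, assuming the spec behavior just described):

from psij import Job, JobExecutor, JobState

job_executor = JobExecutor.get_instance("slurm")
job = Job()
native_id = "123456"  # placeholder native ID from sbatch
job_executor.attach(job, native_id)
# Per the spec: returns once the job is, or has been, QUEUED.
job.wait(target_states=[JobState.QUEUED])
print(job.status.state)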

Either way, attach() is indeed the only option for me since I'm doing things in an async way where I'm SSHing into the remote machine every polling interval to check the job state (at least until PSI/J has built-in support for this kind of thing 😉).

Yes, remote PSI/J should address this.

[...]

Andrew-S-Rosen commented 1 year ago

Awesome, thanks for the tip! Didn't know about the target_states kwarg in wait!

Very much looking forward to remote PSI/J!!

hategan commented 1 year ago

@hategan: This is perhaps worthy of a separate discussion, but it seems there is currently no way to get the status of a job (via .attach) after it has completed and is no longer listed by squeue. That doesn't mean the job status can't be accessed, because, at least with Slurm, you can instead use sacct -j to get the historical job data. So perhaps in the future a fallback option could be added for when the job isn't found via the primary search, or maybe it makes sense to always use sacct instead of squeue. Something to think about...

In general, that's the problem with synchronous status: you can miss states and some of the mechanisms that PSI/J uses to detect exit code and other conditions are lost with attach(). Using sacct may work in some cases, but it depends, in my understanding, on how Slurm is configured. So it would be reaching a threshold of "it may provide some limited benefit in a situation that is needed because we're trying to do remote management without having a proper remote management solution and it only works for slurm and it comes at too high a cost in terms of complexity/maintenance for the benefit it provides".

But yes, something to think about...

Andrew-S-Rosen commented 1 year ago

Using sacct may work in some cases, but it depends, in my understanding, on how Slurm is configured. So it would be reaching a threshold of "it may provide some limited benefit in a situation that is needed because we're trying to do remote management without having a proper remote management solution and it only works for slurm and it comes at too high a cost in terms of complexity/maintenance for the benefit it provides".

Agree wholeheartedly. And I believe your understanding is correct; the purge time for the queue history can be modified by the administrator, which means the utility of this approach could be highly variable. I agree it's probably not a sustainable solution unless sacct were just used instead of squeue altogether, but that's probably not worth the hassle.

Once remote PSI/J exists and is out in the world though, I imagine there'll be much less need for .attach (at least for me!).

Thanks again!

hategan commented 1 year ago

So I started updating the specification based on an earlier discussion we had. The idea was to make the attach() call block until the status of the job is updated at least once, while doing some voodoo in the background to ensure that attach() doesn't block for too long.

But then I realized a few things:

  1. The implementation of the new attach() will essentially be the recommendation above (i.e., add wait(QUEUED) at the end of the implementation of attach()).
  2. When attach() is used in a single process, it mostly makes sense to use it to recover from a crash, in which case one assumes that a number of jobs that were active before the crash are re-attached and the normal use (callbacks) resumes. Making attach() wait would hurt in that scenario because job.status is not used.
  3. When attach() is used from multiple processes, such as in the scenario that triggered this issue, the waiting behavior is indeed useful, but the scenario represents transitional use until remote PSI/J is available.
  4. The other scenario in which a waiting attach() is useful is when attach() is being invoked from a single process to get repeated status updates for multiple jobs. In this case, expediting status updates leads to a circumvention of the default queue polling interval and reasonable design would dictate that a new time parameter be introduced to regulate the intervals at which attach() calls trigger queue polls. However, psij-python provides an alternative (callbacks) that can be employed for this use case, which is friendlier to the queuing systems.

In short, a waiting attach() hurts the use case that has no choice but to use attach() (2), helps a temporary use case that also has no choice but to use it (3), and helps a use case where better choices already exist (4). There is an alternative, which is to document the use of wait(QUEUED) following attach(). Under the assumption that the documentation of attach() is followed, this does not affect (2) and helps (3) and (4) about as much as a waiting attach() would.

It is also reasonable to assume that, with proper documentation, users almost always do the right thing, which would eliminate use case (4), but that does not significantly affect the conclusion.

So I'm inclined to address this with a documentation update for attach() in psij-python rather than a semantic update in the specification.

@andre-merzky, @arosen93: thoughts?

andre-merzky commented 1 year ago

Thanks @hategan !

I would be really hesitant to embrace a waiting attach(). In the almost trivial usage of 'attach to n jobs', it would (in a trivial implementation) immediately lead to approx. n * pull_interval of total wait time:

import sys
import psij

ex = psij.JobExecutor.get_instance("slurm")
for native_id in sys.argv[1:]:
    job = psij.Job()
    ex.attach(job, native_id)

When the attach() for the first job starts, it will have to wait until the executor polls for the state update. Then it takes a fraction of a second to loop around and attach the second job. At that point the executor has just completed the state poll and will have to wait a full pull_interval (or whatever the constant is named) before the second attach() can complete, etc. A slow attach() is thus not a viable option IMHO. If at all, it could be an optional behavior (ex.attach(job, wait=psij.QUEUED)).
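
For contrast, a sketch of what the current non-blocking attach() allows: attach everything first, then wait, so a single poll cycle covers all jobs.

import sys
import psij
from psij import JobState

ex = psij.JobExecutor.get_instance("slurm")
jobs = []
for native_id in sys.argv[1:]:
    job = psij.Job()
    ex.attach(job, native_id)  # non-blocking today
    jobs.append(job)

# One poll cycle later, every job has a real state; total wait is
# roughly a single interval rather than n of them.
for job in jobs:
    job.wait(target_states=[JobState.QUEUED])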

I agree with a documentation fix being the more appropriate solution.

Andrew-S-Rosen commented 1 year ago

@hategan: Thanks for the writeup! I also agree that the documentation fix is probably the most viable solution here.

That said, one small comment.

There is an alternative, which is to document the use of wait(QUEUED) following attach()

Based on my reading of the code, it seems that adding QUEUED as the target_state is probably not sufficient if the .attach method is called in a separate Python process from where the original job was submitted. Depending on how often the status is checked, one might miss the QUEUED state entirely, in which case wait() would block indefinitely, no? Wouldn't something like the following be necessary instead, to ensure that all states after QUEUED are also picked up and unblock the call?

job.wait(
    target_states=[
        JobState.QUEUED,
        JobState.CANCELED,
        JobState.ACTIVE,
        JobState.FAILED,
        JobState.COMPLETED,
    ]
)

My assumption is based on this code block where the JobState is strictly checked:

https://github.com/ExaWorks/psij-python/blob/331d446103eccfd447b21da152e81e479b2a6520/src/psij/job.py#L193-L199

Of course, as we noted earlier, it's possible that the wait could still go on indefinitely if the job ID is no longer accessible via qstat/squeue, but that's an inherent limitation of this kind of approach (and the user can specify a timeout if needed).
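
For reference, a sketch of the same call with such a timeout as a safety net (continuing from the job above, and assuming wait() takes a datetime.timedelta and returns None when the timeout expires; check the API docs):

from datetime import timedelta
from psij import JobState

status = job.wait(
    timeout=timedelta(minutes=5),
    target_states=[
        JobState.QUEUED,
        JobState.CANCELED,
        JobState.ACTIVE,
        JobState.FAILED,
        JobState.COMPLETED,
    ],
)
if status is None:
    print("timed out before reaching a post-NEW state")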

Regardless, this is a minor detail. The key thing is simply to highlight that this kind of user-defined wait mechanism is possible.

hategan commented 1 year ago

Based on my reading of the code, it seems that adding QUEUED as the target_state is probably not sufficient

You are correct. It turns out that psij-python does not correctly implement the specification, which basically says that wait(QUEUED) should wait until it can be asserted that the job is or was in a QUEUED state. I've filed an issue for this (#400).

I'm also seeing that the ordering of states is broken even in the specification, since FAILED > ACTIVE, but the description of wait() explicitly states that you can't assume the job was ever ACTIVE if it failed (it could have gone directly from QUEUED to FAILED). I've submitted an issue for this too (https://github.com/ExaWorks/job-api-spec/issues/166).

If the job is not in the queue any more and PSI/J cannot find certain files related to it, it is supposed to be marked as COMPLETED. If said files are found, it might be able to distinguish between COMPLETED and FAILED based on exit code. To be clear, assuming the job actually terminates, wait() should NEVER block indefinitely.

Andrew-S-Rosen commented 1 year ago

Ah, okay! Glad we were able to put the pieces together and spot some things in need of an update. Thanks for your attention to detail there!

If the job is not in the queue any more and PSI/J cannot find certain files related to it, it is supposed to be marked as COMPLETED.

I'm not sure that's the behavior I currently see (although maybe it's related to the above issues). If I query for a job after it's done and gone from the queue, I get a logging error that the files can't be found but the state is still returned as NEW (when using .attach). I can provide more details and troubleshoot if that seems to be a new issue altogether.

Edit: I guess this is expected because there's no way for .attach to know if it's NEW or COMPLETED at all. You can likely disregard the above paragraph.

hategan commented 1 year ago

I'm not sure that's the behavior I currently see (although maybe it's related to the above issues). If I query for a job after it's done and gone from the queue, I get a logging error that the files can't be found but the state is still returned as NEW (when using .attach). I can provide more details and troubleshoot if that seems to be a new issue altogether.

Yes, please. I started #401 for this.

Edit: I guess this is expected because there's no way for .attach to know if it's NEW or COMPLETED at all. You can likely disregard the above paragraph.

If you have a native ID, the job made it to the queue at some point. If it's not in the queue any more, it either finished or failed at some point. So the code is supposed to assume that a missing job is COMPLETED (or FAILED, if that level of detail is otherwise available).

Andrew-S-Rosen commented 1 year ago

Super useful to know! Thanks! I'll report more info in #401 later this afternoon when I can do some more thorough tests with reproducible examples.