Hi David and thank you for the suggestion.
I'll let @andre-merzky weigh in, but the specification is purposefully light in prescribing what an implementation should do except to strongly suggest that it should batch up requests in the background as much as possible.
The specification is mainly focused on the user side of things. Its goal is to provide a familiar model for multiple implementations, both within the same programming language and across languages. It gives implementations quite some liberty to decide how best to do what they need to do, since it may be hard to come up with the "one true solution" to many of the problems encountered.
I will try to address some of the issues you mention here. I do not fully understand everything, so I will also ask for clarification where needed.
[...] does not talk about how to manage Psi/J's own internal status information on submitted jobs.
I am unclear about what this means. It is my understanding that there should be a separation between the external (exposed to users) API and whatever happens internally in an implementation. As a consequence, internal status information is, by design, not exposed in the public API. But I am not sure I fully grasp what your sentence means and it might be helpful if you could give us some more details.
has no mechanism for tracking, using, and managing its own stored state about submitted/completed jobs (say on shutdown/startup).
Again, I'm not entirely sure what you mean here. The local layer of PSI/J is meant to be used from a persistent process and has limited support for managing jobs from multiple disconnected processes. It does, however, provide a mechanism to retrieve and re-connect a job to a "native job id" (e.g., a SLURM job id), so that can be used to, say, submit jobs from one process and monitor them from another.
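For concreteness, here is a minimal sketch, assuming the psij-python reference implementation, of submitting in one process and re-connecting from another via the native ID (the executor name and the example native ID are placeholders):

```python
from psij import Job, JobSpec, JobExecutor

# --- process A: submit and record the native (scheduler) ID ---
executor = JobExecutor.get_instance('slurm')
job = Job(JobSpec(executable='/bin/sleep', arguments=['600']))
executor.submit(job)
print(job.native_id)  # e.g. the SLURM job ID; persist it somewhere

# --- process B (possibly days later): re-connect to the same job ---
executor = JobExecutor.get_instance('slurm')
job = Job()
executor.attach(job, '123456')  # placeholder: the native ID saved by process A
job.wait()                      # blocks until the job reaches a terminal state
```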
Under what function calls / state conditions does Psi/J poll the batch queue?
I understand these are suggestions rather than questions, but I believe it may be more appropriate to discuss them a bit. As above, that is left as an implementation detail. The reference Python implementation (psij-python) uses a configurable polling interval to query the status of all managed jobs (some of the details are spelled out in https://exaworks.org/psij-python/docs/v/0.9.0/.generated/psij.executors.batch.html#module-psij.executors.batch.batch_scheduler_executor). Please note, however, that not all executors are batch scheduler executors, and some batch scheduler executors can be implemented in ways that don't rely on qsub/qstat/qdel or equivalent.
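As an illustration only, tuning the polling behaviour in psij-python might look roughly like the sketch below. The config class comes from the module linked above, but the exact keyword and whether a scheduler-specific config class is required are assumptions; please check that documentation.

```python
from psij import Job, JobSpec, JobExecutor
from psij.executors.batch.batch_scheduler_executor import BatchSchedulerExecutorConfig

# Assumed parameter name: poll the queue once a minute instead of the default.
config = BatchSchedulerExecutorConfig(queue_polling_interval=60)
executor = JobExecutor.get_instance('slurm', config=config)
executor.submit(Job(JobSpec(executable='/bin/date')))
```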
How can polling be transformed into callbacks from the batch queue itself (when supported, e.g. https://slurm.schedmd.com/strigger.html)?
This is meant to be done internally by the implementation, and it seems that we should clarify this in the specification. To be clear, the user simply adds a callback for a job (or an executor), and the implementation takes care of polling the queue and invoking the callback as needed.
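A minimal sketch, assuming psij-python's callback interface (which, as far as I recall, accepts either a JobStatusCallback instance or a plain callable taking the job and its new status):

```python
from psij import Job, JobSpec, JobExecutor, JobStatus

def on_status_change(job: Job, status: JobStatus) -> None:
    # Invoked by the implementation; the user never polls the queue directly.
    print(f'{job.id}: {status.state}')

executor = JobExecutor.get_instance('local')
executor.set_job_status_callback(on_status_change)  # executor-wide callback

job = Job(JobSpec(executable='/bin/date'))
executor.submit(job)
job.wait()
```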
Where does Psi/J track its internal information about submitted jobs?
This is implementation specific, and, even within a single implementation, it can be executor specific. In our psij-python implementation, the batch scheduler executors keep a list of active jobs. Other executors may do other things. None of it is meant to be exposed to the user.
How can that information be queried programmatically (e.g. on shutdown/startup of a Psi/J executor)?
The one thing that can be queried is job.native_id. In the case of scheduler jobs, the ID will be the scheduler job ID. For the local executor, it will be the PID of the child process. For other executors it may be something else. However, all executors must make a good-faith effort to give you status updates for that job if you use executor.attach(job, native_id). Please note, however, that executors are, for all reasonably observable purposes, stateless entities. While they may maintain some state, it is only for the purpose of optimizing things (such as reducing the number of batch queue queries). So there is no shutdown of an executor, and there is no particular startup of an executor beyond getting an instance of it. Please also note that executor instances may (or may not) share such invisible internal state. For example, all batch queue executors in psij-python of the same kind (e.g., all SLURM executors) share a single queue polling thread.
I'm moving the conversation around a bit, since it looks like there are two separate issues here: polling optimizations and job tracking.
The specification is mainly focused on the user side of things. Its goal is to provide a familiar model for multiple implementations, both within the same programming language and across languages. It gives implementations quite some liberty to decide how best to do what they need to do, since it may be hard to come up with the "one true solution" to many of the problems encountered.
I agree with this goal. However, as a user, I want to have some idea of the system resources used by my application. This is especially true when I consider running multiple instances of the API simultaneously.
How can polling be transformed into callbacks from the batch queue itself (when supported, e.g. https://slurm.schedmd.com/strigger.html)?
This is meant to be done internally by the implementation, and it seems that we should clarify this in the specification. To be clear, the user simply adds a callback for a job (or an executor), and the implementation takes care of polling the queue and invoking the callback as needed.
Perhaps this, and the question about resource utilization above, could be handled by a "supported features" call to the executor. Like querying the version of a package, this would allow executors to list optimizations they support.
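To make the idea concrete, here is a purely hypothetical sketch; none of these names exist in PSI/J today, and this only illustrates what such a query might look like:

```python
from psij import JobExecutor

executor = JobExecutor.get_instance('slurm')

# Hypothetical API: ask the executor which optimizations it supports.
features = executor.supported_features()   # hypothetical method
if 'scheduler_triggers' in features:       # hypothetical feature name
    # rely on scheduler-side callbacks (e.g. strigger) instead of polling
    ...
else:
    # fall back to periodic queue polling
    ...
```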
[...] does not talk about how to manage Psi/J's own internal status information on submitted jobs. has no mechanism for tracking, using, and managing its own stored state about submitted/completed jobs (say on shutdown/startup).
I am unclear about what this means. It is my understanding that there should be a separation between the external (exposed to users) API and whatever happens internally in an implementation. As a consequence, internal status information is, by design, not exposed in the public API. But I am not sure I fully grasp what your sentence means and it might be helpful if you could give us some more details. The local layer of PSI/J is meant to be used from a persistent process and has limited support for managing jobs from multiple disconnected processes. It does, however, provide a mechanism to retrieve and re-connect a job to a "native job id" (e.g., a SLURM job id), so that can be used to, say, submit jobs from one process and monitor them from another.
Where does Psi/J track its internal information about submitted jobs?
This is implementation specific, and, even within a single implementation, it can be executor specific. In our psij-python implementation, the batch scheduler executors keep a list of active jobs. Other executors may do other things. None of it is meant to be exposed to the user.
How can that information be queried programmatically (e.g. on shutdown/startup of a Psi/J executor)?
The one thing that can be queried is job.native_id. In the case of scheduler jobs, the ID will be the scheduler job ID. For the local executor, it will be the PID of the child process. For other executors it may be something else. However, all executors must make a good-faith effort to give you status updates for that job if you use executor.attach(job, native_id). Please note, however, that executors are, for all reasonably observable purposes, stateless entities. While they may maintain some state, it is only for the purpose of optimizing things (such as reducing the number of batch queue queries). So there is no shutdown of an executor, and there is no particular startup of an executor beyond getting an instance of it. Please also note that executor instances may (or may not) share such invisible internal state. For example, all batch queue executors in psij-python of the same kind (e.g., all SLURM executors) share a single queue polling thread.
OK. I had assumed that the job API should be able to produce a list of jobs it is tracking. Further, since Psi/J uses the $HOME/.psij directory, it is the natural place to save/restore such past job information.
In light of the response above, I would say that having some way to ask an executor to list its currently tracked jobs and to suspend/resume is a feature I would like to see. That means executors need to maintain explicit state - and expose to users how to print, save, and load that state.
This is an important feature because jobs often spend days in some states (e.g., sitting in the batch queue waiting to be executed, or running). Those periods are much longer than a typical login session remains open. Also, it may be useful to capture the state of jobs at a particular time, or to operate the API in "read-only" or listening mode to monitor jobs without actively changing anything. These use cases could be supported easily by creating files with well-known names when jobs enter each state. Or, better, by having the job itself run a script with a well-known name (listing all accumulated callbacks) when entering each state.
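As a rough user-side prototype of the first option, using psij-python's existing callback mechanism (the directory layout and file names are made up for illustration), a status callback could drop a well-known file each time a job enters a new state:

```python
import pathlib
import time

from psij import Job, JobSpec, JobExecutor, JobStatus

STATE_DIR = pathlib.Path.home() / '.psij' / 'job_states'   # illustrative location

def record_state(job: Job, status: JobStatus) -> None:
    # e.g. ~/.psij/job_states/<job-id>/QUEUED, .../ACTIVE, .../COMPLETED
    job_dir = STATE_DIR / str(job.id)
    job_dir.mkdir(parents=True, exist_ok=True)
    (job_dir / str(status.state)).write_text(str(time.time()))

executor = JobExecutor.get_instance('local')
executor.set_job_status_callback(record_state)
executor.submit(Job(JobSpec(executable='/bin/date')))
```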
I've created an implementation that removes polling and tracks all jobs launched (as well as replacing filesystem paths with scripts -- #175).
https://github.com/frobnitzem/psik
I tried to follow the API spec as much as possible, but had to make some changes to the JobSpec data models. I think these are worth discussing as changes to the draft API specification.
I'm moving the conversation around a bit, since it looks like there are two separate issues here: polling optimizations and job tracking.
The specification is mainly focused on the user side of things. Its goal is to provide a familiar model for multiple implementations, both within the same programming language and across languages. It gives implementations quite some liberty to decide how best to do what they need to do, since it may be hard to come up with the "one true solution" to many of the problems encountered.
I agree with this goal. However, as a user, I want to have some idea of the system resources used by my application. This is especially true when I consider running multiple instances of the API simultaneously.
Indeed. However, as pointed out above, this is something that belongs to implementations. And there is good reason for this approach: implementations should be able to independently develop better solutions to the problem at hand. To be clear, the performance of an implementation is largely a separate issue from the API specification.
How can polling be transformed into callbacks from the batch queue itself (when supported, e.g. https://slurm.schedmd.com/strigger.html)?
This is meant to be done internally by the implementation, and it seems that we should clarify this in the specification. To be clear, the user simply adds a callback for a job (or an executor), and the implementation takes care of polling the queue and invoking the callback as needed.
Perhaps this, and the question about resource utilization above, could be handled by a "supported features" call to the executor. Like querying the version of a package, this would allow executors to list optimizations they support.
This seems to me like a choice that one would make when picking a PSI/J implementation rather than programmatically, at run time, using an optimization list. There seem to be two further problems with listing optimizations: one is that it is a somewhat general problem that could apply to any program, and hence is somewhat outside the scope of PSI/J; the second is that it is not quite clear that one can define a formal model for all possible or reasonable optimizations that an implementation could support.
[...]
OK. I had assumed that the job API should be able to produce a list of jobs it is tracking. Further, since Psi/J uses the $HOME/.psij directory, it is the natural place to save/restore such past job information. In light of the response above, I would say that having some way to ask an executor to list its currently tracked jobs and to suspend/resume is a feature I would like to see. That means executors need to maintain explicit state - and expose to users how to print, save, and load that state.
It is entirely possible to add a callback to an executor that does precisely that kind of accounting without having to add to the specification or the base implementation.
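For example, a user-side sketch along those lines, assuming psij-python and an arbitrary file location, could keep its own ledger of every job it has seen:

```python
import json
import pathlib

from psij import Job, JobSpec, JobExecutor, JobStatus

LEDGER = pathlib.Path.home() / '.psij' / 'tracked_jobs.json'   # arbitrary location

def track(job: Job, status: JobStatus) -> None:
    # Record the latest known state and native ID of every job seen by this executor.
    ledger = json.loads(LEDGER.read_text()) if LEDGER.exists() else {}
    ledger[str(job.id)] = {'native_id': job.native_id, 'state': str(status.state)}
    LEDGER.parent.mkdir(parents=True, exist_ok=True)
    LEDGER.write_text(json.dumps(ledger, indent=2))

executor = JobExecutor.get_instance('slurm')
executor.set_job_status_callback(track)
executor.submit(Job(JobSpec(executable='/bin/date')))

# A separate process can later list these jobs and re-attach by native ID
# with executor.attach(job, native_id).
```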
This is an important feature because jobs often spend days in some states (e.g., sitting in the batch queue waiting to be executed, or running). Those periods are much longer than a typical login session remains open. Also, it may be useful to capture the state of jobs at a particular time, or to operate the API in "read-only" or listening mode to monitor jobs without actively changing anything. These use cases could be supported easily by creating files with well-known names when jobs enter each state. Or, better, by having the job itself run a script with a well-known name (listing all accumulated callbacks) when entering each state.
There are a number of ways that one could approach this problem, including saving job history to a database or file, or using a service in the sense of the PSI/J remote layer or equivalent. While we haven't added the remote layer to the specification, which mostly should consist of ways to specify authentication credentials, these scenarios are supported by the specification as it is.
Tracking changes in job state is a powerful feature, but must be used responsibly in order to work effectively. At present, the specification refers to polling by calling batch queue status APIs frequently, and does not talk about how to manage Psi/J's own internal status information on submitted jobs.
The result is that Psi/J uses an unknown/uncontrolled amount of system resources and has no mechanism for tracking, using, and managing its own stored state about submitted/completed jobs (say on shutdown/startup).
Describe the solution you'd like
The spec should clearly spell out: