ExaWorks / job-api-spec

https://exaworks.org/job-api-spec/

Native ID may not be available in a QUEUED state on local jobs #181

Open hategan opened 9 months ago

hategan commented 9 months ago

The addition of file staging complicates setting the native ID in local-type executors. Before, we could Popen a process and immediately set the QUEUED state with the native ID. With file staging we cannot, since files must be staged before the Popen. And since STAGE_IN comes after QUEUED in the state order (STAGE_IN > QUEUED), this leads to an impossibility if the staging is to be done in the same process as the one PSI/J runs in. Even now, in psij-python we cheat a bit, since we trigger the QUEUED state only after Popen is called.
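A minimal sketch of the ordering problem, assuming a toy executor that stages in-process. `run_local_job` and the `on_state` callback are hypothetical names for illustration and are not psij-python code:

```python
import shutil
import subprocess


def run_local_job(stage_in, argv, on_state):
    """Toy local executor (hypothetical, not psij-python code).

    stage_in: list of (source, destination) paths to copy before launch.
    on_state: callback receiving (state_name, native_id_or_None).
    """
    # The state order puts QUEUED first, but no process exists yet,
    # so there is no PID to report as the native ID.
    on_state("QUEUED", None)
    on_state("STAGE_IN", None)
    for src, dst in stage_in:
        shutil.copy(src, dst)
    # Only after staging can the process be started and a PID obtained.
    proc = subprocess.Popen(argv)
    on_state("ACTIVE", proc.pid)
    rc = proc.wait()
    on_state("COMPLETED" if rc == 0 else "FAILED", proc.pid)
    return rc
```

In this sketch the first state that can carry a PID is ACTIVE, which is the situation described above.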

One possible solution would be to have a job management process that handles both the staging and execution. It would be the PID of this process that would become the native ID of the job. This adds complexity, since we now have to communicate state transitions to the main PSI/J process from the job management process.
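A rough sketch of the job management process idea, under several assumptions: the manager is a small Python script, it stages a single file, and it reports states as lines on its stdout. `MANAGER` and `submit` are hypothetical names, and the line-based protocol is only one of many ways to communicate transitions:

```python
import subprocess
import sys

# Hypothetical manager script: stages one file, runs the job, and prints
# one state name per line. Job output is discarded here so that it cannot
# mix with the state lines on the shared pipe.
MANAGER = r"""
import shutil, subprocess, sys
src, dst = sys.argv[1], sys.argv[2]
print("STAGE_IN", flush=True)
shutil.copy(src, dst)
print("ACTIVE", flush=True)
rc = subprocess.call(sys.argv[3:], stdout=subprocess.DEVNULL)
print("COMPLETED" if rc == 0 else "FAILED", flush=True)
"""


def submit(src, dst, argv, on_state):
    mgr = subprocess.Popen(
        [sys.executable, "-c", MANAGER, src, dst, *argv],
        stdout=subprocess.PIPE,
        text=True,
    )
    # The manager's PID exists before any staging, so QUEUED can carry it.
    on_state("QUEUED", mgr.pid)
    # Relay each state line from the manager to the caller.
    for line in mgr.stdout:
        on_state(line.strip(), mgr.pid)
    return mgr.wait()
```

The native ID here is the PID of the manager, not of the job itself, and the extra pipe handling is the added complexity mentioned above.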

Another possible solution, which leads to a simpler implementation but may complicate use, is to only guarantee that the native ID is set when the job is in an ACTIVE state. This would be a specification change.
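To illustrate how the second option may complicate use, here is a sketch of what client code might need to do if the native ID were only guaranteed in ACTIVE. `ToyJob` is a hypothetical stand-in and does not reproduce the psij-python `Job` class:

```python
import threading


class ToyJob:
    """Hypothetical job object; not the psij-python Job class."""

    def __init__(self):
        self._cond = threading.Condition()
        self.state = "NEW"
        self.native_id = None

    def set_state(self, state, native_id=None):
        # Called by the executor; wakes up any thread blocked in wait_for().
        with self._cond:
            self.state = state
            if native_id is not None:
                self.native_id = native_id
            self._cond.notify_all()

    def wait_for(self, states, timeout=None):
        # Block until the job is in one of the given states or the timeout expires.
        with self._cond:
            return self._cond.wait_for(lambda: self.state in states, timeout)


def native_id_when_active(job, timeout=None):
    # A job that fails before ACTIVE may never get a native ID, so final
    # states are included to avoid blocking forever; the result may be None.
    job.wait_for({"ACTIVE", "COMPLETED", "FAILED", "CANCELED"}, timeout)
    return job.native_id
```

Callers that today read the native ID right after QUEUED would have to wait like this, or accept `None`.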

andre-merzky commented 9 months ago

[removed stupid comment]

hategan commented 9 months ago

> [removed stupid comment]

:)

I agree that the wording is a bit ambiguous, and I think we meant that the native ID will be available as soon as the job enters the QUEUED state, since all executors in psij-python do that. Even if we decide to go with that option, better wording can't hurt.

andre-merzky commented 9 months ago

> [removed stupid comment]
>
> :)
>
> I agree that the wording is a bit ambiguous and I think we meant that the native id will be available as soon as the job enters the queued state, since all executors in psij-python do that. Even if we decide to go with that option, better wording can't hurt.

Well, my comment was stupid because I am used to staging happening before the QUEUED state, but that does not hold for PSI/J. IIRC, we assume that the batch system enacts the staging, and thus the job first gets QUEUED, right? But then the statement (job ID only available in QUEUED) still holds.

So yes, I agree with you I guess - a bit more careful wording in several places may be sufficient to address this.

hategan commented 9 months ago

> I am used to the staging happening before the QUEUED state, but that does not hold for PSIJ. IIRC, we assume that the batch system enacts the staging, and thus the job first gets QUEUED, right? But then the statement (jobID only available in QUEUED) still holds.

Not for the local executor and possibly not for other executors that implement synthetic staging.

We had SUBMITTED in Swift instead of QUEUED. Slightly different meaning which allowed us to cleanly have SUBMITTED before STAGE_IN in all cases. But that quite clearly meant that SUBMITTED and having a native ID were disconnected. We didn't actually expose the native ID, so it didn't matter.

> So yes, I agree with you I guess - a bit more careful wording in several places may be sufficient to address this.