chaen closed this issue 2 months ago.
The `RunningLimit` is only checked at matching time. `getReplicasForJobs` is instead called when the job is optimized, i.e. before the matching attempt (potentially quite some time before). The two checks are not logically tied to each other, so I would not mix them.
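A purely illustrative sketch of that timeline (placeholder functions and data, not DIRAC APIs): the replica conditions are evaluated at optimization time, the `RunningLimit` only at matching time, and the limits can change in the gap between the two.

```python
import time

def optimize(job, replica_ok_by_site):
    # optimization time: only the input-data/replica conditions are checked
    job["sites"] = [s for s, ok in replica_ok_by_site.items() if ok]

def match(job, running_limits):
    # matching time, possibly much later: only now is RunningLimit consulted
    return [s for s in job["sites"] if running_limits.get(s, 1) > 0]

job = {"sites": []}
optimize(job, {"LCG.CERN.ch": True})
time.sleep(0)  # stands in for an arbitrarily long gap; limits may change here
print(match(job, {"LCG.CERN.ch": 0}))  # -> [] : the job can no longer match
```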
That can still create impossible jobs, so why close the issue? :thinking:
What you are pointing out is one of the (several) conditions through which we can create jobs that stay in "Waiting" status for potentially a long time, maybe "forever". There are at least 2 connected, unavoidable cases:

- the `RunningLimit` is set to 0 after the job is created, and further attempts of matching will fail;
- the `RunningLimit` is 0, with or without implementing your proposal.

The list can go on, but long story short there is no way to fully avoid creating jobs that will stay "Waiting" for a "long" time.
I also do not much like the `getReplicasForJobs` checks.
One other possibility is to reset jobs that have been in "Waiting" for a long time (because conditions such as the allowed replicas might have changed in the meantime -- that is why the `JobWrapper` calls `getReplicasForJobs` again). Would that be a bad idea? Did we by chance already think about that in the past? -- cc @atsareg
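For concreteness, a minimal sketch of such a reset, e.g. run periodically from an agent. `selectJobs` and `rescheduleJob` are stand-ins for whatever WMS client calls would actually be used, and the threshold is an arbitrary example, not an existing DIRAC option:

```python
import datetime

MAX_WAITING = datetime.timedelta(days=7)  # hypothetical threshold

def resetStaleWaitingJobs(wmsClient):
    """Re-queue jobs stuck in Waiting so the optimizers re-evaluate them."""
    cutoff = datetime.datetime.utcnow() - MAX_WAITING
    # select jobs still Waiting whose last status update predates the cutoff
    for jobID in wmsClient.selectJobs(status="Waiting", lastUpdateBefore=cutoff):
        # send the job back through the optimizers, so that conditions that
        # changed in the meantime (replicas, RunningLimit, ...) are re-checked
        wmsClient.rescheduleJob(jobID)
```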
`getReplicasForJobs` (in the API/DataManager) returns only replicas that are allowed for jobs, i.e. on disk, not on failover, etc. However, the `RunningLimit` of a given site can be set to 0 for User jobs, making the job impossible to run. We could detect that in the `Dirac` API and return an error.
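A minimal sketch of what that check could look like, assuming DIRAC's `Operations` helper and the `JobScheduling/RunningLimit` CS layout; the exact option path may differ per installation:

```python
from DIRAC import S_OK, S_ERROR
from DIRAC.ConfigurationSystem.Client.Helpers.Operations import Operations

def checkRunningLimit(site, jobType="User"):
    """Refuse submission if the site's RunningLimit for this job type is 0."""
    # hypothetical option path, modelled on the RunningLimit CS structure
    limit = Operations().getValue(
        f"JobScheduling/RunningLimit/{site}/JobType/{jobType}", None
    )
    if limit is not None and int(limit) == 0:
        return S_ERROR(f"RunningLimit for {jobType} jobs at {site} is 0")
    return S_OK()
```

The `Dirac` API could then call such a check at submission time and refuse the job before it ever enters "Waiting".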