Closed — grondo closed this issue 2 months ago
I restarted Flux, which was in a good state until a queue was started; then we got the same results as above. This was noted in the logs:
[ +33.929493] job-manager[0]: sched.alloc-response: id=f24ehES5k3J7 already allocated
[ +33.929519] job-manager[0]: alloc: stop due to alloc response error: File exists
After canceling the affected pending job and restarting, the same issue occurred with a different job in the queue. Another cancel-and-restart cycle confirmed this.
This seemed to affect jobs in only one queue. When other queues were started individually, their jobs ran without the scheduler alloc-response error.
One possible improvement to avoid the negative alloc pending count would be to decrement the sent_count after the check for a job that already has resources:
diff --git a/src/modules/job-manager/alloc.c b/src/modules/job-manager/alloc.c
index 58cbe0c12..9ba82007f 100644
--- a/src/modules/job-manager/alloc.c
+++ b/src/modules/job-manager/alloc.c
@@ -182,7 +182,6 @@ static void alloc_response_cb (flux_t *h,
goto teardown;
}
(void)json_object_del (R, "scheduling");
- alloc->sent_count--;
if (!job) {
(void)free_request (alloc, id, R);
@@ -200,6 +199,7 @@ static void alloc_response_cb (flux_t *h,
errno = EEXIST;
goto teardown;
}
+ alloc->sent_count--;
job->R_redacted = json_incref (R);
if (annotations_update_and_publish (ctx, job, annotations) < 0)
flux_log_error (h, "annotations_update: id=%s", idf58 (id));
Probably fixed by #6076. Reopen if we see a problem.
After a restart of Flux on elcap: even after removing the fluxion modules, the "alloc requests pending" count is still negative, though the queued count went down to zero.