job-manager: underflow in alloc request to scheduler count

grondo commented 3 months ago

After a restart of Flux on elcap:

13 alloc requests queued
-9 alloc requests pending to scheduler

Even after removing the fluxion modules, the alloc requests pending count is still negative, though the queued count went down to zero.

grondo commented 3 months ago

I restarted Flux. It was in a good state until a queue was started, at which point we got the same results as above. The logs showed:

[ +33.929493] job-manager[0]: sched.alloc-response: id=f24ehES5k3J7 already allocated
[ +33.929519] job-manager[0]: alloc: stop due to alloc response error: File exists

grondo commented 3 months ago

After canceling the pending affected job and restarting, the same issue occurred, just with a different job in the queue. Another cancel and restart confirmed this.

grondo commented 3 months ago

This seemed to affect only the jobs in one queue. Starting the other queues individually let their jobs run without the sched.alloc-response error shown above.

grondo commented 3 months ago

One possible improvement to avoid the negative alloc pending count would be to decrement sent_count only after checking whether the job already has resources:

diff --git a/src/modules/job-manager/alloc.c b/src/modules/job-manager/alloc.c
index 58cbe0c12..9ba82007f 100644
--- a/src/modules/job-manager/alloc.c
+++ b/src/modules/job-manager/alloc.c
@@ -182,7 +182,6 @@ static void alloc_response_cb (flux_t *h,
             goto teardown;
         }
         (void)json_object_del (R, "scheduling");
-        alloc->sent_count--;

         if (!job) {
             (void)free_request (alloc, id, R);
@@ -200,6 +199,7 @@ static void alloc_response_cb (flux_t *h,
             errno = EEXIST;
             goto teardown;
         }
+        alloc->sent_count--;
         job->R_redacted = json_incref (R);
         if (annotations_update_and_publish (ctx, job, annotations) < 0)
             flux_log_error (h, "annotations_update: id=%s", idf58 (id));
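
To illustrate why the ordering matters, here is a toy model in C, not the job-manager code. The names `toy_alloc`, `toy_job`, `respond_decrement_first`, `respond_check_first`, and `replay_unexpected` are all hypothetical. It assumes the counter is only incremented when a request is sent, and that a response can arrive for a job that already holds resources with no matching request outstanding. It does not claim to reproduce the exact sequence seen on elcap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the real structures (illustration only). */
struct toy_alloc {
    int sent_count;      /* requests believed to be pending at the scheduler */
};

struct toy_job {
    bool has_resources;  /* true once an allocation has been recorded */
};

typedef int (*respond_f) (struct toy_alloc *a, struct toy_job *job);

/* Ordering before the patch: decrement first, validate afterwards.
 * An unexpected response still lowers the counter.
 * Returns 0 on success, -1 if the job was already allocated. */
static int respond_decrement_first (struct toy_alloc *a, struct toy_job *job)
{
    assert (a != NULL && job != NULL);
    a->sent_count--;
    if (job->has_resources)
        return -1;       /* analogous to the "already allocated" error path */
    job->has_resources = true;
    return 0;
}

/* Ordering proposed in the patch: validate first, decrement only for a
 * response that is actually accepted. */
static int respond_check_first (struct toy_alloc *a, struct toy_job *job)
{
    assert (a != NULL && job != NULL);
    if (job->has_resources)
        return -1;       /* counter left untouched on the error path */
    a->sent_count--;
    job->has_resources = true;
    return 0;
}

/* Feed n responses for a job that already holds resources, starting from
 * zero outstanding requests, and return the resulting counter. */
static int replay_unexpected (respond_f respond, int n)
{
    struct toy_alloc a = { .sent_count = 0 };
    struct toy_job job = { .has_resources = true };

    for (int i = 0; i < n; i++)
        (void)respond (&a, &job);
    return a.sent_count;
}
```

In this model, `replay_unexpected (respond_decrement_first, 3)` leaves the counter at -3, while `replay_unexpected (respond_check_first, 3)` leaves it at 0. Note that the real handler stops alloc processing after the first such error, so this only shows the direction of the drift, not how the reported value of -9 was reached.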

garlick commented 2 months ago

Probably fixed by #6076. Reopen if we see a problem.

flux-framework / flux-core

job-manager: underflow in alloc request to scheduler count #6059