flux-framework / flux-sched

Fluxion Graph-based Scheduler
GNU Lesser General Public License v3.0
90 stars 41 forks source link

qmanager spams annotation updates for all jobs #1249

Closed grondo closed 4 months ago

grondo commented 4 months ago

As noted in flux-framework/flux-core#6114, we recently discovered that a possible contributing factor to rank 0 broker RSS growth is massive amounts of annotation updates from Fluxion for jobs. Speaking with @trws and @milroy earlier today, it appeared that they discovered a possible bug in qmanager that could send unnecessary annotation updates for all jobs on each check callback (so on each event loop iteration).

garlick commented 4 months ago

Yep, I think you found it. Try this:

diff --git a/src/modules/sched-simple/sched.c b/src/modules/sched-simple/sched.c
index 0f08af251..614206d42 100644
--- a/src/modules/sched-simple/sched.c
+++ b/src/modules/sched-simple/sched.c
@@ -292,9 +292,6 @@ static void annotate_reason_pending (struct simple_sched *ss)
 {
     int jobs_ahead = 0;

-    if (!flux_module_debug_test (ss->h, DEBUG_ANNOTATE_REASON_PENDING, false))
-        return;
-
     struct jobreq *job = zlistx_first (ss->queue);
     while (job) {
         if (schedutil_alloc_respond_annotate_pack (ss->util_ctx,
@@ -326,17 +323,17 @@ static void check_cb (flux_reactor_t *r, flux_watcher_t *w,
                       int revents, void *arg)
 {
     struct simple_sched *ss = arg;
-    flux_watcher_stop (ss->idle);
+    //flux_watcher_stop (ss->idle);

     /* See if we can fulfill alloc for a pending job
      * If current head of queue can't be allocated, stop the prep
      *  watcher, i.e. block. O/w, retry on next loop.
      */
     if (try_alloc (ss->h, ss) < 0 && errno == ENOSPC) {
-        annotate_reason_pending (ss);
-        flux_watcher_stop (ss->prep);
-        flux_watcher_stop (ss->check);
+        //flux_watcher_stop (ss->prep);
+        //flux_watcher_stop (ss->check);
     }
+    annotate_reason_pending (ss);
 }

 static int try_free (flux_t *h, struct simple_sched *ss, json_t *R)

then

flux start
flux resource drain 0
flux submit hostname

watch the RSS

garlick commented 4 months ago

Oh just saw in the other bug that you made more progress. Sorry for the noise!

grondo commented 4 months ago

No, this is good, thanks!

I think we're fairly confident this is the source of the RSS growth in the rank 0 broker on elcapi. Since the broker was actually OOM killed twice in one day on that system, this is a pretty high priority fix. @milroy or @trws, if you have a band-aid patch, post it here and I or @jameshcorbett can patch the RPM and we can get this fix onto the system asap.

grondo commented 4 months ago

Note that there is still something to fix in flux-core as noted in flux-framework/flux-core#6114. We should not be storing all annotation updates in memory for perpetuity.

milroy commented 4 months ago

This patch should prevent the extra annotation updates:

diff --git a/qmanager/modules/qmanager_callbacks.cpp b/qmanager/modules/qmanager_callbacks.cpp
index 45bcbf0a..b3823fd3 100644
--- a/qmanager/modules/qmanager_callbacks.cpp
+++ b/qmanager/modules/qmanager_callbacks.cpp
@@ -97,6 +97,7 @@ int qmanager_cb_t::post_sched_loop (
             // if old_at == at, then no reason to send this annotation again.
             if (job->schedule.at == job->schedule.old_at)
                 continue;
+            job->schedule.old_at = job->schedule.at;
             if (schedutil_alloc_respond_annotate_pack (schedutil,
                                                        job->msg,
                                                        "{ s:{s:f} }",

(Edited to remove unnecessary assignment.) @trws is more familiar with this code, though.

grondo commented 4 months ago

Thanks, once we get @trws ACK we can patch the current RPM and issue a new release (-2). @jameshcorbett, if you're not available to do this I can take care of it.

milroy commented 4 months ago

Also, the credit for the above patch goes to @trws since he recognized the fix and proposed the change.

grondo commented 4 months ago

Another idea would be to just comment out the annotation for now (they aren't critical and this would workaround the RSS growth until a proper fix is available.)

jameshcorbett commented 4 months ago

Thanks, once we get @trws ACK we can patch the current RPM and issue a new release (-2). @jameshcorbett, if you're not available to do this I can take care of it.

I'm around and available to do it!

milroy commented 4 months ago

Another idea would be to just comment out the annotation for now

I just ran that through the testsuite and it will also work.

milroy commented 4 months ago

Update: the testsuite fails the annotation test when the annotation response is commented out, but I don't think that will impact necessary functionality. It's probably safer to just use the patch above.