flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

job-list: support job history cache size configuration #3607

Open chu11 opened 3 years ago

chu11 commented 3 years ago

As discussed in #3583, the job-list module can impact job throughput and general performance because it stores the history of all jobs without ever freeing them, using more and more memory over time. In one test, the job-list module alone was responsible for the broker using twice as much memory.

There has been long-term discussion of how to deal with this, such as archiving jobs and perhaps having job-list access archived jobs on the user's behalf.

However, in the short term we should support a configuration that allows the user to limit the amount of job history the job-list module stores, so users can control this and its potential effect on performance. Such a configuration shouldn't matter much for the long term, because access to archived jobs should be independent of how much job history we're willing to cache.
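
As a rough illustration of what the short-term knob might look like (the [job-list] table and "max-inactive" key are just placeholder names, not an existing flux-core option), the module could read an optional limit from the broker config at load time:

/* Rough sketch, not actual job-list code: read an optional
 * [job-list] max-inactive limit from the broker's TOML config.
 * The key name is a placeholder; 0 (no limit) keeps current behavior. */
#include <flux/core.h>

static int get_max_inactive (flux_t *h, int *limitp)
{
    int limit = 0;              /* default: cache all job history */

    if (flux_conf_unpack (flux_get_conf (h),
                          NULL,
                          "{s?{s?i}}",
                          "job-list",
                            "max-inactive", &limit) < 0)
        return -1;
    *limitp = limit;
    return 0;
}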

In #3583 I posted an experimental patch that I used to demonstrate the performance impact.

diff --git a/src/modules/job-list/job_state.c b/src/modules/job-list/job_state.c
index fb51207..0ec1040 100644
--- a/src/modules/job-list/job_state.c
+++ b/src/modules/job-list/job_state.c
@@ -1093,6 +1093,12 @@ static int depthfirst_map_one (struct list_ctx *ctx, const char *key,
     }
     job_insert_list (ctx->jsctx, job, job->state);

+    if (zhashx_size (ctx->jsctx->index) > 10000) {
+        struct job *ptr = NULL;
+        zlistx_last (ctx->jsctx->inactive);
+        ptr = zlistx_detach_cur (ctx->jsctx->inactive);
+        zhashx_delete (ctx->jsctx->index, &ptr->id);
+    }
     rc = 1;
 done:
     if (rc < 0)
@@ -1310,6 +1316,12 @@ static int journal_submit_event (struct job_state_ctx *jsctx,
             errno = ENOMEM;
             return -1;
         }
+        if (zhashx_size (jsctx->index) > 10000) {
+            struct job *ptr = NULL;
+            zlistx_last (jsctx->inactive);
+            ptr = zlistx_detach_cur (jsctx->inactive);
+            zhashx_delete (jsctx->index, &ptr->id);
+        }
     }

     if (job_update_eventlog_seq (jsctx, job, eventlog_seq) == 1)

I believe something not too dissimilar to this patch should work (obviously it should be cleaned up, errors handled, and the 10000 made configurable. Edit: and the size check should be against the inactive list, not the main index).
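
For reference, a slightly cleaned-up sketch of what that eviction could look like as a helper against the inactive list (the function name and the max_inactive field are placeholders, not existing job-list code):

/* Rough sketch of the eviction described above: when the number of cached
 * inactive jobs exceeds a configured limit, drop the oldest inactive job
 * from both the inactive list and the main index.  "max_inactive" is an
 * assumed field on the existing struct job_state_ctx; zhashx_delete() is
 * assumed to free the job via the index's destructor, as in the patch. */
static void evict_oldest_inactive (struct job_state_ctx *jsctx)
{
    while (jsctx->max_inactive > 0
           && zlistx_size (jsctx->inactive) > (size_t)jsctx->max_inactive) {
        struct job *job;

        if (!zlistx_last (jsctx->inactive))    /* cursor -> oldest entry */
            break;
        job = zlistx_detach_cur (jsctx->inactive);
        zhashx_delete (jsctx->index, &job->id);
    }
}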

garlick commented 3 years ago

Any other fallout besides reducing how much history flux jobs -a can display?

Since it looks like flux jobs -a defaults to --count=1000 anyway, would it make sense to make the default that low?

chu11 commented 3 years ago

Any other fallout besides reducing how much history flux jobs -a can display?

The other fallout I have seen so far would be the flux module stats output, where the number of inactive jobs used to mean "all jobs" but would now be the number of jobs currently cached. We could of course manually track the total number of inactive jobs there have been if we want to maintain how this stat is currently used.
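
Something like this sketch is what I have in mind for the stats side (the inactive_total field, the response keys, and the callback itself are placeholders, not the module's actual stats handler; it assumes it lives in job_state.c where czmq.h and jansson.h are already included):

/* Rough sketch, not the actual job-list stats callback: report both the
 * number of currently cached inactive jobs and a monotonic total that is
 * incremented whenever a job goes inactive and never decremented on
 * eviction.  "inactive_total" is an assumed new field on job_state_ctx. */
static void stats_cb (flux_t *h, flux_msg_handler_t *mh,
                      const flux_msg_t *msg, void *arg)
{
    struct job_state_ctx *jsctx = arg;

    if (flux_respond_pack (h, msg,
                           "{s:I s:I}",
                           "inactive_cached",
                           (json_int_t)zlistx_size (jsctx->inactive),
                           "inactive_total",
                           (json_int_t)jsctx->inactive_total) < 0)
        flux_log_error (h, "error responding to stats request");
}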

Since it looks like flux jobs -a defaults to --count=1000 anyway, would it make sense to make the default that low?

I was going to experiment a bit to see where the throughput dropoff begins, which I imagine will be somewhere in the 1024 - 4096 range, so the default would land somewhere around there.

grondo commented 3 years ago

This will affect the results returned by the flux jobs --filter option.

For example, if I run 10K jobs and then check for failures with flux jobs -a --filter=failed, the resulting list may be empty if the failed jobs were dropped from the job-list "cache". That is, when filtering jobs the job-list module may have to go all the way back to the first job to provide a "correct" answer.

Edit: Because this problem affects the implied "correctness" of flux jobs output, I would say that any cache size limit for job-list should be opt-in only until data for old jobs can be fetched somehow, either by faulting back in from the kvs or by switching to use of a database.

chu11 commented 3 years ago

Edit: Because this problem affects the implied "correctness" of flux jobs output, I would say that any cache size limit for job-list should be opt-in only until data for old jobs can be fetched somehow, either by faulting back in from the kvs or by switching to use of a database.

Ahh, good idea.

garlick commented 3 years ago

Ah right, forgot about that. Would be nice if job-archive provided the answers to that sort of query eventually.

chu11 commented 3 years ago

Ah right, forgot about that. Would be nice if job-archive provided the answers to that sort of query eventually.

After fixing this, I'll open a new issue to discuss that. We've discussed it on and off for a while; perhaps it's time to discuss it more earnestly.

chu11 commented 3 years ago

ugh, so there were a few tricky cases I did not think about before embarking on this:

1) when you re-init the job-list module from the KVS, the jobs in the KVS aren't sorted the way job-list cares about. You need to complete the re-initialization from the KVS, sort the jobs, then chop off the older ones if a cache size was set (see the sketch after this list). This is annoying but easily fixable with some flags.

2) the far trickier case is that after re-initing the job-list module from the KVS, the job-manager will replay the cached journal. So jobs that have recently been "evicted" may suddenly be seen again via the journal, and there's no way to know whether such a job is old or new.
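
For case 1, the fix is mostly about ordering; something like this rough sketch (the function names are placeholders, not the real job-list ones):

/* Rough sketch for case 1, with placeholder names: finish the full KVS
 * re-initialization first, sort the rebuilt inactive list, and only then
 * chop it down to the configured limit, e.g. with the eviction helper
 * sketched earlier in this thread. */
static int reinit_from_kvs (struct job_state_ctx *jsctx)
{
    if (load_all_jobs_from_kvs (jsctx) < 0)   /* placeholder for the KVS walk */
        return -1;
    zlistx_sort (jsctx->inactive);            /* czmq sort via set comparator */
    evict_oldest_inactive (jsctx);            /* helper sketched above */
    return 0;
}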

For case 2, a simple but perhaps not pretty solution is to keep track of all jobs that have been removed due to cache evictions. Storing just a job id in another hash takes far less memory than storing all of the currently cached job information.
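
Roughly, something like this (the "evicted" hash and function names are placeholders):

/* Rough sketch for case 2, with placeholder names: remember the ids of
 * evicted jobs in a small side hash so that when the job-manager journal
 * is replayed after a restart, previously evicted jobs can be recognized
 * and skipped rather than re-added as if they were new.  Assumes
 * jsctx->evicted is a zhashx_t configured with 64-bit-id key hashing and
 * comparison like the main index, plus a key duplicator so the id is
 * copied into the hash. */
static int remember_eviction (struct job_state_ctx *jsctx, flux_jobid_t id)
{
    /* only membership matters; any non-NULL pointer works as the item */
    return zhashx_insert (jsctx->evicted, &id, jsctx);
}

static bool journal_id_was_evicted (struct job_state_ctx *jsctx,
                                    flux_jobid_t id)
{
    return zhashx_lookup (jsctx->evicted, &id) != NULL;
}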