flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

implement prototype job-db module #4273

Open garlick opened 2 years ago

garlick commented 2 years ago

Problem: the job-list module's memory cache grows without bound, and it tracks the KVS, which we need to start pruning. The job-archive module is structured mainly to feed data to flux-accounting and offers no online query support.

We want to answer the question: is it feasible to develop a new module that answers job-list queries exactly as the job-list module does now, but instead of only caching jobs in memory, retains inactive job data in a persistent database?

Some notes from our discussions

The goal of prototyping here would be to quickly get something functional that we can discuss.

Automated testing could be deferred until we decide whether to move forward.

chu11 commented 2 years ago

some random thoughts, not necessarily for this prototype, but just thoughts for later (if this works out):

garlick commented 2 years ago

A suggestion would be to leave off guest.exec.eventlog, stdio, annotations, and memos for this experiment. We can always add them later.

Also, a simplifying assumption may be that (unlike job-list) no kvs scan is needed on restart. I think we can make that work by adding some way to flush the job manager journal to consumers and end with an ENODATA during the broker CLEANUP state, before rc3 runs.

chu11 commented 2 years ago

I hit a snag during my prototyping today and thought I'd do a thought dump before moving forward:

notes on technical things to resolve

other thoughts / notes:

overall:

grondo commented 2 years ago

currently the job-archive errors out if no statedir is specified. how do we deal with this in a job-db module? we obviously don't want to error out and end up with no job listing. TBD.

I would have to ponder some of the other questions more, but on this specific issue, sqlite has an in-memory db mode which could be used if there is no statedir or if we want to default to no backing file.
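The in-memory fallback could be a one-line change in how the database is opened. A minimal sketch in Python's stdlib sqlite3 (the real module would use the sqlite C API, e.g. `sqlite3_open(":memory:", &db)`; the `statedir` handling and table here are hypothetical):

```python
import sqlite3

# If no statedir backing file is configured, fall back to sqlite's special
# ":memory:" path, which keeps the whole database in RAM.
statedir = None  # pretend no statedir was configured
path = f"{statedir}/job-db.sqlite" if statedir else ":memory:"
db = sqlite3.connect(path)
db.execute("CREATE TABLE IF NOT EXISTS jobs (id TEXT PRIMARY KEY, t_inactive REAL)")
db.execute("INSERT INTO jobs VALUES (?, ?)", ("f1234", 100.0))
row = db.execute("SELECT id FROM jobs").fetchone()
# Data lives only in memory and is lost when the connection closes,
# which matches the "no backing file" default being discussed.
print(row[0])
```

The same code paths are exercised either way; only the path string changes.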

Wish we had a nosql database we could just add columns as needed.

I'm definitely not an expert, but when researching JSON columns, a lot of Stack Overflow posts described using these columns as a cheap nosql database, adding new columns/indexes on demand later when necessary. Might want to do more research on that.

chu11 commented 2 years ago

the major snag: the job-list/job-db module can be sent the journal event for "job inactive" (eventlog "clean" event) BEFORE the eventlog "clean" event is written to the KVS. So there's a race condition here, which could lead to further complications down the road.

As an experiment, instead of reading the eventlog from the KVS I simply rebuilt the eventlog from all the journal entries sent to me. It was uglier than expected. But might do the trick.
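The reconstruction approach can be sketched roughly like this (illustrative Python; real eventlogs are RFC 18 newline-delimited JSON, and the journal callback shape here is simplified and hypothetical):

```python
import json

# Instead of looking the eventlog up in the KVS (racy, since the journal
# entry can arrive before the KVS commit lands), accumulate events per
# job as they stream in from the journal.
eventlogs = {}  # jobid -> list of event dicts

def journal_cb(jobid, event):
    eventlogs.setdefault(jobid, []).append(event)

def eventlog_encode(jobid):
    # Re-encode as newline-delimited JSON, the on-disk eventlog format.
    return "".join(json.dumps(e) + "\n" for e in eventlogs[jobid])

journal_cb(1, {"timestamp": 1.0, "name": "submit", "context": {"userid": 100}})
journal_cb(1, {"timestamp": 2.0, "name": "clean", "context": {}})
log = eventlog_encode(1)
```

The corner cases come from events that live outside the main eventlog (guest eventlogs, annotations, etc.), which this simple per-job accumulation does not capture.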

A better solution might be to re-work the journaling to only send journal updates after eventlogs are written to the KVS, but that is the far trickier option at the moment; that chunk of code in the job-manager is quite delicate. See https://github.com/flux-framework/flux-core/issues/3931

garlick commented 2 years ago

I was going to suggest just reconstructing the event log from the journal. What did the problem turn out to be with that?


chu11 commented 2 years ago

What did the problem turn out to be with that?

Mostly just hitting corner cases :-) It'd be easier if I could just grab the eventlog, vs having to re-construct it ... and via the job-list restart.

grondo commented 2 years ago

What if instead of trying to make sure the journal events are delayed until the associated kvs commit is complete, we added an entry to the journal for jobs after the final entry is committed to the eventlog? Actually, isn't this already implied by the job-state events that are sent out? Could you use that to trigger a kvs lookup of all job data to be archived?

Edit: I mean if we ever want to store the exec eventlog or output eventlogs to allow this data to be queryable via the database, we'll have to get access to things other than the main eventlog anyway...

chu11 commented 2 years ago

What if instead of trying to make sure the journal events are delayed until the associated kvs commit is complete, we added an entry to the journal for jobs after the final entry is committed to the eventlog? Actually, isn't this already implied by the job-state events that are sent out? Could you use that to trigger a kvs lookup of all job data to be archived?

I was thinking about that as an option too, although I thought of writing it out to the eventlog. Just having some "flush" event or similar in the journal makes more sense.

I didn't pursue it for this prototype as I couldn't quite figure out how to easily add it in the job-manager events stuff.

This prototype I'm doing is also all spaghetti code, so it might not be as bad if I properly refactored things vs just cutting and pasting things lazily.

garlick commented 2 years ago

It'd be easier if I could just grab the eventlog, vs having to re-construct it ... and via the job-list restart.

I didn't quite understand what you mean by via the job-list restart. I was presuming job-db would not need to scan the KVS on restart since it already has all the inactive jobs so far...

chu11 commented 2 years ago

I didn't quite understand what you mean by via the job-list restart. I was presuming job-db would not need to scan the KVS on restart since it already has all the inactive jobs so far...

oh haha, yeah, you're right. I just haven't gotten to the restart stuff yet, hit bugs and went along trying to fix them as i hit them :-)

It'll probably end up being less annoying than I originally anticipated.

chu11 commented 2 years ago

I finally got an experimental branch that completes tests in sharness (note completes, not passes :-). This prototype is garbage of course, tons of memleaks, tons of lazy cut & paste, no refactoring. A nice chunk of tests don't pass, as there are some complications or bugs I just decided I didn't want to solve or dig into for the time being.

https://github.com/chu11/flux-core/tree/issue4273_job_db_prototype

some random notes

some thoughts:

Next steps:

In terms of "is this reasonable", I think it is. Although I'm struggling with next steps.

garlick commented 2 years ago

still need to scan the KVS on restart b/c job-list builds up its internal list of pending / running jobs from this vs. the journal.

If we load job-list in the rc1 script right after job-manager, then aren't we assured that it can capture all pending/running jobs from the journal? If not maybe we need to rethink how the journal works, since the job-manager would have just finished doing the exact same thing and has it all in memory.

I like @grondo's suggestion that we use an "in memory" database when a statedir is not configured (or perhaps a flag indicating "archive to sqlite"). In this hacky prototype, I had to change a ton of tests to specify a statedir, which is annoying. In addition, I wasn't even sure how to set statedir in some cases, e.g. how to specify a unique statedir for each sub flux instance when using --cc.

If it's easier, you could also do what content-sqlite does and fall back to rundir if statedir is undefined. This doesn't persist after the instance is done but it would use the same code paths as when statedir is defined.

No further thoughts at the moment except a nagging unease that we need to be thinking more broadly about how the job modules interact and try to adjust the design to make everything work better together if the current design is failing us.

chu11 commented 2 years ago

If we load job-list in the rc1 script right after job-manager, then aren't we assured that it can capture all pending/running jobs from the journal? If not maybe we need to rethink how the journal works, since the job-manager would have just finished doing the exact same thing and has it all in memory.

My skimming of the code is that we only send new events via the journal, not old ones. We could send all old events, which might make sense once we've started removing inactive jobs.

garlick commented 2 years ago

Ah gotcha. Right now all jobs are killed when an instance is restarted so we don't have that problem but we will. Good point. Let's see if we can work out how to let the job manager send everything over because then we can get rid of the sequence numbers and avoid duplicate effort.

Edit: I do have that on a branch somewhere, at least a prototype.

chu11 commented 2 years ago

Ah gotcha. Right now all jobs are killed when an instance is restarted so we don't have that problem but we will. Good point. Let's see if we can work out how to let the job manager send everything over because then we can get rid of the sequence numbers and avoid duplicate effort.

Was thinking about this problem today, one idea that came to me. When job-list requests a stream of journal events from the job-manager, could the job-manager send all the events for all currently pending & running jobs?

There is a small race where a job could finish and go inactive between the time job-manager starts and job-list requests the journal. Perhaps we could just cache these until job-list connects up. All other inactive jobs could be assumed to be in the sqlite archive.

chu11 commented 2 years ago

began experimenting with the sqlite JSON type and with the new job-list "all" attribute; the second attempt at a prototype job-db went a lot better. Now all job data is stored in a single JSON blob and put into the archive. So when re-loading data from the archive, it's easy: just read one JSON blob back.

All tests pass except for a few related to the new job-manager purge-inactive-jobs functionality (purging in job-list doesn't make sense when all inactive jobs are in sqlite); I just didn't deal with it for the moment.

https://github.com/chu11/flux-core/tree/issue4273_job_db_prototype_archive_oneblob

interesting notes

const char *sql_create_table = "CREATE TABLE if not exists jobs("
                               "  id CHAR(16) PRIMARY KEY,"
                               "  t_inactive REAL,"
                               "  jobdata JSON,"
                               "  eventlog TEXT,"
                               "  jobspec JSON,"
                               "  R JSON"
                               ");";

basically, we still need the t_inactive column so we can sort the jobs we return to the user via the SQL query.
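The listing query that motivates keeping a plain t_inactive column alongside the JSON blob might look like this (illustrative Python sketch; the ids and timestamps are made up):

```python
import sqlite3

# With t_inactive as a real column, list queries can sort and limit in
# SQL, newest-inactive first, without parsing every JSON blob.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE jobs (id TEXT PRIMARY KEY, t_inactive REAL, jobdata JSON)")
db.executemany("INSERT INTO jobs VALUES (?, ?, ?)",
               [("f1", 10.0, "{}"), ("f2", 30.0, "{}"), ("f3", 20.0, "{}")])
rows = db.execute(
    "SELECT id FROM jobs ORDER BY t_inactive DESC LIMIT 2").fetchall()
print([r[0] for r in rows])
```

An ordinary index on t_inactive would also keep this sort cheap as the table grows, without relying on any JSON functions.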

Got some far clearer refactorings going forward that will hopefully get us closer to where we want to be:

Update:

Update2:

chu11 commented 2 years ago

i think i have a prototype that is approaching something "good" that we'd like to consider for actual use.

const char *sql_create_table = "CREATE TABLE if not exists jobs("
                               "  id CHAR(16) PRIMARY KEY,"
                               "  t_inactive REAL,"
                               "  jobdata JSON,"
                               "  eventlog TEXT,"
                               "  jobspec JSON,"
                               "  R JSON"
                               ");";

The main reason we keep the t_inactive column is that it stays backwards compatible with sqlite versions that don't have JSON support.

garlick commented 2 years ago

did we decide that job-archive could stick around as long as flux-accounting needs it (but that ultimately flux-accounting would transition to this)?

chu11 commented 2 years ago

@garlick yeah, job-archive will stick around until it is no longer needed.