flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

implement prototype job-db module #4273

Open garlick opened 2 years ago

garlick commented 2 years ago

Problem: the job-list module's memory cache grows without bound, and it tracks the KVS, which we need to start pruning. The job-archive module is structured mainly to feed data to flux-accounting and offers no online query support.

We want to answer the question: is it feasible to develop a new module that answers job-list queries exactly as the job-list module does now, but instead of only caching jobs in memory, retains inactive job data in a persistent database?

Some notes from our discussions

The goal of prototyping here would be to quickly get something functional that we can discuss.

Automated testing could be deferred until we decide whether to move forward.

chu11 commented 2 years ago

some random thoughts, not necessarily for this prototype, but just thoughts for later (if this works out):

garlick commented 2 years ago

A suggestion would be to leave off guest.exec.eventlog, stdio, annotations, and memos for this experiment. We can always add them later.

Also, a simplifying assumption may be that (unlike job-list) no kvs scan is needed on restart. I think we can make that work by adding some way to flush the job manager journal to consumers and end with an ENODATA during the broker CLEANUP state, before rc3 runs.

chu11 commented 2 years ago

I hit a snag during my prototyping today and thought I'd do a thought dump before moving forward:

notes on technical things to resolve

other thoughts / notes:

overall:

grondo commented 2 years ago

currently the job-archive errors out if no statedir is specified. how do we deal with this in a job-db module? we obviously don't want to error out and end up with no job listing. TBD.

I would have to ponder some of the other questions more, but on this specific issue, sqlite has an in-memory db mode which could be used if there is no statedir or if we want to default to no backing file.
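The in-memory fallback could be a one-line change in how the database is opened. A minimal sketch in Python's stdlib sqlite3 (the real module would use the sqlite C API, e.g. `sqlite3_open(":memory:", &db)`; the `statedir` handling and table here are hypothetical):

```python
import sqlite3

# If no statedir backing file is configured, fall back to sqlite's special
# ":memory:" path, which keeps the whole database in RAM.
statedir = None  # pretend no statedir was configured
path = f"{statedir}/job-db.sqlite" if statedir else ":memory:"
db = sqlite3.connect(path)
db.execute("CREATE TABLE IF NOT EXISTS jobs (id TEXT PRIMARY KEY, t_inactive REAL)")
db.execute("INSERT INTO jobs VALUES (?, ?)", ("f1234", 100.0))
row = db.execute("SELECT id FROM jobs").fetchone()
# Data lives only in memory and is lost when the connection closes,
# which matches the "no backing file" default being discussed.
print(row[0])
```

The same code paths are exercised either way; only the path string changes.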

Wish we had a nosql database we could just add columns as needed.

I'm definitely not an expert, but when researching JSON columns, a lot of Stack Overflow posts described using these columns as a cheap nosql database, adding new columns/indexes on demand later when necessary. Might want to do more research on that.

chu11 commented 2 years ago

the major snag: the job-list/job-db module can be sent the journal event for "job inactive" (eventlog "clean" event) BEFORE the eventlog "clean" event is written to the KVS. So there's a race condition here, which could lead to further complications down the road.

As an experiment, instead of reading the eventlog from the KVS I simply rebuilt the eventlog from all the journal entries sent to me. It was uglier than expected. But might do the trick.
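The reconstruction approach can be sketched roughly like this (illustrative Python; real eventlogs are RFC 18 newline-delimited JSON, and the journal callback shape here is simplified and hypothetical):

```python
import json

# Instead of looking the eventlog up in the KVS (racy, since the journal
# entry can arrive before the KVS commit lands), accumulate events per
# job as they stream in from the journal.
eventlogs = {}  # jobid -> list of event dicts

def journal_cb(jobid, event):
    eventlogs.setdefault(jobid, []).append(event)

def eventlog_encode(jobid):
    # Re-encode as newline-delimited JSON, the on-disk eventlog format.
    return "".join(json.dumps(e) + "\n" for e in eventlogs[jobid])

journal_cb(1, {"timestamp": 1.0, "name": "submit", "context": {"userid": 100}})
journal_cb(1, {"timestamp": 2.0, "name": "clean", "context": {}})
log = eventlog_encode(1)
```

The corner cases come from events that live outside the main eventlog (guest eventlogs, annotations, etc.), which this simple per-job accumulation does not capture.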

A better solution might be to re-work the journaling to only send journal updates after eventlogs are written to the KVS, but that is the far trickier option at the moment; that chunk of code in the job-manager is quite delicate. See https://github.com/flux-framework/flux-core/issues/3931

garlick commented 2 years ago

I was going to suggest just reconstructing the event log from the journal. What did the problem turn out to be with that?


chu11 commented 2 years ago

What did the problem turn out to be with that?

Mostly just hitting corner cases :-) It'd be easier if I could just grab the eventlog, vs having to re-construct it ... and via the job-list restart.

grondo commented 2 years ago

What if instead of trying to make sure the journal events are delayed until the associated kvs commit is complete, we added an entry to the journal for jobs after the final entry is committed to the eventlog? Actually, isn't this already implied by the job-state events that are sent out? Could you use that to trigger a kvs lookup of all job data to be archived?

Edit: I mean if we ever want to store the exec eventlog or output eventlogs to allow this data to be queryable via the database, we'll have to get access to things other than the main eventlog anyway...

chu11 commented 2 years ago

What if instead of trying to make sure the journal events are delayed until the associated kvs commit is complete, we added an entry to the journal for jobs after the final entry is committed to the eventlog? Actually, isn't this already implied by the job-state events that are sent out? Could you use that to trigger a kvs lookup of all job data to be archived?

I was thinking about that as an option too, although I thought of writing it out to the eventlog. Just having some "flush" event or similar in the journal makes more sense.

I didn't pursue it for this prototype as I couldn't quite figure out how to easily add it in the job-manager events stuff.

This prototype I'm doing is also all spaghetti code, so it might not be as bad if I properly refactored things vs just cutting and pasting things lazily.

garlick commented 2 years ago

It'd be easier if I could just grab the eventlog, vs having to re-construct it ... and via the job-list restart.

I didn't quite understand what you mean by via the job-list restart. I was presuming job-db would not need to scan the KVS on restart since it already has all the inactive jobs so far...

chu11 commented 2 years ago

I didn't quite understand what you mean by via the job-list restart. I was presuming job-db would not need to scan the KVS on restart since it already has all the inactive jobs so far...

oh haha, yeah, you're right. I just haven't gotten to the restart stuff yet, hit bugs and went along trying to fix them as i hit them :-)

It'll probably end up being less annoying than I originally anticipated.

chu11 commented 2 years ago

I finally got an experimental branch that completes tests in sharness (note completes, not passes :-). This prototype is garbage of course, tons of memleaks, tons of lazy cut & paste, no refactoring. A nice chunk of tests don't pass, as there are some complications or bugs I just decided I didn't want to solve or dig into for the time being.

https://github.com/chu11/flux-core/tree/issue4273_job_db_prototype

some random notes

some thoughts:

Next steps:

In terms of "is this reasonable", I think it is. Although I'm struggling with next steps.

garlick commented 2 years ago

still need to scan the KVS on restart b/c job-list builds up its internal list of pending / running jobs from this vs. the journal.

If we load job-list in the rc1 script right after job-manager, then aren't we assured that it can capture all pending/running jobs from the journal? If not maybe we need to rethink how the journal works, since the job-manager would have just finished doing the exact same thing and has it all in memory.

I like @grondo's suggestion that we use an "in memory" database when a statedir is not configured (or perhaps a flag indicating "archive to sqlite"). In this hacky prototype, I had to change a ton of tests to specify a statedir, which is annoying. In addition, I wasn't even sure how to set statedir in some cases, e.g. how to specify a unique statedir for each sub flux instance when using --cc.

If it's easier, you could also do what content-sqlite does and fall back to rundir if statedir is undefined. This doesn't persist after the instance is done but it would use the same code paths as when statedir is defined.

No further thoughts at the moment except a nagging unease that we need to be thinking more broadly about how the job modules interact and try to adjust the design to make everything work better together if the current design is failing us.

chu11 commented 2 years ago

If we load job-list in the rc1 script right after job-manager, then aren't we assured that it can capture all pending/running jobs from the journal? If not maybe we need to rethink how the journal works, since the job-manager would have just finished doing the exact same thing and has it all in memory.

My skimming of the code is that we only send new events via the journal, not old ones. We could send all old events, which might make sense once we've started removing inactive jobs.

garlick commented 2 years ago

Ah gotcha. Right now all jobs are killed when an instance is restarted so we don't have that problem but we will. Good point. Let's see if we can work out how to let the job manager send everything over because then we can get rid of the sequence numbers and avoid duplicate effort.

Edit: I do have that on a branch somewhere, at least a prototype.

chu11 commented 2 years ago

Ah gotcha. Right now all jobs are killed when an instance is restarted so we don't have that problem but we will. Good point. Let's see if we can work out how to let the job manager send everything over because then we can get rid of the sequence numbers and avoid duplicate effort.

Was thinking about this problem today, one idea that came to me. When job-list requests a stream of journal events from the job-manager, could the job-manager send all the events for all currently pending & running jobs?

There is a small race where a job could finish and go inactive between the time job-manager starts and job-list requests the journal. Perhaps we could just cache these until job-list connects up. All other inactive jobs could be assumed to be in the sqlite archive.

chu11 commented 2 years ago

began experimenting with the sqlite JSON type and with the new job-list "all" attribute; the second attempt at a prototype job-db went a lot better. Now all job data is stored in a single JSON blob and put into the archive. So when re-loading data from the archive, it's easy: just read one JSON blob back.

All tests pass except for a few related to the new job-manager purge-inactive-jobs functionality (purging in job-list doesn't make sense when all inactive jobs are in sqlite); I just didn't deal with it for the moment.

https://github.com/chu11/flux-core/tree/issue4273_job_db_prototype_archive_oneblob

interesting notes

const char *sql_create_table = "CREATE TABLE if not exists jobs("
                               "  id CHAR(16) PRIMARY KEY,"
                               "  t_inactive REAL,"
                               "  jobdata JSON,"
                               "  eventlog TEXT,"
                               "  jobspec JSON,"
                               "  R JSON"
                               ");";

basically, we still need the t_inactive column so we can sort the jobs we return to the user via the SQL query.
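The listing query that motivates keeping a plain t_inactive column alongside the JSON blob might look like this (illustrative Python sketch; the ids and timestamps are made up):

```python
import sqlite3

# With t_inactive as a real column, list queries can sort and limit in
# SQL, newest-inactive first, without parsing every JSON blob.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE jobs (id TEXT PRIMARY KEY, t_inactive REAL, jobdata JSON)")
db.executemany("INSERT INTO jobs VALUES (?, ?, ?)",
               [("f1", 10.0, "{}"), ("f2", 30.0, "{}"), ("f3", 20.0, "{}")])
rows = db.execute(
    "SELECT id FROM jobs ORDER BY t_inactive DESC LIMIT 2").fetchall()
print([r[0] for r in rows])
```

An ordinary index on t_inactive would also keep this sort cheap as the table grows, without relying on any JSON functions.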

Got some far clearer refactorings going forward that will hopefully get us closer to where we want to be:

Update:

Update2:

chu11 commented 2 years ago

i think i have a prototype that is approaching something "good" that we'd like to consider for actual use.

const char *sql_create_table = "CREATE TABLE if not exists jobs("
                               "  id CHAR(16) PRIMARY KEY,"
                               "  t_inactive REAL,"
                               "  jobdata JSON,"
                               "  eventlog TEXT,"
                               "  jobspec JSON,"
                               "  R JSON"
                               ");";

The main reason we keep the t_inactive column is that it stays backwards compatible with sqlite versions that don't have JSON support.

garlick commented 2 years ago

did we decide that job-archive could stick around as long as flux-accounting needs it (but that ultimately flux-accounting would transition to this)?

chu11 commented 2 years ago

@garlick yeah, job-archive will stick around until it is no longer needed.