flux-framework / flux-core

core services for the Flux resource management framework

jobs remain in CLEANUP state after fatal scheduler-restart exception #5579

Open garlick opened 8 months ago

garlick commented 8 months ago

Problem: the corona management node was shut down abruptly. Afterwards, flux jobs shows several jobs in C state.

For example:

f2cx7pwPP9CK pdebug   hobbs17  flux        C      2      2   2.621h corona[171,173]

Job eventlog says

$ flux job eventlog f2cx7pwPP9CK
1700501129.045566 submit userid=60943 urgency=16 flags=0 version=1
1700501129.063712 validate
1700501129.077765 depend
1700501129.077866 priority priority=24028
1700501129.112807 alloc
1700501129.116187 prolog-start description="job-manager.prolog"
1700501129.116215 prolog-start description="cray-pals-port-distributor"
1700501129.374213 cray_port_distribution ports=[11979,11978] random_integer=-4201776475281207022
1700501129.374432 prolog-finish description="cray-pals-port-distributor" status=0
1700501130.060089 prolog-finish description="job-manager.prolog" status=0
1700501130.067440 start
1700501130.366923 memo uri="ssh://corona171/var/tmp/hobbs17/flux-V1b6Nu/local-0"
1700501137.525261 finish status=0
1700501137.526100 epilog-start description="job-manager.epilog"
1700501137.793074 release ranks="all" final=true
1700510565.815566 exception type="scheduler-restart" severity=0 note="failed to reallocate R for running job" userid=765
$ flux job list -u hobbs17 | jq
{
  "id": 697830805111572480,
  "userid": 60943,
  "urgency": 16,
  "priority": 24028,
  "t_submit": 1700501129.0455658,
  "t_depend": 1700501129.0637124,
  "t_run": 1700501129.1128068,
  "t_cleanup": 1700510565.8155658,
  "state": 32,
  "name": "flux",
  "cwd": "/g/g0/hobbs17",
  "queue": "pdebug",
  "ntasks": 2,
  "ncores": 96,
  "duration": 1800.0,
  "nnodes": 2,
  "ranks": "[4,6]",
  "nodelist": "corona[171,173]",
  "expiration": 1700502929.0,
  "annotations": {
    "user": {
      "uri": "ssh://corona171/var/tmp/hobbs17/flux-V1b6Nu/local-0"
    }
  },
  "waitstatus": 0
}
garlick commented 8 months ago

Oh there is an epilog-start event but no epilog-finish.

garlick commented 8 months ago

Confirmed, all the jobs in this state have an epilog-start but no epilog-finish.

Also note that the time between the epilog start and the exception is about 2.6 hours. Maybe the epilog was stuck; the system was reportedly having NFS issues.
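
For reference, one way to spot jobs in this state (a sketch, assuming flux jobs supports -A, --filter, and the {id.f58} format field as in current flux-core):

# List CLEANUP jobs for all users whose eventlog lacks an epilog-finish event.
for id in $(flux jobs -A --filter=CLEANUP -n -o "{id.f58}"); do
    flux job eventlog $id | grep -q epilog-finish || echo $id
done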

garlick commented 8 months ago

Side note: I'm not sure how to get rid of this job. This doesn't work:

$ flux job purge -f f2cnGN8MBAu5
flux-job: purge: cannot purge active job

It would be handy if we could force it somehow.

grondo commented 8 months ago

I think this is due to known issue #4108. There is no epilog-finish event because the perilog plugin lost track of the epilog process after the restart.

A workaround for now would be to load a temporary jobtap plugin that emits the epilog-finish event for the stuck jobs. Sorry, we don't have anything better to offer at this time.

grondo commented 8 months ago

Here's a jobtap plugin that might work to post the missing epilog-finish events for jobs in this state. It should be loaded like this:

$ flux jobtap load /path/to/plugin.so jobs="[ID1, ID2, ...]"

where ID1, ID2, etc. are the integer (decimal) representations of the jobids to target.
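
The decimal form of a jobid can be obtained with flux job id; for example (the value shown is the id reported by flux job list above):

$ flux job id --to=dec f2cx7pwPP9CK
697830805111572480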

The plugin should then be removed manually with flux jobtap remove plugin.so once it has posted the events.

Of course, I could not test this since I don't have any stuck jobs handy...


#include <stdint.h>
#include <errno.h>
#include <jansson.h>
#include <flux/core.h>
#include <flux/jobtap.h>

/* Post an epilog-finish event for each jobid in 'jobs'.
 */
static int post_events (flux_t *h,
                        flux_plugin_t *p,
                        json_t *jobs,
                        const char *description)
{
    size_t index;
    json_t *entry;

    json_array_foreach (jobs, index, entry) {
        flux_jobid_t id;
        if (!json_is_integer (entry)) {
            errno = EINVAL;
            flux_log_error (h, "jobs array entries must be integer jobids");
            return -1;
        }
        id = json_integer_value (entry);
        if (flux_jobtap_epilog_finish (p, id, description, 0) < 0)
            flux_log_error (h,
                            "failed to post epilog-finish event for %ju",
                            (uintmax_t) id);
    }
    return 0;
}

int flux_plugin_init (flux_plugin_t *p)
{
    json_t *jobs;
    const char *description = "job-manager.epilog";
    flux_t *h = flux_jobtap_get_flux (p);

    /* Load-time arguments (jobs="[...]" and an optional description=...)
     * are provided to the plugin as its conf.
     */
    if (flux_plugin_conf_unpack (p,
                                 "{s?s s:o}",
                                 "description", &description,
                                 "jobs", &jobs) < 0) {
        flux_log_error (h, "error unpacking plugin conf (jobs array required)");
        return -1;
    }
    if (!json_is_array (jobs)) {
        flux_log_error (h, "jobs conf value must be an array of integer jobids");
        return -1;
    }
    return post_events (h, p, jobs, description);
}
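
A possible way to build the plugin, assuming the flux-core and jansson development packages are installed and visible to pkg-config (file names are illustrative):

$ cc -fPIC -shared -o plugin.so plugin.c $(pkg-config --cflags --libs flux-core jansson)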
garlick commented 8 months ago

Nice! I'll run that on corona in a bit when I'm back online.

grondo commented 8 months ago

FYI, I just ran the above jobtap plugin against all CLEANUP jobs on Corona and they're all inactive now, e.g.:

# flux job eventlog f2cx7pwPP9CK
1700501129.045566 submit userid=60943 urgency=16 flags=0 version=1
1700501129.063712 validate
1700501129.077765 depend
1700501129.077866 priority priority=24028
1700501129.112807 alloc
1700501129.116187 prolog-start description="job-manager.prolog"
1700501129.116215 prolog-start description="cray-pals-port-distributor"
1700501129.374213 cray_port_distribution ports=[11979,11978] random_integer=-4201776475281207022
1700501129.374432 prolog-finish description="cray-pals-port-distributor" status=0
1700501130.060089 prolog-finish description="job-manager.prolog" status=0
1700501130.067440 start
1700501130.366923 memo uri="ssh://corona171/var/tmp/hobbs17/flux-V1b6Nu/local-0"
1700501137.525261 finish status=0
1700501137.526100 epilog-start description="job-manager.epilog"
1700501137.793074 release ranks="all" final=true
1700510565.815566 exception type="scheduler-restart" severity=0 note="failed to reallocate R for running job" userid=765
1701101824.652934 epilog-finish description="job-manager.epilog" status=0
1701101824.653028 clean
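
For reference, a sketch of how the plugin could be pointed at every CLEANUP job at once (assuming the {id.dec} format field and --filter option behave as in current flux-core; the plugin path is illustrative):

# Build a comma-separated list of decimal ids for all CLEANUP jobs,
# load the plugin against them, then remove it once the jobs go inactive.
ids=$(flux jobs -A --filter=CLEANUP -n -o "{id.dec}" | paste -sd, -)
flux jobtap load /path/to/plugin.so jobs="[$ids]"
flux jobtap remove plugin.so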
garlick commented 3 months ago

Closed by #5848?

grondo commented 3 months ago

Probably not; that was just the addition of a utility to fix up this state.

We can consider this resolved when the job manager can recapture any pending epilog processes after a restart and either confirm that they have completed or restart them as necessary.