flux-framework / flux-core

core services for the Flux resource management framework

jobs remain in CLEANUP state after fatal scheduler-restart exception #5579

Open garlick opened 8 months ago

garlick commented 8 months ago

Problem: the corona management node was shut down abruptly. Afterwards, flux jobs shows several jobs in C state.

For example:

f2cx7pwPP9CK pdebug   hobbs17  flux        C      2      2   2.621h corona[171,173]

Job eventlog says

$ flux job eventlog f2cx7pwPP9CK
1700501129.045566 submit userid=60943 urgency=16 flags=0 version=1
1700501129.063712 validate
1700501129.077765 depend
1700501129.077866 priority priority=24028
1700501129.112807 alloc
1700501129.116187 prolog-start description="job-manager.prolog"
1700501129.116215 prolog-start description="cray-pals-port-distributor"
1700501129.374213 cray_port_distribution ports=[11979,11978] random_integer=-4201776475281207022
1700501129.374432 prolog-finish description="cray-pals-port-distributor" status=0
1700501130.060089 prolog-finish description="job-manager.prolog" status=0
1700501130.067440 start
1700501130.366923 memo uri="ssh://corona171/var/tmp/hobbs17/flux-V1b6Nu/local-0"
1700501137.525261 finish status=0
1700501137.526100 epilog-start description="job-manager.epilog"
1700501137.793074 release ranks="all" final=true
1700510565.815566 exception type="scheduler-restart" severity=0 note="failed to reallocate R for running job" userid=765
$ flux job list -u hobbs17 | jq
{
  "id": 697830805111572480,
  "userid": 60943,
  "urgency": 16,
  "priority": 24028,
  "t_submit": 1700501129.0455658,
  "t_depend": 1700501129.0637124,
  "t_run": 1700501129.1128068,
  "t_cleanup": 1700510565.8155658,
  "state": 32,
  "name": "flux",
  "cwd": "/g/g0/hobbs17",
  "queue": "pdebug",
  "ntasks": 2,
  "ncores": 96,
  "duration": 1800.0,
  "nnodes": 2,
  "ranks": "[4,6]",
  "nodelist": "corona[171,173]",
  "expiration": 1700502929.0,
  "annotations": {
    "user": {
      "uri": "ssh://corona171/var/tmp/hobbs17/flux-V1b6Nu/local-0"
    }
  },
  "waitstatus": 0
}
garlick commented 8 months ago

Oh there is an epilog-start event but no epilog-finish.

garlick commented 8 months ago

Confirmed, all the jobs in this state have an epilog-start but no epilog-finish.

Also note that the time between the epilog start and the exception is about 2.6 hours. Maybe the epilog was stuck; the system was reportedly having NFS issues.
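
For reference, one way to spot jobs in this state (a sketch, assuming flux jobs supports -A, --filter, and the {id.f58} format field as in current flux-core):

# List CLEANUP jobs for all users whose eventlog lacks an epilog-finish event.
for id in $(flux jobs -A --filter=CLEANUP -n -o "{id.f58}"); do
    flux job eventlog $id | grep -q epilog-finish || echo $id
done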

garlick commented 8 months ago

Side note: I'm not sure how to get rid of this job. This doesn't work:

$ flux job purge -f f2cnGN8MBAu5
flux-job: purge: cannot purge active job

It would be handy if we could force it somehow.

grondo commented 8 months ago

I think this is due to known issue #4108. There is no epilog-finish event because the perilog plugin lost track of the epilog process after the restart.

A workaround for now would be to load a temporary jobtap plugin that emits the epilog-finish event for the stuck jobs. Sorry, we don't have anything better to offer at this time.

grondo commented 8 months ago

Here's a jobtap plugin that might work to post the missing epilog-finish events for jobs in this state. It should be loaded like this:

$ flux jobtap load /path/to/plugin.so jobs="[ID1, ID2, ...]"

where ID1, ID2, etc. are the integer (decimal) representations of the jobids to target.
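
The decimal form of a jobid can be obtained with flux job id; for example (the value shown is the id reported by flux job list above):

$ flux job id --to=dec f2cx7pwPP9CK
697830805111572480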

The plugin should then be removed manually with flux jobtap remove plugin.so once it has posted the events.

Of course, I could not test this since I don't have any stuck jobs handy...


#include <stdint.h>
#include <errno.h>
#include <jansson.h>
#include <flux/core.h>
#include <flux/jobtap.h>

/* Post an epilog-finish event for each jobid in 'jobs'.
 */
static int post_events (flux_t *h,
                        flux_plugin_t *p,
                        json_t *jobs,
                        const char *description)
{
    size_t index;
    json_t *entry;

    json_array_foreach (jobs, index, entry) {
        flux_jobid_t id;
        if (!json_is_integer (entry)) {
            errno = EINVAL;
            flux_log_error (h, "jobs array entries must be integer jobids");
            return -1;
        }
        id = json_integer_value (entry);
        if (flux_jobtap_epilog_finish (p, id, description, 0) < 0)
            flux_log_error (h,
                            "failed to post epilog-finish event for %ju",
                            (uintmax_t) id);
    }
    return 0;
}

int flux_plugin_init (flux_plugin_t *p)
{
    json_t *jobs;
    const char *description = "job-manager.epilog";
    flux_t *h = flux_jobtap_get_flux (p);

    /* Load-time arguments (jobs="[...]" and an optional description=...)
     * are provided to the plugin as its conf.
     */
    if (flux_plugin_conf_unpack (p,
                                 "{s?s s:o}",
                                 "description", &description,
                                 "jobs", &jobs) < 0) {
        flux_log_error (h, "error unpacking plugin conf (jobs array required)");
        return -1;
    }
    if (!json_is_array (jobs)) {
        flux_log_error (h, "jobs conf value must be an array of integer jobids");
        return -1;
    }
    return post_events (h, p, jobs, description);
}
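
A possible way to build the plugin, assuming the flux-core and jansson development packages are installed and visible to pkg-config (file names are illustrative):

$ cc -fPIC -shared -o plugin.so plugin.c $(pkg-config --cflags --libs flux-core jansson)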
garlick commented 8 months ago

Nice! I'll run that on corona in a bit when I'm back online.

grondo commented 8 months ago

FYI, I just ran the above jobtap plugin against all CLEANUP jobs on Corona and they're all inactive now, e.g.:

# flux job eventlog f2cx7pwPP9CK
1700501129.045566 submit userid=60943 urgency=16 flags=0 version=1
1700501129.063712 validate
1700501129.077765 depend
1700501129.077866 priority priority=24028
1700501129.112807 alloc
1700501129.116187 prolog-start description="job-manager.prolog"
1700501129.116215 prolog-start description="cray-pals-port-distributor"
1700501129.374213 cray_port_distribution ports=[11979,11978] random_integer=-4201776475281207022
1700501129.374432 prolog-finish description="cray-pals-port-distributor" status=0
1700501130.060089 prolog-finish description="job-manager.prolog" status=0
1700501130.067440 start
1700501130.366923 memo uri="ssh://corona171/var/tmp/hobbs17/flux-V1b6Nu/local-0"
1700501137.525261 finish status=0
1700501137.526100 epilog-start description="job-manager.epilog"
1700501137.793074 release ranks="all" final=true
1700510565.815566 exception type="scheduler-restart" severity=0 note="failed to reallocate R for running job" userid=765
1701101824.652934 epilog-finish description="job-manager.epilog" status=0
1701101824.653028 clean
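
For reference, a sketch of how the plugin could be pointed at every CLEANUP job at once (assuming the {id.dec} format field and --filter option behave as in current flux-core; the plugin path is illustrative):

# Build a comma-separated list of decimal ids for all CLEANUP jobs,
# load the plugin against them, then remove it once the jobs go inactive.
ids=$(flux jobs -A --filter=CLEANUP -n -o "{id.dec}" | paste -sd, -)
flux jobtap load /path/to/plugin.so jobs="[$ids]"
flux jobtap remove plugin.so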
garlick commented 3 months ago

Closed by #5848?

grondo commented 3 months ago

Probably not; that was just the addition of a utility to fix up this state.

We can consider this resolved when the job manager can recapture any pending epilog processes after a restart and either confirm that they have completed or restart them as necessary.