garlick opened this issue 12 months ago
Oh there is an epilog-start event but no epilog-finish.
Confirmed, all the jobs in this state have an epilog-start but no epilog-finish.
Also note that the time between the epilog start and the exception is about 2h. Maybe the epilog was stuck. The system was supposedly having NFS issues.
Side note: I'm not sure how to get rid of this job. This doesn't work:
$ flux job purge -f f2cnGN8MBAu5
flux-job: purge: cannot purge active job
It would be handy if we could force it somehow.
I think this is due to known issue #4108. There is no epilog-finish
event because the perilog plugin lost track of the epilog process after the restart.
A solution for now would be to load a temporary plugin that emits the epilog-finish event for the stuck jobs. Sorry we don't have anything better to offer at this time.
Here's a jobtap plugin that might work to post the missing epilog-finish
events for jobs in this state. It should be loaded like:
$ flux jobtap load /path/to/plugin.so jobs="[ID1, ID2, ...]"
where ID1, ID2, etc. are the integer representations of the jobids to target.
The plugin should then be manually removed with flux jobtap remove plugin.so
after it is complete.
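Aside: to get the integer representation of a jobid, `flux job id --to=dec JOBID` should do the conversion. For anyone curious what that conversion actually involves, the f58 form is (to my understanding of RFC 19) just the integer jobid base58-encoded with the Bitcoin alphabet, behind an 'f' (or 'ƒ') prefix. A rough Python sketch, for illustration only (use flux job id in practice):

```python
# Illustrative f58 <-> integer jobid conversion (per my reading of
# Flux RFC 19). Not the flux implementation; just a sketch.

# Bitcoin base58 alphabet (no 0, O, I, or l)
B58 = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def f58_decode(s: str) -> int:
    """Return the integer jobid encoded by an f58 string like 'f2cnGN8MBAu5'."""
    body = s[1:] if s and s[0] in ("f", "\u0192") else s  # drop prefix
    n = 0
    for c in body:
        n = n * 58 + B58.index(c)
    return n

def f58_encode(n: int) -> str:
    """Encode an integer jobid back into f58 form with an ASCII 'f' prefix."""
    digits = ""
    while n > 0:
        n, r = divmod(n, 58)
        digits = B58[r] + digits
    return "f" + (digits or B58[0])
```

For example, f58_decode("f2cnGN8MBAu5") yields the decimal form suitable for the jobs=[...] plugin argument, and f58_encode() round-trips it back.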
Of course, I could not test this since I don't have any stuck jobs handy...
#include <jansson.h>
#include <flux/core.h>
#include <flux/jobtap.h>

static int post_events (flux_t *h,
                        flux_plugin_t *p,
                        json_t *jobs,
                        const char *description)
{
    size_t index;
    json_t *entry;

    /* For each configured jobid, post a synthetic epilog-finish event */
    json_array_foreach (jobs, index, entry) {
        flux_jobid_t id;
        if (!json_is_integer (entry)) {
            flux_log_error (h,
                            "invalid jobid at index %zu: integer required",
                            index);
            return -1;
        }
        id = json_integer_value (entry);
        if (flux_jobtap_epilog_finish (p, id, description, 0) < 0)
            flux_log_error (h,
                            "failed to post epilog-finish event for %ju",
                            (uintmax_t) id);
    }
    return 0;
}

int flux_plugin_init (flux_plugin_t *p)
{
    json_t *jobs;
    const char *description = "job-manager.epilog";
    flux_t *h = flux_jobtap_get_flux (p);

    /* "description" is optional; "jobs" is required */
    if (flux_plugin_conf_unpack (p,
                                 "{s?s s:o}",
                                 "description", &description,
                                 "jobs", &jobs) < 0) {
        flux_log_error (h, "no jobids provided");
        return -1;
    }
    if (!json_is_array (jobs)) {
        flux_log_error (h, "jobs conf value must be an array");
        return -1;
    }
    return post_events (h, p, jobs, description);
}
Nice! I'll run that on corona in a bit when I'm back online.
FYI, I just ran the above jobtap plugin against all CLEANUP jobs on Corona and they're all inactive now, e.g.:
# flux job eventlog f2cx7pwPP9CK
1700501129.045566 submit userid=60943 urgency=16 flags=0 version=1
1700501129.063712 validate
1700501129.077765 depend
1700501129.077866 priority priority=24028
1700501129.112807 alloc
1700501129.116187 prolog-start description="job-manager.prolog"
1700501129.116215 prolog-start description="cray-pals-port-distributor"
1700501129.374213 cray_port_distribution ports=[11979,11978] random_integer=-4201776475281207022
1700501129.374432 prolog-finish description="cray-pals-port-distributor" status=0
1700501130.060089 prolog-finish description="job-manager.prolog" status=0
1700501130.067440 start
1700501130.366923 memo uri="ssh://corona171/var/tmp/hobbs17/flux-V1b6Nu/local-0"
1700501137.525261 finish status=0
1700501137.526100 epilog-start description="job-manager.epilog"
1700501137.793074 release ranks="all" final=true
1700510565.815566 exception type="scheduler-restart" severity=0 note="failed to reallocate R for running job" userid=765
1701101824.652934 epilog-finish description="job-manager.epilog" status=0
1701101824.653028 clean
Closed by #5848?
Probably not; that was just the addition of a utility to repair this state.
We can consider this resolved when, after a restart, any pending epilog processes can be recaptured and tracked to completion, or restarted as necessary.
Problem: the corona management node was shut down abruptly. Afterwards,
flux jobs
shows several jobs in C state. For example, the job eventlog says: