job-exec: add total time waited for a job in drain message for unkillable processes

flux-framework / flux-core

core services for the Flux resource management framework

GNU Lesser General Public License v3.0


job-exec: add total time waited for a job in drain message for unkillable processes #6376

Open grondo opened 1 month ago

grondo commented 1 month ago

Problem: The job-exec module drains nodes with what it considers "unkillable" processes after max-kill-count attempts to terminate the job shell have been made. However, it is difficult for admins to determine how long that actually took, because the module uses an exponential backoff, up to a maximum of 300s, when retrying to kill the job shell.

Consider including the total time waited before draining the nodes in the drain message, for reference.
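To illustrate why the total is hard to work out by hand, here is a rough sketch of the arithmetic for a capped exponential backoff. It is not the job-exec implementation: the function name, the doubling factor, the 5s initial timeout, the count of 8 attempts, and the assumption that one delay follows each kill attempt are all hypothetical, chosen only for the example. Only the 300s cap comes from the description above.

```python
def total_kill_wait(initial_timeout, max_kill_count, cap=300.0, factor=2.0):
    """Sum the delays of a capped exponential backoff.

    Assumes (hypothetically) that one delay follows each kill attempt,
    that the delay is multiplied by `factor` after every attempt, and
    that no single delay exceeds `cap` seconds.
    """
    total = 0.0
    delay = initial_timeout
    for _ in range(max_kill_count):
        total += min(delay, cap)  # each individual wait is capped
        delay *= factor           # the uncapped delay keeps growing
    return total


if __name__ == "__main__":
    # Hypothetical values: 5s initial timeout, 8 kill attempts.
    waited = total_kill_wait(5.0, 8)
    print(f"total time waited before drain: {waited:.0f}s")
```

With these illustrative values the individual waits are 5, 10, 20, 40, 80, 160, 300 and 300 seconds, so the total is not obvious from max-kill-count alone. Having the module report the accumulated time directly would save admins from redoing this calculation.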