Nextdoor / ndscheduler

A flexible python library for building your own cron-like system, with REST APIs and a Web UI.
BSD 2-Clause "Simplified" License

Jobs stuck at "running" #86

Open mpermana opened 4 years ago

mpermana commented 4 years ago

This can happen when the ndscheduler process dies while a job is running.

To reproduce: create a shell job that sleeps for a while, e.g. ["bash","-c","sleep 3600"]

While the job is running, send a kill signal to the ndscheduler process. The next time ndscheduler starts, the job will be stuck at "running".

palto42 commented 4 years ago

I can confirm this behavior. What is needed is a database cleanup when ndscheduler starts that changes the status of those jobs to "failed", since they most likely did not complete.
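A minimal sketch of such a startup cleanup, using Python's built-in sqlite3 module. The table name `executions` and the columns `state` and `description` are hypothetical and chosen only for illustration; they are not ndscheduler's actual schema, and this is not the code from any PR.

```python
import sqlite3


def fail_interrupted_executions(conn, description="Interrupted: scheduler stopped while the job was running"):
    """Mark every execution still recorded as 'running' as 'failed'.

    Assumes a hypothetical table `executions(id, state, description)`.
    Returns the number of rows that were updated.
    """
    cur = conn.execute(
        "UPDATE executions SET state = 'failed', description = ? "
        "WHERE state = 'running'",
        (description,),
    )
    conn.commit()
    return cur.rowcount


def demo():
    """Build an in-memory database, run the cleanup, and return the resulting states."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE executions (id TEXT PRIMARY KEY, state TEXT, description TEXT)"
    )
    conn.executemany(
        "INSERT INTO executions VALUES (?, ?, ?)",
        [("a", "running", ""), ("b", "succeeded", ""), ("c", "running", "")],
    )
    changed = fail_interrupted_executions(conn)
    states = dict(conn.execute("SELECT id, state FROM executions"))
    conn.close()
    return changed, states
```

The idea is to run this once at startup, before the scheduler begins dispatching jobs, so that nothing legitimately running can be marked as failed by mistake.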

palto42 commented 3 years ago

I submitted PR #90, which cleans such interrupted executions from the database.

In my case, the interruption was caused by running ndscheduler from a systemd unit, which sends SIGTERM on stop/restart rather than the SIGINT that ndscheduler expects. One option is to change the stop signal used by the systemd unit to SIGINT, which ensures a graceful stop of ndscheduler. Another is to handle SIGTERM in server.py alongside the existing SIGINT handler.
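The second alternative could look roughly like the sketch below, which registers one handler for both signals using Python's standard signal module. The function `install_stop_handlers` and the `stop_callback` argument are hypothetical names for illustration; how ndscheduler's server.py actually shuts down is not shown here.

```python
import signal


def install_stop_handlers(stop_callback):
    """Register the same graceful-stop handler for SIGINT and SIGTERM.

    `stop_callback` is a hypothetical zero-argument function that performs
    the graceful shutdown. Must be called from the main thread, because
    signal.signal() only works there. Returns the installed handler.
    """
    def _handler(signum, frame):
        # Both signals take the same shutdown path.
        stop_callback()

    for sig in (signal.SIGINT, signal.SIGTERM):
        signal.signal(sig, _handler)
    return _handler
```

For the first alternative, systemd's `KillSignal=` setting in the unit's `[Service]` section controls which signal is sent on stop.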