distribworks / dkron

Dkron - Distributed, fault tolerant job scheduling system https://dkron.io
GNU Lesser General Public License v3.0
4.28k stars 379 forks source link

Shell Jobs randomly getting 'stuck' #1483

Open giggsey opened 7 months ago

giggsey commented 7 months ago

Describe the bug

Some shell jobs are randomly getting stuck in the 'busy' state, and don't clear until we restart the dkron agent.

The job processes launched by dkron agent are finishing as expected and exiting cleanly. Dkron doesn't notice, so the job stays in the 'busy' status, and further executions don't happen (we prevent concurrency).

We recently upgraded from 3.2.0 to 3.2.7, and this problem is now happening randomly with some of our jobs.

Job Description

{

    "id": "cloud_sync_user-data_private_app-server",
    "name": "cloud_sync_user-data_private_app-server",
    "displayname": "",
    "timezone": "",
    "schedule": "@every 5m",
    "owner": "cloud_sync",
    "owner_email": "",
    "success_count": 134712,
    "error_count": 506,
    "last_success": "2024-02-22T09:50:48.739347143Z",
    "last_error": "2024-02-22T09:40:03.273059208Z",
    "disabled": false,
    "tags": {
        "my_tag": "private_app-server:1"
    }

    ,
    "metadata": null,
    "retries": 0,
    "dependent_jobs": null,
    "parent_job": "",
    "processors": {}

    ,
    "concurrency": "forbid",
    "executor": "shell",
    "executor_config": {
        "command": "/usr/local/bin/dkron_run 0002 \"my-daemons\" \"/var/log/dkron-jobs/cloud_sync_user-data_private_app-server.log\" \"PATH=/sbin:/bin:/usr/sbin:/usr/bin /path/cloud_sync/bin/sync user-data -vvv\""
    }

    ,
    "status": "success",
    "next": "2024-02-22T09:58:52Z",
    "ephemeral": false,
    "expires_at": null
}

Logs

The job took around ~15 seconds to execute (I can see from the logs that my dkron_run script generates).

2024-02-22T01:48:52.020093+00:00 HR-P-APP-2 dkron[1119]: time="2024-02-22T01:48:52Z" level=info msg="grpc_agent: Starting job" job=cloud_sync_user-data_private_app-server node=HR-P-APP-2.production.my.network
2024-02-22T01:48:52.020272+00:00 HR-P-APP-2 dkron[1119]: time="2024-02-22T01:48:52Z" level=debug msg="grpc_agent: calling executor plugin" node=HR-P-APP-2.production.my.network plugin=shell
2024-02-22T01:48:52.020993+00:00 HR-P-APP-2 dkron[1119]: time="2024-02-22T01:48:52Z" level=debug msg="2024/02/22 01:48:52 shell: going to run /usr/local/bin/dkron_run 0002 \"my-daemons\" \"/var/log/dkron-jobs/cloud_sync_user-data_private_app-server.log\" \"PATH=/sbin:/bin:/usr/sbin:/usr/bin /path/cloud_sync/bin/sync user-data -vvv\"" node=HR-P-APP-2.production.my.network subsystem_name=plugins.dkron-executor-shell

2024-02-22T01:49:07.300497+00:00 HR-P-APP-2 dkron[1119]: time="2024-02-22T01:49:07Z" level=debug msg="2024/02/22 01:49:07 NewProcess err: process does not exist" node=HR-P-APP-2.production.my.network subsystem_name=plugins.dkron-executor-shell
2024-02-22T01:49:07.300664+00:00 HR-P-APP-2 dkron[1119]: time="2024-02-22T01:49:07Z" level=debug msg="2024/02/22 01:49:07 Error getting pid statistics: process does not exist" node=HR-P-APP-2.production.my.network subsystem_name=plugins.dkron-executor-shell

Logs for a good job that ran at around the same time. This other job took less than a second to execute.

2024-02-22T01:48:53.019439+00:00 HR-P-APP-2 dkron[1119]: time="2024-02-22T01:48:53Z" level=info msg="grpc_agent: Starting job" job=cloud_sync-alert-data_private_app-server node=HR-P-APP-2.production.my.network
2024-02-22T01:48:53.020427+00:00 HR-P-APP-2 dkron[1119]: time="2024-02-22T01:48:53Z" level=debug msg="grpc_agent: calling executor plugin" node=HR-P-APP-2.production.my.network plugin=shell
2024-02-22T01:48:53.021941+00:00 HR-P-APP-2 dkron[1119]: time="2024-02-22T01:48:53Z" level=debug msg="2024/02/22 01:48:53 shell: going to run /usr/local/bin/dkron_run 0002 \"my-daemons\" \"/var/log/dkron-jobs/cloud_sync-alert-data_private_app-server.log\" \"PATH=/sbin:/bin:/usr/sbin:/usr/bin /path/cloud_sync/bin/sync alert-data -vvv\"" node=HR-P-APP-2.production.my.network subsystem_name=plugins.dkron-executor-shell
2024-02-22T01:48:56.043127+00:00 HR-P-APP-2 dkron[1119]: time="2024-02-22T01:48:56Z" level=debug msg="2024/02/22 01:48:56 NewProcess err: process does not exist" node=HR-P-APP-2.production.my.network subsystem_name=plugins.dkron-executor-shell
2024-02-22T01:48:56.043387+00:00 HR-P-APP-2 dkron[1119]: time="2024-02-22T01:48:56Z" level=debug msg="2024/02/22 01:48:56 Error getting pid statistics: process does not exist" node=HR-P-APP-2.production.my.network subsystem_name=plugins.dkron-executor-shell
2024-02-22T01:48:56.043548+00:00 HR-P-APP-2 dkron[1119]: time="2024-02-22T01:48:56Z" level=debug msg="2024/02/22 01:48:56 shell: Command output " node=HR-P-APP-2.production.my.network subsystem_name=plugins.dkron-executor-shell
2024-02-22T01:48:57.591651+00:00 HR-P-APP-2 dkron[1119]: time="2024-02-22T01:48:57Z" level=info msg="2024/02/22 01:48:57 [DEBUG] memberlist: Initiating push/pull sync with: HR-P-SQL-3.production.my.network 192.168.181.23:8946"

Specifications

morgan-atproperties commented 6 months ago

We are seeing the same thing. We were running 3.2.6 previously, and we're going to roll back to that version.

mladoi commented 6 months ago

Having the same issue with 3.2.7. Job is merely calling a redis-cli publish check 1 every 5 minutes. Approx. 1 in 1000 job executions get stuck in busy state forever. Even setting timeout in executor_config doesn't change the busy state.

morgan-atproperties commented 6 months ago

Rolling back to 3.2.6 fixed it for us.