ajvb / kala

Modern Job Scheduler
MIT License
2.13k stars 188 forks source link

Very long running job - next schedule time calculation issue #149

Closed shunte88 closed 6 years ago

shunte88 commented 7 years ago

I have a job that ran for some 3 months at the correctly allocated time The job then ran long, so long it likely overlapped itself and dead-locked itself running a database process The issue was observed and the affected processes killed While chasing a request for data this AM we reviewed the schedule definition and found that the start time had moved Reviewing the log that we develop for the process this new scheduled time is the same hour & minute as when the job was killed

I'm going to surmise that the the calculation that is performed to determine the next run time threw a glitch because the very long process time - in excess of 24 hours, actually potentially days before the solution caught up

Obviously this is an extreme example but hopefully this information is of use

runtime stats - observe the time change

{ "job_id": "85a9ad27-c179-42df-4c0d-2bc16012602e", "ran_at": "2017-01-29T23:30:00.002918331Z", "number_of_retries": 0, "success": true, "execution_duration": 169104023611 }, { "job_id": "85a9ad27-c179-42df-4c0d-2bc16012602e", "ran_at": "2017-01-31T23:30:00.006173233Z", "number_of_retries": 0, "success": true, "execution_duration": 141530815675788 }, { "job_id": "85a9ad27-c179-42df-4c0d-2bc16012602e", "ran_at": "2017-02-02T14:48:50.821937456Z", "number_of_retries": 0, "success": false, "execution_duration": 2155156376784 }, { "job_id": "85a9ad27-c179-42df-4c0d-2bc16012602e", "ran_at": "2017-02-03T14:48:50.822134093Z", "number_of_retries": 0, "success": true, "execution_duration": 6163850644698 }, { "job_id": "85a9ad27-c179-42df-4c0d-2bc16012602e", "ran_at": "2017-02-04T14:48:50.822309419Z", "number_of_retries": 0, "success": true, "execution_duration": 1148506647346 }, { "job_id": "85a9ad27-c179-42df-4c0d-2bc16012602e", "ran_at": "2017-02-05T14:48:50.822477233Z", "number_of_retries": 0, "success": true, "execution_duration": 1063678685264 },

application log showing the deadlocks and reschedule glitch! 2017/01/31 23:30:00 /home/****/gocode/src/gitlab.****.com/****/****-aws-post/dd_post_process --module=OFFERAUTO >> /h$ 2017/01/31 23:30:00 dd_post_process.go:62: Promotion Module ..........: OFFERAUTO 2017/01/31 23:30:00 dbutil.go:1997: Active Functions Found ....: 1 2017/01/31 23:30:00 dbutil.go:2018: Function ..................: offers.offer_result_dyn_auto() 2017/02/02 14:48:50 dd_post_process.go:208: Fatal error, abending: ERROR: canceling statement due to user request (SQLSTATE 57014) Pinging the bosun server, http://spider.****.com:8070 at 2017-02-02 14:48:50.531523315 +0000 UTC with {"metric":"job.exit_status","timesta$ 2017/02/02 14:48:50 /home/****/gocode/src/gitlab.****.com/****/****-aws-post/dd_post_process --module=OFFERAUTO >> /h$ 2017/02/02 14:48:50 dd_post_process.go:62: Promotion Module ..........: OFFERAUTO 2017/02/02 14:48:50 dbutil.go:1997: Active Functions Found ....: 1 2017/02/02 14:48:50 dbutil.go:2018: Function ..................: offers.offer_result_dyn_auto() 2017/02/03 14:48:51 /home/****/gocode/src/gitlab.****.com/****/****-aws-post/dd_post_process --module=OFFERAUTO >> /ho$ 2017/02/03 14:48:51 dd_post_process.go:62: Promotion Module ..........: OFFERAUTO 2017/02/03 14:48:51 dbutil.go:1997: Active Functions Found ....: 1 2017/02/03 14:48:51 dbutil.go:2018: Function ..................: offers.offer_result_dyn_auto() 2017/02/03 16:31:34 dd_post_process.go:208: Fatal error, abending: ERROR: canceling statement due to user request (SQLSTATE 57014) Pinging the bosun server, http://spider.****.com:8070 at 2017-02-03 16:31:34.594741463 +0000 UTC with {"metric":"job.exit_status","timesta$ 2017/02/04 14:48:50 /home/****/gocode/src/gitlab.****.com/****/****-aws-post/dd_post_process --module=OFFERAUTO >> /ho$ 2017/02/04 14:48:50 dd_post_process.go:62: Promotion Module ..........: OFFERAUTO 2017/02/04 14:48:50 dbutil.go:1997: Active Functions Found ....: 1 2017/02/04 14:48:50 dbutil.go:2018: Function ..................: offers.offer_result_dyn_auto() 2017/02/04 15:07:59 dd_post_process.go:201: Done. 19m8.421136637s elapsed Pinging the bosun server, http://spider.****.com:8070 at 2017-02-04 15:07:59.263383559 +0000 UTC with {"metric":"job.exit_status","timesta$

gwoo commented 6 years ago

Should be fixed by #174.