AOEpeople / Aoe_Scheduler

Cron Scheduler Module for Magento
http://fbrnc.net/blog/2011/03/magento-cron-schedule
GNU General Public License v3.0
373 stars 202 forks source link

Cronjob which was never executed hanging on status "Running" #40

Open mzeis opened 10 years ago

mzeis commented 10 years ago

Hi,

from time to time (about every 4 weeks) a job hangs on status "Running". It was never executed and thus has no executed, finished or last seen timestamp as well as no pid and host:

cronjob_timeline_view_2014-09-22_1231

(Btw: the screenshot is particularly interesting as the job was created two hours in the future, see the Created at timestamp. This never has happened before, neither with this job "type" nor with any else. Now is the first time. We had hanging jobs with the "Running" status and correct timestamp before. Also, other jobs generated at the same time had the correct timestamp.)

The job doesn't get marked as failed, missed or something similar bei the cleanup procedures and thus blocks further executions of this job "type". There are no errors in the logs / reports.

We're using commit ddd9dafe2e88c49d01eee05b31f313ce4d5747df. The schedule is generated dynamically by a third-party extension by inserting XML nodes in the Magento XML config tree on-the-fly.

Please don't hesitate to ask if you need more details.

montychristo commented 10 years ago

I also had this problem. I tried to set the 'status' column of this row in the database (table ) to 'error' to kinda force it to skip it and move onto the next job, however my cron now doesn't execute ANY tasks at all.

screenshot 2014-09-29 11 55 08

Any help here would be great.

mzeis commented 10 years ago

Try to delete the job. If we do that the execution works again. Unfortunately I couldn't narrow down the problem yet. I'm considering reverting to an earlier version of Aoe_Scheduler or automatically deleting jobs running longer than x minutes.

robinfritze commented 9 years ago

Hi, we got the same problem here, but more frequently. Seems like the $schedule->runNow() function is aborting, after calling $this->tryLockJob(). Maybe after a timeout?! However I added a cleanup for this case based on the existing fallback cleanups in the checkRunningJobs function of the observer:

        $schedules = Mage::getModel('cron/schedule')->getCollection()
            ->addFieldToFilter('status', Mage_Cron_Model_Schedule::STATUS_RUNNING)
            ->addFieldToFilter('last_seen', array('null' => true))
            ->addFieldToFilter('host', array('null' => true))
            ->addFieldToFilter('pid', array('null' => true))
            ->addFieldToFilter('scheduled_at', array('lt' => strftime('%Y-%m-%d %H:%M:00', $maxAge)))
            ->load();

        foreach ($schedules->getIterator() as $schedule) {
            /* @var $schedule Aoe_Scheduler_Model_Schedule */
            $schedule->setLastSeen(strftime('%Y-%m-%d %H:%M:%S', time()));
            $schedule->markAsDisappeared(sprintf('Process "%s" (id: %s) cannot be found anymore', $schedule->getJobCode(), $schedule->getId()));
        }
mzeis commented 9 years ago

Hi,

@fbrnc, can you consider including the cleanup code by @robinfritze? It would help if these jobs get cleaned up.

fbrnc commented 9 years ago

Committed (4787b39). Sorry for the delay and thanks @robinfritze for the contribution. Please test properly since I haven't create testdata to reproduce and test this odd situation :)

robinfritze commented 9 years ago

Hi. Thanks for committing. I recently updated the fix and removed the executed_at filter from the clause, since there were stuck processes with values at executed_at but not at last_seen. Sorry for not commenting on that. Can you include that fix?

fbrnc commented 9 years ago

Hi @robinfritze, sure I'd be happy to. Would you mind creating a quick pull request? That makes it so much easier :) Otherwise I can do it, but it will take a couple of extra days since I'll be out of office starting from this afternoon (PDT). Cheers, Fabrizio

robinfritze commented 9 years ago

Hi @fbrnc. I just created a pull request: https://github.com/AOEpeople/Aoe_Scheduler/pull/51

mzeis commented 9 years ago

Hi @fbrnc,

do you know when the task_records branch or the patch by @robinfritze will make it into a stable release? :-) We updated to EE 1.14.1.0 last week and experience this problem with the new core_email_queue_send_all job daily since then.

Ciao Matthias

fbrnc commented 9 years ago

Hi Matthias,

@LeeSaferite and I have been discussing this today and we have some important changes/fixes that we want to get in before we merge and release a new version. Our plan is: Next Friday. And I'm committed to making this happen :)

mzeis commented 9 years ago

That's awesome, thanks!

fbrnc commented 9 years ago

@mzeis: v1.0.0 release candidate is out. Let me know if the included patch works fine and this issue can be closed.

btw, core_email_queue_send_all should rather be an always task instead of */1 * * * *. I suggest editing it and changing the schedule. (Don't know if that is related to this problem or not).

(@see https://github.com/AOEpeople/Aoe_Scheduler/blob/version1/doc/declutter-timeline.md)

mzeis commented 9 years ago

@fbrnc: Thanks, I'll try that switching core_email_queue_send_all. I've decided to separate groups for the always calls core_email_queue_send_all and enterprise_refresh_index to avoid blocking between them. Is this good/bad/doesn't matter?

mzeis commented 9 years ago

By the way: the fix by @robinfritze helped when I ported the changes back to the 0.x branch two weeks away so it should also work for 1.0.x.

Please keep in mind though that this fix only removes jobs that have been created but it doesn't help avoiding the creation of these jobs (i.e. it fights the symptoms, not the cause).

I'd like to leave the issue open for a bit if possible. When 1.0.0 is stable and we are running the new version in production I can tell you if the problem is gone in 1.0.x because we only experience the problem in production.

drobinson commented 9 years ago

This change seems to be marking jobs that take longer than XML_PATH_MARK_AS_ERROR and are having their first run (no last_seen set yet) as gone - is there any other way you can determine that process is actually gone other than checking if last_seen is set?