Open mzeis opened 10 years ago
I also had this problem. I tried to set the 'status' column of this row in the database (table ) to 'error' to kinda force it to skip it and move onto the next job, however my cron now doesn't execute ANY tasks at all.
Any help here would be great.
Try to delete the job. If we do that the execution works again. Unfortunately I couldn't narrow down the problem yet. I'm considering reverting to an earlier version of Aoe_Scheduler or automatically deleting jobs running longer than x minutes.
Hi, we got the same problem here, but more frequently. Seems like the $schedule->runNow() function is aborting, after calling $this->tryLockJob(). Maybe after a timeout?! However I added a cleanup for this case based on the existing fallback cleanups in the checkRunningJobs function of the observer:
$schedules = Mage::getModel('cron/schedule')->getCollection()
->addFieldToFilter('status', Mage_Cron_Model_Schedule::STATUS_RUNNING)
->addFieldToFilter('last_seen', array('null' => true))
->addFieldToFilter('host', array('null' => true))
->addFieldToFilter('pid', array('null' => true))
->addFieldToFilter('scheduled_at', array('lt' => strftime('%Y-%m-%d %H:%M:00', $maxAge)))
->load();
foreach ($schedules->getIterator() as $schedule) {
/* @var $schedule Aoe_Scheduler_Model_Schedule */
$schedule->setLastSeen(strftime('%Y-%m-%d %H:%M:%S', time()));
$schedule->markAsDisappeared(sprintf('Process "%s" (id: %s) cannot be found anymore', $schedule->getJobCode(), $schedule->getId()));
}
Hi,
@fbrnc, can you consider including the cleanup code by @robinfritze? It would help if these jobs get cleaned up.
Committed (4787b39). Sorry for the delay and thanks @robinfritze for the contribution. Please test properly since I haven't create testdata to reproduce and test this odd situation :)
Hi. Thanks for committing. I recently updated the fix and removed the executed_at filter from the clause, since there were stuck processes with values at executed_at but not at last_seen. Sorry for not commenting on that. Can you include that fix?
Hi @robinfritze, sure I'd be happy to. Would you mind creating a quick pull request? That makes it so much easier :) Otherwise I can do it, but it will take a couple of extra days since I'll be out of office starting from this afternoon (PDT). Cheers, Fabrizio
Hi @fbrnc. I just created a pull request: https://github.com/AOEpeople/Aoe_Scheduler/pull/51
Hi @fbrnc,
do you know when the task_records
branch or the patch by @robinfritze will make it into a stable release? :-) We updated to EE 1.14.1.0 last week and experience this problem with the new core_email_queue_send_all
job daily since then.
Ciao Matthias
Hi Matthias,
@LeeSaferite and I have been discussing this today and we have some important changes/fixes that we want to get in before we merge and release a new version. Our plan is: Next Friday. And I'm committed to making this happen :)
That's awesome, thanks!
@mzeis: v1.0.0 release candidate is out. Let me know if the included patch works fine and this issue can be closed.
btw, core_email_queue_send_all
should rather be an always
task instead of */1 * * * *
. I suggest editing it and changing the schedule. (Don't know if that is related to this problem or not).
(@see https://github.com/AOEpeople/Aoe_Scheduler/blob/version1/doc/declutter-timeline.md)
@fbrnc: Thanks, I'll try that switching core_email_queue_send_all
. I've decided to separate groups for the always
calls core_email_queue_send_all
and enterprise_refresh_index
to avoid blocking between them. Is this good/bad/doesn't matter?
By the way: the fix by @robinfritze helped when I ported the changes back to the 0.x branch two weeks away so it should also work for 1.0.x.
Please keep in mind though that this fix only removes jobs that have been created but it doesn't help avoiding the creation of these jobs (i.e. it fights the symptoms, not the cause).
I'd like to leave the issue open for a bit if possible. When 1.0.0 is stable and we are running the new version in production I can tell you if the problem is gone in 1.0.x because we only experience the problem in production.
This change seems to be marking jobs that take longer than XML_PATH_MARK_AS_ERROR and are having their first run (no last_seen set yet) as gone - is there any other way you can determine that process is actually gone other than checking if last_seen is set?
Hi,
from time to time (about every 4 weeks) a job hangs on status "Running". It was never executed and thus has no executed, finished or last seen timestamp as well as no pid and host:
(Btw: the screenshot is particularly interesting as the job was created two hours in the future, see the Created at timestamp. This never has happened before, neither with this job "type" nor with any else. Now is the first time. We had hanging jobs with the "Running" status and correct timestamp before. Also, other jobs generated at the same time had the correct timestamp.)
The job doesn't get marked as failed, missed or something similar bei the cleanup procedures and thus blocks further executions of this job "type". There are no errors in the logs / reports.
We're using commit ddd9dafe2e88c49d01eee05b31f313ce4d5747df. The schedule is generated dynamically by a third-party extension by inserting XML nodes in the Magento XML config tree on-the-fly.
Please don't hesitate to ask if you need more details.