Tivix / django-cron

Write cron business logic as a Python class and let this app do the rest! It enables Django projects to schedule cron tasks, tracks their success / failures, manages contention (via a cache) etc. Basically takes care of all the boring work for you :-)

Parallel cronjobs spawned without configuring #133

Closed: KaiserKarel closed 2 years ago

KaiserKarel commented 6 years ago

Today django-cron kicked off a runcrons run and then, while that run was still going, started another instance, and then another. It seems like there is something wrong with the cache lock?

Looking at my cronjob logs, it could also be that django-cron released the lock before the job actually finished.
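For context, a cache-based lock of the kind django-cron relies on generally boils down to an atomic cache.add() with an expiry, so a lock can vanish early if the key times out or is evicted. A minimal sketch of the pattern (not django-cron's actual code; the names and timeout here are illustrative):

from django.core.cache import cache

LOCK_TIMEOUT = 60 * 60  # assumption: the lock key expires after an hour

def acquire(lock_name):
    # cache.add() sets the key only if it does not already exist and
    # returns False otherwise, so it acts as an atomic test-and-set.
    return cache.add(lock_name, 'locked', LOCK_TIMEOUT)

def release(lock_name):
    cache.delete(lock_name)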

My code looks something like this:

from django_cron import CronJobBase, Schedule
import functions

class Foo(CronJobBase):
    RUN_EVERY_MINS = 360
    schedule = Schedule(run_every_mins=RUN_EVERY_MINS)
    code = 'bar.foo'

    def do(self):
        functions.do_something_for_5_mins()
        functions.do_another_thing_for_1_hour()

It seems that after do_something_for_5_mins(), django-cron removed the lock and allowed another instance to run?

I thought I'd attempt a custom lock:

from django.db import transaction

class ConcurrencyError(Exception):
    pass

@transaction.atomic
def set_lock_payment_run(name):
    # update() returns the number of rows changed; 0 means the lock was held
    lock = app.models.Lock.objects.select_for_update().filter(
        job_name=name, lock=False).update(lock=True)
    if not lock:
        logger.fatal('lock was True while cron run was initiated.')
        raise ConcurrencyError

and

@transaction.atomic
def release_lock_payment_run(name):
    lock = app.models.Lock.objects.select_for_update().filter(
        job_name=name, lock=True).update(lock=False)
    if not lock:
        logger.fatal('lock was False while cron run was ended.')
        raise ConcurrencyError

These locks can be used to avoid parallel runs.
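For example, a job that must never overlap can bracket its body with these helpers; the try/finally ensures the lock is released even when the body raises (run_payments is a hypothetical job body):

from django_cron import CronJobBase, Schedule

class PaymentRun(CronJobBase):
    RUN_EVERY_MINS = 360
    schedule = Schedule(run_every_mins=RUN_EVERY_MINS)
    code = 'payments.run'

    def do(self):
        set_lock_payment_run(self.code)  # raises ConcurrencyError if held
        try:
            run_payments()               # hypothetical job body
        finally:
            release_lock_payment_run(self.code)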

KaiserKarel commented 6 years ago

So last night django-cron tried to do a couple of consecutive runs with my lock still turned on. If you've got jobs that absolutely cannot run in parallel, I recommend writing your own lock.

petrprikryl commented 6 years ago

The same was happening to me. It looks like we had a small maxmemory in Redis (1 GB), and the locks were evicted once used_memory filled up.
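In other words, if the cache that backs the lock is allowed to evict keys under memory pressure, a lock can silently disappear mid-run. A sketch of the relevant configuration, assuming django-redis as the cache backend (the location and sizes are illustrative):

# settings.py -- the cache that django-cron's lock is written into
CACHES = {
    'default': {
        'BACKEND': 'django_redis.cache.RedisCache',
        'LOCATION': 'redis://127.0.0.1:6379/1',
    }
}

# redis.conf -- with an eviction policy such as allkeys-lru, lock keys can
# be dropped once maxmemory is reached; noeviction makes writes fail loudly
# instead:
#   maxmemory 1gb
#   maxmemory-policy noeviction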

OndraRehounek commented 6 years ago

I am not sure if I understood @AbeIgnocious 's solution right, but I was missing handling for the case where the Lock has not been created yet, so I solved it this way:

functions.py

from django.db import transaction

from .models import CronJobLock

@transaction.atomic
def set_lock(name):
    # get_or_create covers the case where no lock row exists yet;
    # the 'locked' field defaults to False
    lock, created = CronJobLock.objects.get_or_create(job_name=name)
    if lock.locked:
        print('ERROR: Lock was True while cron run was initiated.')
        raise RuntimeError('Lock was True while cron run was initiated.')
    else:
        lock.locked = True
        lock.save()
        print('Lock locked on ' + name)

@transaction.atomic
def release_lock(name):
    lock = CronJobLock.objects.select_for_update().filter(
        job_name=name, locked=True).update(locked=False)
    if not lock:
        print('WARNING: Lock was False while cron run was ended.')
        raise RuntimeError('Lock was False while cron run was ended.')
    else:
        print('Lock released on ' + name)

models.py

from django.db import models

class CronJobLock(models.Model):
    created_at = models.DateTimeField(auto_now_add=True)
    updated_at = models.DateTimeField(auto_now=True)
    job_name = models.CharField(max_length=64, unique=True)
    locked = models.BooleanField(default=False)

I am not sure why this is necessary, but if I read the documentation correctly, the Django cache must be enabled for the project, and that is not our case, so I used this workaround. I have to call the set and release functions at the start and at the end of every cron job that cannot run in parallel.
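One way to avoid repeating that boilerplate in every job is a small decorator around do() (a sketch using the set_lock/release_lock helpers above; the decorator and job names are made up):

import functools

from django_cron import CronJobBase, Schedule

def non_parallel(do_method):
    # Runs the job body only while holding the database lock, and
    # releases the lock even if the body raises.
    @functools.wraps(do_method)
    def wrapper(self, *args, **kwargs):
        set_lock(self.code)
        try:
            return do_method(self, *args, **kwargs)
        finally:
            release_lock(self.code)
    return wrapper

class NightlyJob(CronJobBase):
    RUN_EVERY_MINS = 60
    schedule = Schedule(run_every_mins=RUN_EVERY_MINS)
    code = 'app.nightly_job'

    @non_parallel
    def do(self):
        ...  # the part that must not run in parallel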

agronick commented 6 years ago

Appears something broke in a recent update. We sent dozens of emails to the same users telling them their account would expire.

kukosk commented 6 years ago

What caching backend are you guys using? It could be that you're using LocMemCache, which is per-process.
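For reference, LocMemCache is Django's default when no CACHES setting is defined, and every process gets its own private copy, so a lock taken in one worker is invisible to the others. A cross-process lock needs a shared backend; a sketch of the two configurations:

# Per-process: each web/cron worker sees its own cache, so a cache-based
# lock will NOT prevent parallel runs across processes or servers.
CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.locmem.LocMemCache',
    }
}

# Shared: every process talks to the same memcached, so the lock is global.
CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.memcached.MemcachedCache',
        'LOCATION': '127.0.0.1:11211',
    }
}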

kaulgud commented 5 years ago

I ran into a similar issue recently. In our case the job gets triggered from multiple web servers in production at exactly the same time. Our caching backend is django.core.cache.backends.db.DatabaseCache, and all web servers share a common cache database (SQL Server). The issue with DatabaseCache is that it doesn't have a database-level unique key constraint on the cache key. Django handles cache-key uniqueness in code, so we often ran into race conditions.

We ended up creating a custom lock.

from django.db import IntegrityError, transaction
from django_cron.backends.lock.base import DjangoCronJobLock

class CustomCronJobLock(DjangoCronJobLock):

    def lock(self):
        try:
            with transaction.atomic():
                # The primary key on job_name makes this a database-enforced
                # mutex: a second concurrent create() raises IntegrityError.
                CronJobLock.objects.create(job_name=self.job_name)
                return True
        except IntegrityError:
            return False

    def release(self):
        with transaction.atomic():
            CronJobLock.objects.filter(job_name=self.job_name).delete()

class CronJobLock(models.Model):
    job_name = models.CharField(max_length=255, primary_key=True)

The job_name column has a primary key constraint in the database.
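To make django-cron use this lock class, point the DJANGO_CRON_LOCK_BACKEND setting at it (the dotted path below is a hypothetical location for the class above):

# settings.py -- default is 'django_cron.backends.lock.cache.CacheLock'
DJANGO_CRON_LOCK_BACKEND = 'myapp.locks.CustomCronJobLock'  # hypothetical path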