getredash / redash

Make Your Company Data Driven. Connect to any data source, easily visualize, dashboard and share your data.
http://redash.io/
BSD 2-Clause "Simplified" License

Periodic jobs are removed from Redis, which causes scheduled jobs to stop #5821

Open showwin opened 2 years ago

showwin commented 2 years ago

Issue Summary

From the user's perspective, scheduled jobs run correctly for a while and then suddenly stop working.

From a technical perspective, some of the periodic jobs stored in Redis are removed for some reason, which is why the scheduled jobs stop running. Details are below.

My Redash is running on a Kubernetes cluster.

Steps to Reproduce

This problem occurs irregularly. In my case it happens every 10-30 days, so it is difficult to reproduce intentionally. It has happened to me nine times in total.

Technical details:

Redash Version: Docker image redash/redash:10.1.0.b50633
How did you install Redash: Using Kubernetes to run containers

The periodic jobs expected to be stored in Redis look like this:

for job in rq_scheduler.get_jobs():
    print(job)

<Job 18281a865ed3d1a60f366aeb8596fe2283aa421f: redash.tasks.general.sync_user_details()>
<Job 305c0cae0a196ae96915fd2b6f81001c435aad65: redash.tasks.queries.maintenance.refresh_queries()>
<Job 59a84668e68687338646f735965213c58e814b32: redash.tasks.queries.maintenance.remove_ghost_locks()>
<Job e27209059575fcc17c527c47d0957cb21756e551: redash.tasks.queries.maintenance.cleanup_query_results()>
<Job 52bcd40c254552539398db3cdb15055d4a9a536a: redash.tasks.queries.maintenance.refresh_schemas()>
<Job 75540cdf868e7873f5eec072177423b13b98dece: redash.tasks.queries.maintenance.empty_schedules()>
<Job 797f4f959c6c96c738efd445b17db75742db80a9: redash.tasks.failure_report.send_aggregated_errors()>
<Job 526f2c7f55dfe92457914cd7df02cfffe4dec877: redash.tasks.general.version_check()>

At least these eight jobs should be stored, but when this issue occurred, rq_scheduler.get_jobs() returned:

for job in rq_scheduler.get_jobs():
    print(job)

<Job 52bcd40c254552539398db3cdb15055d4a9a536a: redash.tasks.queries.maintenance.refresh_schemas()>
<Job 75540cdf868e7873f5eec072177423b13b98dece: redash.tasks.queries.maintenance.empty_schedules()>
<Job 797f4f959c6c96c738efd445b17db75742db80a9: redash.tasks.failure_report.send_aggregated_errors()>
<Job 526f2c7f55dfe92457914cd7df02cfffe4dec877: redash.tasks.general.version_check()>

Some jobs were removed from rq_scheduler.
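
The comparison above can be automated as a simple health check. This is a minimal sketch, not part of Redash or rq_scheduler: `missing_jobs` is a hypothetical helper, and both lists are copied from the two outputs above. In a live deployment the `actual` list would presumably be built from the jobs returned by `rq_scheduler.get_jobs()`.

```python
# Function paths of the periodic jobs expected in the scheduler
# (taken from the first listing above).
EXPECTED = [
    "redash.tasks.general.sync_user_details",
    "redash.tasks.queries.maintenance.refresh_queries",
    "redash.tasks.queries.maintenance.remove_ghost_locks",
    "redash.tasks.queries.maintenance.cleanup_query_results",
    "redash.tasks.queries.maintenance.refresh_schemas",
    "redash.tasks.queries.maintenance.empty_schedules",
    "redash.tasks.failure_report.send_aggregated_errors",
    "redash.tasks.general.version_check",
]


def missing_jobs(expected, actual):
    """Return the expected function paths absent from `actual`, in expected order."""
    present = set(actual)
    return [name for name in expected if name not in present]


# Function paths still present when the issue occurred (second listing above).
actual = [
    "redash.tasks.queries.maintenance.refresh_schemas",
    "redash.tasks.queries.maintenance.empty_schedules",
    "redash.tasks.failure_report.send_aggregated_errors",
    "redash.tasks.general.version_check",
]

for name in missing_jobs(EXPECTED, actual):
    print("missing:", name)
```

Running a check like this on a timer would surface the problem as soon as a job disappears, rather than when users notice stale queries.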

As far as I could tell, the root cause seems to be the result_ttl parameter that is defined around here. As another person reported here, rq_scheduler advises caution with the result_ttl parameter for a repeated job in its README:

IMPORTANT NOTE: If you set up a repeated job, you must make sure that you either do not set a result_ttl value or you set a value larger than the interval. Otherwise, the entry with the job details will expire and the job will not get re-scheduled.
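
The rule in that note can be written down as a small check. This is a sketch under stated assumptions: `result_ttl_is_safe` is a hypothetical helper (not an rq_scheduler function), both values are in seconds, `None` stands for "result_ttl not set", and -1 is treated as "never expire", which is how RQ documents that value.

```python
def result_ttl_is_safe(interval, result_ttl=None):
    """Apply the README rule for a repeated job.

    Safe when result_ttl is not set (None), never expires (-1),
    or is strictly larger than the repeat interval.
    """
    if result_ttl is None or result_ttl == -1:
        return True
    return result_ttl > interval


# A job repeating every 5 minutes with a 10-minute result_ttl passes the rule;
# a job repeating every hour with the same result_ttl does not.
print(result_ttl_is_safe(300, 600))
print(result_ttl_is_safe(3600, 600))
```

The check only encodes the README's condition. It says nothing about other ways a job entry could expire, for example when a run takes unusually long.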

result_ttl has been in the Redash codebase from the very beginning, when Redash replaced Celery with RQ (ref), and the longer result_ttl=600 was introduced by this PR to extend the default value. I couldn't find a strong reason why Redash sets the result_ttl parameter.

How about removing the result_ttl parameter or setting result_ttl=-1 explicitly? If that sounds good, I'll create a PR with the fix.

P.S. I also searched for code that deletes jobs from rq_scheduler, but the only such code is in the initialization process.
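
The proposal above to set result_ttl=-1 explicitly could look roughly like the following. This is an illustrative sketch, not Redash's actual code: `without_expiring_result` and `refresh` are hypothetical names, and the dictionary layout (a `func`, an `interval`, an optional `result_ttl`) is an assumption about how a periodic job definition might be passed around.

```python
from datetime import timedelta


def without_expiring_result(job_kwargs):
    """Return a copy of `job_kwargs` with `interval` in seconds and result_ttl=-1."""
    kwargs = dict(job_kwargs)  # copy so the caller's definition is not mutated
    interval = kwargs["interval"]
    if isinstance(interval, timedelta):
        interval = int(interval.total_seconds())
    kwargs["interval"] = interval
    kwargs["result_ttl"] = -1  # -1: never expire, whatever value was set before
    return kwargs


def refresh():
    """Placeholder for a periodic task (hypothetical)."""


job = {"func": refresh, "interval": timedelta(minutes=5), "result_ttl": 600}
print(without_expiring_result(job))
```

The resulting dictionary is the kind of thing that could then be unpacked into the scheduler's `schedule(...)` call along with a `scheduled_time`; that wiring is left out here because it needs a running Redis instance.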

showwin commented 2 years ago

I have started running periodic jobs without the result_ttl parameter in my Redash, so I will be able to report whether the same issue happens after a month or so.

In [31]: for job in rq_scheduler.get_jobs():
    ...:     print(job, job.result_ttl)
    ...:
<Job f62bec30c67e00a7fc03337072b74227ac70c24b: redash.tasks.queries.maintenance.refresh_queries()> -1
<Job 27ccf7679d55fd368fa2a1a6262864c92ef411cf: redash.tasks.queries.maintenance.remove_ghost_locks()> -1
<Job b782f61249584a20ce2ce4e4c7fe13f09ca541ab: redash.tasks.general.sync_user_details()> -1
<Job 7f43bfe21b320bd6b3708d360909ebcfc2cd11c2: redash.tasks.queries.maintenance.cleanup_query_results()> -1
<Job 10d88ccbd46893f33fe792c7feb41534244a0b09: redash.tasks.queries.maintenance.refresh_schemas()> -1
<Job 5ae7d296b02dd520401aa5983db5b36b62828c6f: redash.tasks.failure_report.send_aggregated_errors()> -1
<Job 6d9d0f6047c92bb47d1a9895003f7c82f96533f0: redash.tasks.queries.maintenance.empty_schedules()> -1
<Job 75692428f53afeb43a549ad66fc3610c25f4d467: redash.tasks.general.version_check()> -1

showwin commented 2 years ago

How about removing the result_ttl parameter or setting result_ttl=-1 explicitly?

Almost a month has passed since I made the above change to my Redash, and I have had no problems 👍

yungene commented 3 weeks ago

I was having a similar issue, and removing the result_ttl parameter fixed it. The change is to this part of the source code: https://github.com/getredash/redash/blob/49277d27f8a8b17f541948b741539a612bfacc00/redash/tasks/schedule.py#L42-L50