getredash / redash

Make Your Company Data Driven. Connect to any data source, easily visualize, dashboard and share your data.
http://redash.io/
BSD 2-Clause "Simplified" License
26.47k stars 4.39k forks

Redis decode_responses rq incompatibility #6424

Open · SchrosCat2013 opened this issue 1 year ago

SchrosCat2013 commented 1 year ago

Redash is setting the flag decode_responses for its Redis connection. This is incompatible with the rq library - https://github.com/rq/rq/issues/1188.

We are having some issues, and I suspect this might be the underlying cause. Primarily:

The second issue eventually consumes the entire query pool and prevents any query from running. When this happens, we have to manually flush the Redis cache.

We are running Redash v10.1.0 from the redash/redash:10.1.0.b50633 docker image.

We found https://github.com/getredash/redash/issues/5801, which looks like the same issue we are having. We have upgraded rq to 1.10.1 and rq-scheduler to 0.10.0 in our environment. This does seem to have alleviated the stuck-queries issue, but it doesn't seem to have stopped scheduled queries from executing multiple times. However, we are now seeing the error UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9c in position 1: invalid start byte repeated throughout the logs. The stack trace here matches the linked issue in the rq library. I am not sure that upgrading rq caused these errors; I suspect that it has improved the error logging and is surfacing an existing issue.
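For what it's worth, that particular byte offset is suggestive: rq stores some job payloads zlib-compressed, and a zlib stream typically begins with the bytes 0x78 0x9c, so a client that force-decodes every response as UTF-8 fails at exactly position 1. A minimal stand-alone reproduction of the failure mode (not Redash code, just an illustration):

```python
import pickle
import zlib

# A zlib stream at the default compression level starts with 0x78 0x9c,
# so byte 0x9c sits at position 1 of the payload.
payload = zlib.compress(pickle.dumps({"query_id": 1}))
assert payload[1] == 0x9C

# With decode_responses=True, redis-py tries to UTF-8-decode every value
# it returns, which blows up on binary payloads like this one:
try:
    payload.decode("utf-8")
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode byte 0x9c in position 1: ...
```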

justinclift commented 1 year ago

Thanks, that looks like an important bug to investigate / fix. :smile:

justinclift commented 1 year ago

Our latest development code is using rq 1.5.0 and rq-scheduler 0.10.0, which both seem fairly old now: https://github.com/getredash/redash/blob/c2e7df098dc267f35a2122c6cbab8ffb96feec21/requirements.txt#L47-L48

Latest versions are currently:

We should be able to bump them up to much newer, and hopefully also adjust the connection options without too much hassle.

justinclift commented 1 year ago

This seems to be where decode_responses=True is added:

https://github.com/getredash/redash/blob/c2e7df098dc267f35a2122c6cbab8ffb96feec21/redash/settings/helpers.py#L47-L66

Looks like that was originally added during the migration of Redash from Python 2 to Python 3: https://github.com/getredash/redash/commit/246eca1121fad13fcb873de7d7c4e83345e41059

* Decode bytes returned from Redis
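The actual helper lives in redash/settings/helpers.py (linked above). For reference, a rough sketch of what appending that option to a Redis URL involves, using only the standard library; the real implementation may differ in detail:

```python
from urllib.parse import parse_qs, urlencode, urlparse, urlunparse

def add_decode_responses_to_redis_url(url):
    """Append decode_responses=True to a Redis URL, preserving any
    query parameters that are already present."""
    parts = urlparse(url)
    query = parse_qs(parts.query)
    query["decode_responses"] = ["True"]
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))

print(add_decode_responses_to_redis_url("redis://localhost:6379/0"))
# redis://localhost:6379/0?decode_responses=True
```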
justinclift commented 1 year ago

Not setting that option might be as simple as removing the helper function and the calls to it:

https://github.com/getredash/redash/blob/c2e7df098dc267f35a2122c6cbab8ffb96feec21/redash/settings/__init__.py#L9

https://github.com/getredash/redash/blob/c2e7df098dc267f35a2122c6cbab8ffb96feec21/redash/settings/__init__.py#L22

Am not finding any info about why the option was added in the first place. It was likely done for a reason, though what that might be I have no idea.

We can probably try removing that option and then see if our CI tests pass. That will at least let us know that Redash might work without it. We'd ideally want to see it operate without causing errors in practice, though.

justinclift commented 1 year ago

Ahhh. Looks like the rq docs specifically call out that option as unsupported these days too:

https://github.com/rq/rq/pull/1833/files

Yeah, let's try removing it. :smile:

justinclift commented 1 year ago

Looks like just removing that function completely isn't going to work as-is. When I did so just now in a test on my local computer, the backend unit tests started throwing errors:

========================================================================== short test summary info ==========================================================================
FAILED tests/test_models.py::QueryOutdatedQueriesTest::test_outdated_queries_works_scheduled_queries_tracker - AssertionError: Query(id=1, name='Query', query_hash='2d5127b8139f1f5f1086c4053324733a', version=1, user_id=1, org_id=1, data_source_id=1, query_hash='2d5127b8139f1f5f1...
FAILED tests/handlers/test_data_sources.py::TestDataSourcePausePost::test_pause_sets_reason - AssertionError: b'testing' != 'testing'
FAILED tests/handlers/test_query_results.py::TestQueryResultListAPI::test_execute_query_with_params - TypeError: can only concatenate str (not "bytes") to str
FAILED tests/models/test_data_sources.py::TestDataSourceIsPaused::test_allows_setting_reason - AssertionError: b'Some good reason.' != 'Some good reason.'
FAILED tests/models/test_users.py::TestUserDetail::test_sync - sqlalchemy.exc.ProgrammingError: (psycopg2.errors.UndefinedFunction) operator does not exist: integer = bytea
FAILED tests/tasks/test_queries.py::TestEnqueueTask::test_multiple_enqueue_of_same_query - AttributeError: 'bytes' object has no attribute 'id'
FAILED tests/tasks/test_queries.py::TestEnqueueTask::test_reenqueue_during_job_cancellation - AttributeError: 'bytes' object has no attribute 'id'
FAILED tests/tasks/test_worker.py::TestWorkerMetrics::test_worker_records_success_metrics - TypeError: can only concatenate str (not "bytes") to str
FAILED tests/tasks/test_worker.py::TestQueueMetrics::test_enqueue_query_records_created_metric - TypeError: can only concatenate str (not "bytes") to str
==================================================== 9 failed, 787 passed, 1 skipped, 1175 warnings in 182.47s (0:03:02) ====================================================
make: *** [Makefile:40: backend-unit-tests] Error 1

The errors like can only concatenate str (not "bytes") to str should be solvable by casting the byte values to string first. However, I'm concerned there may be deeper issues from this approach. Let's see what others say...
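If we went the casting route, a hypothetical helper along these lines (not existing Redash code) would cover the failing assertions:

```python
def ensure_str(value):
    """Decode Redis byte responses to str, leaving other types alone."""
    if isinstance(value, bytes):
        return value.decode("utf-8")
    return value

# Mirrors the failing test assertion b'testing' != 'testing':
assert ensure_str(b"testing") == "testing"
assert ensure_str("already-a-str") == "already-a-str"
assert ensure_str(42) == 42
```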

@guidopetri Any interest in this one? :smile:

justinclift commented 1 year ago

Thinking about it more, the existing code base already avoids setting that option on the connection used for rq:

# _REDIS_URL is the unchanged REDIS_URL we get from env vars, to be used later with RQ
_REDIS_URL = os.environ.get("REDASH_REDIS_URL", os.environ.get("REDIS_URL", "redis://localhost:6379/0"))
# This is the one to use for Redash' own connection:
REDIS_URL = add_decode_responses_to_redis_url(_REDIS_URL)

So _REDIS_URL is what's apparently getting used for the connection to rq, and has decode_responses unset.

Whereas REDIS_URL (different to _REDIS_URL) has decode_responses set, but isn't used for connecting to rq.


So, my reading of things is that we should definitely try upgrading the rq and rq-scheduler libraries to see if it helps your problem. The decode_responses option probably isn't involved here after all, as the connection to rq doesn't use it.

Am testing the upgraded libraries on my local computer at the moment. If they pass our CI tests on my local computer, then I'll make a PR out of it. :smile:

justinclift commented 1 year ago

I've just created PR #6426, which upgrades the rq related dependencies as far as they can go without needing changes to our code base or tests.

@SchrosCat2013 Would you be ok to try adjusting your redis, rq, and rq-scheduler dependencies to match that, and see if it helps fix the problems?

And yeah, I'm aware you've been trying a slightly newer version of rq. That newer rq version isn't working 100% with our current tests, so might be causing some of the problem.

SchrosCat2013 commented 1 year ago

Thanks, I'll give that a go and let you know how it turns out

guidopetri commented 1 year ago

can only concatenate str (not "bytes") to str

This does indeed probably stem from Redis, but I agree that it's probably unrelated to the original error. AFAIK Redis by default returns bytes, and you have to decode with utf-8 to get str type, so that's what the error in the tests is complaining about.

SchrosCat2013 commented 1 year ago

Unfortunately, swapping to the suggested versions seems to have exacerbated the problem, if anything.

Part of this might be our environment - we're currently running an older version of Redis (v3.2.10). The suggested upgrade of redis-py puts this outside the list of supported versions. I'll give these library updates another go once we have upgraded Redis.

In the meantime, I'm going to see if I can stand up a Redash development environment and reproduce the issues there.

justinclift commented 1 year ago

Interesting. Hopefully we can all get this solved sooner rather than later. :smile:

In the meantime, I'm going to see if I can stand up a Redash development environment and reproduce the issues there.

For the dev environment setup, please use the wiki page documentation rather than the knowledge base pages:

https://github.com/getredash/redash/wiki/Local-development-setup

We need to update the knowledge base pages to include the wiki stuff, but haven't gotten around to it yet.

SchrosCat2013 commented 1 year ago

I've had a chance to look into this. I found one place where the wrong connection was being passed to an RQ function leading to the UnicodeDecodeError. https://github.com/getredash/redash/pull/6539

While looking for the problem, I noticed that the enqueue_query method is creating a Redis pipeline to watch a value, but the pipeline isn't being reset. I don't think this is related to the issue at hand, but it seems like something that should be fixed. I've created a PR https://github.com/getredash/redash/pull/6540 that calls reset inside a finally block. The other option is to change line 39 to use a context manager (with redis_connection.pipeline() as pipe:), but that means another layer of nesting, and the code in that function is already quite nested.
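To illustrate the two options with a minimal stand-in for a redis-py pipeline (the real object supports both patterns; this stub just records whether reset() ran):

```python
class FakePipeline:
    """Minimal stand-in for a redis-py pipeline that records reset()."""

    def __init__(self):
        self.was_reset = False

    def watch(self, key):
        pass

    def reset(self):
        self.was_reset = True

    # Context-manager support, as redis-py pipelines provide:
    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.reset()
        return False

# Option 1: explicit reset in a finally block (the approach in the PR).
# The pipeline is reset even if enqueueing raises.
pipe = FakePipeline()
try:
    pipe.watch("query_hash")
    raise RuntimeError("enqueue failed")
except RuntimeError:
    pass
finally:
    pipe.reset()
assert pipe.was_reset

# Option 2: a context manager, at the cost of one more nesting level.
with FakePipeline() as pipe2:
    pipe2.watch("query_hash")
assert pipe2.was_reset
```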

I believe the actual source of the problem is the HardLimitingWorker.

This performs a heartbeat on the Worker, but not on the Job. Once a query is picked off the queue for execution, RQ gives the job a default timeout of 60s. Normally this would be updated every time the job heartbeat is called. Nothing happens immediately after this timeout expires, but the next time the RQ StartedJobRegistry.cleanup method is called, the job's status is set to failed. The job will continue running, and the job (execute_query) will still be listed on the Redash Admin RQ status page. However, the query will not appear on the query tab of the Admin RQ status page. It will also appear to have stopped executing to a user who has the query open, possibly leading them to click refresh and start another copy of the query.
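A toy model of the interaction described above (a deliberate simplification, not rq's actual implementation): jobs carry an expiry timestamp, cleanup marks anything past its expiry as failed regardless of whether it is still running, and only a job heartbeat pushes the expiry forward.

```python
class ToyStartedJobRegistry:
    """Loose sketch of rq's StartedJobRegistry bookkeeping."""

    def __init__(self):
        self.jobs = {}      # job_id -> expiry timestamp
        self.statuses = {}  # job_id -> status

    def add(self, job_id, timeout, now):
        self.jobs[job_id] = now + timeout
        self.statuses[job_id] = "started"

    def heartbeat(self, job_id, timeout, now):
        # The job-level heartbeat that HardLimitingWorker never performs:
        self.jobs[job_id] = now + timeout

    def cleanup(self, now):
        # Anything past its expiry is marked failed, even if the worker
        # is still happily executing it.
        for job_id, expiry in list(self.jobs.items()):
            if expiry < now:
                del self.jobs[job_id]
                self.statuses[job_id] = "failed"

registry = ToyStartedJobRegistry()
registry.add("query-1", timeout=60, now=0)

# 61 "seconds" later, no job heartbeat has refreshed the expiry:
registry.cleanup(now=61)
print(registry.statuses["query-1"])  # failed, though the query still runs

# With a job heartbeat before the timeout, the job survives cleanup:
registry.add("query-2", timeout=60, now=0)
registry.heartbeat("query-2", timeout=60, now=50)
registry.cleanup(now=61)
print(registry.statuses["query-2"])  # started
```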

The StartedJobRegistry.cleanup method is called as a result of calling get_job_ids, and somewhat ironically it looks like this happens in two places in Redash:

Rather than adding a call to the job heartbeat from within the HardLimitingWorker, I believe the best fix in this case is to remove the HardLimitingWorker entirely. It looks like this was cloned from an early version of RQ, and its purpose is to hard-kill a job that runs for too long without respecting its time limit. This functionality has since been baked directly into RQ. It was added in version 1.2.1, and has undergone a number of iterations of bugfixes.

I have managed to reproduce this locally for ad-hoc queries. I'm not certain if this is also the cause of the scheduled queries running multiple times yet, but will give the fixes a go and get back to you with the results.

justinclift commented 1 year ago

Sounds like a really well done investigation. Awesome. :smile:

justinclift commented 1 year ago

I believe the best fix in this case is to remove the HardLimitingWorker entirely. ...

Want to throw together a PR for this?

Also, any ideas on how we can definitively test the broken behaviour beforehand (smoking gun style), and test that it's fixed afterwards?

SchrosCat2013 commented 1 year ago

Happy to throw a PR together for this.

I'd have to give some thought about how you would go about testing this. As the broken behaviour is only exhibited once the RQ initial timeout of 60s has passed, you'd either have to have a test that waited 60s, messed with the internals of RQ, or directly modified the timeout for the job's underlying entry in Redis. None of those sound particularly appealing, though I could be persuaded on the third option.

justinclift commented 1 year ago

... or directly modified the timeout for the job's underlying entry in Redis.

Of the three, that one sounds most reasonable for our purposes here. But, lets ask @guidopetri as he's a lot better with Python than me. :smile:

guidopetri commented 1 year ago

imo waiting 60s for a test isn't that big of a deal; I'd be fine with that. Messing with the internals of Redis sounds more error-prone, and in my opinion doesn't give enough benefit to be worth doing instead.

pavelbrylov commented 1 month ago

Hey @justinclift @guidopetri, it seems that @SchrosCat2013 merged the changes in his fork. Would it be possible to add them to upstream as well? We seem to be experiencing exactly this issue after upgrading from 10.1.0 to the latest version: ad-hoc queries that normally run longer than 1 minute get status 4 in the job status response, so the UI stops polling for results:

{
    "job": {
        "id": "b9b013b8-b9ee-4e29-a2a1-a49dd28a85b7",
        "updated_at": 0,
        "status": 4,
        "error": "",
        "result": null,
        "query_result_id": null
    }
}

I can see in Redis that this job does indeed have a "failed" status, but since the worker is still running, it eventually becomes completed. Here is an excerpt from Redis:

> hgetall rq:job:d43b2a5c-12b6-43e4-992a-3a72a41a8a0c
 1) "created_at"
 2) "2024-10-18T10:22:00.880042Z"
 5) "status"
 6) "failed"
 7) "enqueued_at"
 8) "2024-10-18T10:22:00.880157Z"
 9) "worker_name"
10) "c642c94ddcc84b12b47c3a632a7e58a9"
11) "failure_ttl"
12) "604800"
15) "failure_callback_name"
16) ""
17) "origin"
18) "queries"
21) "success_callback_name"
22) ""
23) "last_heartbeat"
24) "2024-10-18T10:22:00.907381Z"
25) "timeout"
26) "-1"
27) "stopped_callback_name"
28) ""
31) "started_at"
32) "2024-10-18T10:22:00.907381Z"
33) "result_ttl"
34) "43200"
35) "ended_at"
36) ""

> hgetall rq:job:d43b2a5c-12b6-43e4-992a-3a72a41a8a0c
 1) "created_at"
 2) "2024-10-18T10:22:00.880042Z"
 5) "status"
 6) "finished"
 7) "enqueued_at"
 8) "2024-10-18T10:22:00.880157Z"
 9) "worker_name"
10) "c642c94ddcc84b12b47c3a632a7e58a9"
11) "failure_ttl"
12) "604800"
15) "failure_callback_name"
16) ""
17) "origin"
18) "queries"
21) "success_callback_name"
22) ""
23) "last_heartbeat"
24) "2024-10-18T10:33:22.015917Z"
25) "timeout"
26) "-1"
27) "stopped_callback_name"
28) ""
31) "started_at"
32) "2024-10-18T10:22:00.945679Z"
33) "result_ttl"
34) "43200"
35) "ended_at"
36) "2024-10-18T10:33:22.015915Z"

Thanks in advance!