@aembke
What does `redis.get_replicas::<_, u64>("{KEY}").await` do? Is that just shorthand for `client.replicas().get("...").await`?
Maybe unrelated to the recovery issues - you may also want to look into https://redis.io/commands/wait/ if you're writing to primaries and then quickly turning around and reading those values from replicas.
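For context, the race `WAIT` guards against looks roughly like this (a minimal sketch, not code from this issue; it assumes fred exposes arbitrary commands through its `custom`/`cmd!` escape hatch - double-check the exact signatures against the fred docs):

```rust
use fred::{cmd, prelude::*};

// Sketch: write to the primary, then block until at least one replica has
// acknowledged the write before reading the value back from a replica.
async fn write_then_read(client: &RedisClient, key: &str) -> Result<Option<u64>, RedisError> {
  client.set::<(), _, _>(key, 42, None, None, false).await?;

  // `WAIT numreplicas timeout-ms` returns how many replicas acked the write.
  // Without it, the replica read below can race the asynchronous replication
  // stream and observe a stale or missing value.
  let acked: i64 = client.custom(cmd!("WAIT"), vec!["1", "500"]).await?;
  if acked < 1 {
    return client.get(key).await; // no replica acked in time; read the primary
  }
  client.replicas().get(key).await
}
```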
Either way thanks for the repro and I'll take a look soon.
Hi @aembke !
Yes, it's a simple wrapper:
```rust
pub async fn get_replicas<K, V>(&self, key: K) -> Result<Option<V>, RedisError>
where
  K: Into<RedisKey> + Clone + Send,
  V: 'static + FromRedis + Unpin + Send,
{
  trace!("RedisConnector::get_replicas()");
  self.pool.replicas().get(key.clone()).await
}
```
> Maybe unrelated to the recovery issues - you may also want to look into https://redis.io/commands/wait/ if you're writing to primaries and then quickly turning around and reading those values from replicas.
It's exactly unrelated. :) This code is just an example - a write and a read.
> Either way thanks for the repro and I'll take a look soon.
Thank You!
Stopping (`stop`, not `pause`) the replica has more dramatic consequences. The recovery procedure takes more than 10 minutes (!!!) and it doesn't look like it will ever complete...
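(For reference, the difference between the two scenarios - the container name is a placeholder:)

```bash
# "pause" freezes the process: sockets stay open but stop responding.
docker pause <replica-container>

# "stop" shuts the container down: sockets close and reader tasks end at once.
docker stop <replica-container>
```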
```
2024-02-01T11:38:43.927741Z INFO test_redis: +[OK] 1000/0 Took 47.428167ms
2024-02-01T11:38:44.275736Z WARN fred::router::responses: fred-lHDr936vur: Ending replica reader task from 127.0.0.1:7005 due to None
2024-02-01T11:38:44.281920Z WARN fred::router::responses: fred-LLOQom3aks: Ending replica reader task from 127.0.0.1:7005 due to None
2024-02-01T11:38:44.281990Z WARN fred::router::responses: fred-eQbTM2H9wT: Ending replica reader task from 127.0.0.1:7005 due to None
2024-02-01T11:38:44.282262Z WARN fred::router::responses: fred-9ArMyREOjA: Ending replica reader task from 127.0.0.1:7005 due to None
2024-02-01T11:38:44.282407Z WARN fred::router::responses: fred-B0KaZhe4jU: Ending replica reader task from 127.0.0.1:7005 due to None
2024-02-01T11:38:44.282558Z WARN fred::router::responses: fred-rLjpIOhB3W: Ending replica reader task from 127.0.0.1:7005 due to None
2024-02-01T11:39:14.450113Z ERROR test_redis: -[ERR] 715/285 Took 30.069008s
2024-02-01T11:39:44.466894Z ERROR test_redis: -[ERR] 0/1000 Took 30.015474667s
2024-02-01T11:40:14.480707Z ERROR test_redis: -[ERR] 0/1000 Took 30.013087916s
2024-02-01T11:40:44.494171Z ERROR test_redis: -[ERR] 0/1000 Took 30.012756333s
2024-02-01T11:41:14.534803Z ERROR test_redis: -[ERR] 0/1000 Took 30.01477575s
2024-02-01T11:41:44.553189Z ERROR test_redis: -[ERR] 0/1000 Took 30.013585417s
2024-02-01T11:42:14.565962Z ERROR test_redis: -[ERR] 0/1000 Took 30.011364542s
2024-02-01T11:42:44.580175Z ERROR test_redis: -[ERR] 0/1000 Took 30.013245708s
2024-02-01T11:43:14.595038Z ERROR test_redis: -[ERR] 0/1000 Took 30.01393975s
2024-02-01T11:43:52.643292Z ERROR test_redis: -[ERR] 20/980 Took 38.047119417s
2024-02-01T11:44:42.708879Z ERROR test_redis: -[ERR] 163/837 Took 50.064030334s
2024-02-01T11:45:12.745757Z ERROR test_redis: -[ERR] 85/915 Took 30.035667875s
2024-02-01T11:45:42.764750Z ERROR test_redis: -[ERR] 136/864 Took 30.017902125s
2024-02-01T11:46:12.787971Z ERROR test_redis: -[ERR] 127/873 Took 30.022322584s
2024-02-01T11:46:42.807797Z ERROR test_redis: -[ERR] 110/890 Took 30.018946375s
2024-02-01T11:47:12.829260Z ERROR test_redis: -[ERR] 79/921 Took 30.020552875s
2024-02-01T11:47:42.849937Z ERROR test_redis: -[ERR] 99/901 Took 30.019773458s
2024-02-01T11:48:12.868937Z ERROR test_redis: -[ERR] 143/857 Took 30.018112042s
2024-02-01T11:48:42.890067Z ERROR test_redis: -[ERR] 113/887 Took 30.020275458s
```
This should be fixed in https://github.com/aembke/fred.rs/pull/213.
In my repro the cluster took about 20 seconds to notice and promote the replica node, so make sure to tune the timeout and retry attempt counts accordingly. The values I used are here: https://github.com/aembke/fred.rs/blob/62eef17be62735d48c3ec114ef219ba852f5fe45/bin/replica_consistency/src/main.rs#L89-L124
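In rough outline the tuning looks like this (a sketch only: the field names come from this thread, but where they live and their exact types vary by fred version, so treat the linked main.rs as authoritative):

```rust
use fred::prelude::*;
use std::time::Duration;

// Assumed layout: Builder/ConnectionConfig/PerformanceConfig knobs as named
// in this thread. Verify against the linked replica_consistency main.rs.
let config = RedisConfig::from_url("redis-cluster://127.0.0.1:30001")?;
let pool = Builder::from_config(config)
  .with_connection_config(|conn| {
    // 0 (the default) re-syncs the cluster topology immediately on changes.
    conn.cluster_cache_update_delay = Duration::from_millis(0);
  })
  .with_performance_config(|perf| {
    perf.default_command_timeout = Duration::from_secs(30);
  })
  // Retries must outlast the ~20s the cluster needs to promote a replica.
  .set_policy(ReconnectPolicy::new_exponential(0, 100, 30_000, 2))
  .build_pool(6)?;
```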
Hi, @aembke ! Great job! Thank You!
Looks like the issue is fixed. BUT the failover procedure is broken again...
```
2024-02-05T06:16:49.297639Z INFO test_redis: +[OK] 1000/0 Took 51.926417ms
2024-02-05T06:16:54.829474Z WARN fred::router::utils: fred-YEIuoVbipl: Unresponsive connection to 127.0.0.1:7006 after 5.029120292s
2024-02-05T06:16:54.829474Z WARN fred::router::utils: fred-ETIAqyuCud: Unresponsive connection to 127.0.0.1:7006 after 5.028842167s
2024-02-05T06:16:54.829716Z WARN fred::router::utils: fred-koORDGrZKf: Unresponsive connection to 127.0.0.1:7006 after 5.029001917s
2024-02-05T06:16:54.830613Z WARN fred::router::responses: fred-YEIuoVbipl: Ending reader task from 127.0.0.1:7006 due to Some(Redis Error - kind: IO, details: Unresponsive connection.)
2024-02-05T06:16:54.831580Z WARN fred::router::responses: fred-ETIAqyuCud: Ending reader task from 127.0.0.1:7006 due to Some(Redis Error - kind: IO, details: Unresponsive connection.)
2024-02-05T06:16:54.832018Z WARN fred::router::utils: fred-7idWtPr5WG: Unresponsive connection to 127.0.0.1:7006 after 5.029236042s
2024-02-05T06:16:54.832036Z WARN fred::router::responses: fred-koORDGrZKf: Ending reader task from 127.0.0.1:7006 due to Some(Redis Error - kind: IO, details: Unresponsive connection.)
2024-02-05T06:16:54.832751Z WARN fred::router::responses: fred-7idWtPr5WG: Ending reader task from 127.0.0.1:7006 due to Some(Redis Error - kind: IO, details: Unresponsive connection.)
2024-02-05T06:16:54.833551Z WARN fred::router::utils: fred-HI2RLD29uA: Unresponsive connection to 127.0.0.1:7006 after 5.029117166s
2024-02-05T06:16:54.833886Z WARN fred::router::responses: fred-HI2RLD29uA: Ending reader task from 127.0.0.1:7006 due to Some(Redis Error - kind: IO, details: Unresponsive connection.)
2024-02-05T06:16:54.841037Z WARN fred::router::utils: fred-EONp3Bs9YS: Unresponsive connection to 127.0.0.1:7006 after 5.041559583s
2024-02-05T06:16:54.841288Z WARN fred::router::responses: fred-EONp3Bs9YS: Ending reader task from 127.0.0.1:7006 due to Some(Redis Error - kind: IO, details: Unresponsive connection.)
2024-02-05T06:16:54.891370Z WARN fred::router::responses: fred-koORDGrZKf: Ending reader task from 127.0.0.1:7002 due to None
2024-02-05T06:16:54.892535Z WARN fred::router::responses: fred-koORDGrZKf: Ending reader task from 127.0.0.1:7001 due to None
2024-02-05T06:16:54.904923Z WARN fred::router::responses: fred-EONp3Bs9YS: Ending reader task from 127.0.0.1:7001 due to None
2024-02-05T06:16:54.905680Z WARN fred::router::responses: fred-EONp3Bs9YS: Ending reader task from 127.0.0.1:7002 due to None
2024-02-05T06:16:54.905824Z WARN fred::router::responses: fred-YEIuoVbipl: Ending reader task from 127.0.0.1:7001 due to None
2024-02-05T06:16:54.906428Z WARN fred::router::responses: fred-YEIuoVbipl: Ending reader task from 127.0.0.1:7002 due to None
2024-02-05T06:16:54.915022Z WARN fred::router::responses: fred-ETIAqyuCud: Ending reader task from 127.0.0.1:7001 due to None
2024-02-05T06:16:54.915420Z WARN fred::router::responses: fred-ETIAqyuCud: Ending reader task from 127.0.0.1:7002 due to None
2024-02-05T06:16:54.959773Z WARN fred::router::responses: fred-HI2RLD29uA: Ending reader task from 127.0.0.1:7002 due to None
2024-02-05T06:16:54.960130Z WARN fred::router::responses: fred-HI2RLD29uA: Ending reader task from 127.0.0.1:7001 due to None
2024-02-05T06:16:54.969813Z WARN fred::router::responses: fred-7idWtPr5WG: Ending reader task from 127.0.0.1:7002 due to None
2024-02-05T06:16:54.970270Z WARN fred::router::responses: fred-7idWtPr5WG: Ending reader task from 127.0.0.1:7001 due to None
2024-02-05T06:16:57.311220Z INFO test_redis: +[OK] 1000/0 Took 7.563554666s
2024-02-05T06:16:57.345472Z INFO test_redis: +[OK] 1000/0 Took 34.185125ms
2024-02-05T06:16:57.375784Z INFO test_redis: +[OK] 1000/0 Took 30.25375ms
2024-02-05T06:16:57.406351Z INFO test_redis: +[OK] 1000/0 Took 30.506083ms
2024-02-05T06:16:57.440941Z INFO test_redis: +[OK] 1000/0 Took 34.528125ms
2024-02-05T06:16:57.476054Z INFO test_redis: +[OK] 1000/0 Took 35.057291ms
2024-02-05T06:16:57.505697Z INFO test_redis: +[OK] 1000/0 Took 29.588584ms
2024-02-05T06:16:57.536814Z INFO test_redis: +[OK] 1000/0 Took 31.061334ms
2024-02-05T06:16:57.562554Z INFO test_redis: +[OK] 1000/0 Took 25.682541ms
2024-02-05T06:16:57.589706Z INFO test_redis: +[OK] 1000/0 Took 27.090875ms
2024-02-05T06:16:57.617016Z INFO test_redis: +[OK] 1000/0 Took 27.244083ms
2024-02-05T06:16:57.645447Z INFO test_redis: +[OK] 1000/0 Took 28.374875ms
2024-02-05T06:16:57.672728Z INFO test_redis: +[OK] 1000/0 Took 27.205ms
2024-02-05T06:16:57.697096Z INFO test_redis: +[OK] 1000/0 Took 24.306917ms
2024-02-05T06:16:57.722746Z INFO test_redis: +[OK] 1000/0 Took 25.587ms
2024-02-05T06:16:57.748388Z INFO test_redis: +[OK] 1000/0 Took 25.583375ms
2024-02-05T06:16:57.776261Z INFO test_redis: +[OK] 1000/0 Took 27.812375ms
2024-02-05T06:16:58.292781Z INFO test_redis: +[OK] 1000/0 Took 45.68775ms
2024-02-05T06:16:58.788668Z INFO test_redis: +[OK] 1000/0 Took 42.563666ms
2024-02-05T06:16:59.280182Z INFO test_redis: +[OK] 1000/0 Took 34.642209ms
2024-02-05T06:16:59.793809Z INFO test_redis: +[OK] 1000/0 Took 48.0295ms
2024-02-05T06:17:00.293328Z INFO test_redis: +[OK] 1000/0 Took 47.276042ms
2024-02-05T06:17:00.795847Z INFO test_redis: +[OK] 1000/0 Took 46.226875ms
2024-02-05T06:17:01.290629Z INFO test_redis: +[OK] 1000/0 Took 44.24875ms
2024-02-05T06:17:01.796743Z INFO test_redis: +[OK] 1000/0 Took 51.494084ms
2024-02-05T06:17:02.298013Z INFO test_redis: +[OK] 1000/0 Took 51.381917ms
2024-02-05T06:17:02.794077Z INFO test_redis: +[OK] 1000/0 Took 45.53975ms
2024-02-05T06:17:03.292757Z INFO test_redis: +[OK] 1000/0 Took 45.195833ms
2024-02-05T06:17:03.788030Z INFO test_redis: +[OK] 1000/0 Took 41.859791ms
2024-02-05T06:17:04.297260Z INFO test_redis: +[OK] 1000/0 Took 52.14725ms
2024-02-05T06:17:04.811522Z INFO test_redis: +[OK] 1000/0 Took 55.847625ms
2024-02-05T06:17:05.312864Z INFO test_redis: +[OK] 1000/0 Took 66.387709ms
2024-02-05T06:17:05.815741Z INFO test_redis: +[OK] 1000/0 Took 64.425625ms
2024-02-05T06:17:06.300117Z INFO test_redis: +[OK] 1000/0 Took 54.191417ms
2024-02-05T06:17:06.807931Z INFO test_redis: +[OK] 1000/0 Took 61.80575ms
2024-02-05T06:17:07.306839Z INFO test_redis: +[OK] 1000/0 Took 59.837208ms
2024-02-05T06:17:07.809151Z INFO test_redis: +[OK] 1000/0 Took 60.88275ms
2024-02-05T06:17:08.300369Z INFO test_redis: +[OK] 1000/0 Took 51.3125ms
2024-02-05T06:17:08.798675Z INFO test_redis: +[OK] 1000/0 Took 52.130667ms
2024-02-05T06:17:09.300353Z INFO test_redis: +[OK] 1000/0 Took 53.593417ms
2024-02-05T06:17:09.792797Z INFO test_redis: +[OK] 1000/0 Took 47.176667ms
2024-02-05T06:17:10.298523Z INFO test_redis: +[OK] 1000/0 Took 52.714083ms
2024-02-05T06:17:10.799380Z INFO test_redis: +[OK] 1000/0 Took 52.777583ms
2024-02-05T06:17:11.299379Z INFO test_redis: +[OK] 1000/0 Took 51.69ms
2024-02-05T06:17:11.800617Z INFO test_redis: +[OK] 1000/0 Took 54.530583ms
2024-02-05T06:17:12.293402Z INFO test_redis: +[OK] 1000/0 Took 48.393708ms
2024-02-05T06:17:12.800638Z INFO test_redis: +[OK] 1000/0 Took 54.106834ms
2024-02-05T06:17:13.294047Z INFO test_redis: +[OK] 1000/0 Took 47.259541ms
2024-02-05T06:17:13.799640Z INFO test_redis: +[OK] 1000/0 Took 53.710583ms
2024-02-05T06:17:14.298008Z INFO test_redis: +[OK] 1000/0 Took 52.323167ms
2024-02-05T06:17:14.810454Z INFO test_redis: +[OK] 1000/0 Took 62.208584ms
2024-02-05T06:17:15.290926Z INFO test_redis: +[OK] 1000/0 Took 44.847334ms
2024-02-05T06:17:15.794860Z INFO test_redis: +[OK] 1000/0 Took 49.138834ms
2024-02-05T06:17:16.284327Z INFO test_redis: +[OK] 1000/0 Took 38.358834ms
2024-02-05T06:17:16.807748Z INFO test_redis: +[OK] 1000/0 Took 61.973375ms
2024-02-05T06:17:17.305323Z INFO test_redis: +[OK] 1000/0 Took 58.564125ms
2024-02-05T06:17:17.792644Z INFO test_redis: +[OK] 1000/0 Took 47.008584ms
2024-02-05T06:17:18.288805Z INFO test_redis: +[OK] 1000/0 Took 42.868792ms
2024-02-05T06:17:18.778868Z INFO test_redis: +[OK] 1000/0 Took 32.866958ms
2024-02-05T06:17:19.294915Z INFO test_redis: +[OK] 1000/0 Took 48.823875ms
2024-02-05T06:17:19.794890Z INFO test_redis: +[OK] 1000/0 Took 48.395791ms
2024-02-05T06:17:20.294193Z INFO test_redis: +[OK] 1000/0 Took 48.277792ms
2024-02-05T06:17:20.792790Z INFO test_redis: +[OK] 1000/0 Took 47.009542ms
2024-02-05T06:17:21.294170Z INFO test_redis: +[OK] 1000/0 Took 49.004208ms
2024-02-05T06:17:21.792329Z INFO test_redis: +[OK] 1000/0 Took 47.100459ms
2024-02-05T06:17:22.288217Z INFO test_redis: +[OK] 1000/0 Took 41.982458ms
2024-02-05T06:17:22.795953Z INFO test_redis: +[OK] 1000/0 Took 50.392041ms
2024-02-05T06:17:23.327127Z INFO test_redis: +[OK] 1000/0 Took 78.193875ms
2024-02-05T06:17:23.789954Z INFO test_redis: +[OK] 1000/0 Took 43.816333ms
2024-02-05T06:17:24.297945Z INFO test_redis: +[OK] 1000/0 Took 50.638292ms
2024-02-05T06:17:24.812503Z INFO test_redis: +[OK] 1000/0 Took 64.797084ms
2024-02-05T06:17:25.293402Z INFO test_redis: +[OK] 1000/0 Took 46.272958ms
2024-02-05T06:17:25.791725Z INFO test_redis: +[OK] 1000/0 Took 46.1095ms
2024-02-05T06:17:26.299810Z INFO test_redis: +[OK] 1000/0 Took 53.743166ms
2024-02-05T06:17:26.797863Z INFO test_redis: +[OK] 1000/0 Took 52.334166ms
2024-02-05T06:17:27.301204Z INFO test_redis: +[OK] 1000/0 Took 53.56725ms
2024-02-05T06:17:27.795943Z INFO test_redis: +[OK] 1000/0 Took 49.924542ms
2024-02-05T06:17:28.311294Z INFO test_redis: +[OK] 1000/0 Took 65.154166ms
2024-02-05T06:17:28.801559Z INFO test_redis: +[OK] 1000/0 Took 54.586166ms
2024-02-05T06:17:29.291364Z INFO test_redis: +[OK] 1000/0 Took 45.245584ms
2024-02-05T06:17:29.807761Z INFO test_redis: +[OK] 1000/0 Took 61.308292ms
2024-02-05T06:17:30.296979Z INFO test_redis: +[OK] 1000/0 Took 50.437125ms
2024-02-05T06:17:35.784839Z WARN fred::router::responses: fred-7idWtPr5WG: Ending reader task from 127.0.0.1:7005 due to None
2024-02-05T06:17:35.785277Z WARN fred::router::responses: fred-YEIuoVbipl: Ending reader task from 127.0.0.1:7005 due to None
2024-02-05T06:17:35.785878Z WARN fred::router::responses: fred-koORDGrZKf: Ending reader task from 127.0.0.1:7005 due to None
2024-02-05T06:17:35.787715Z WARN fred::router::responses: fred-ETIAqyuCud: Ending reader task from 127.0.0.1:7005 due to None
2024-02-05T06:17:35.789610Z WARN fred::router::responses: fred-EONp3Bs9YS: Ending reader task from 127.0.0.1:7005 due to None
2024-02-05T06:17:35.794223Z WARN fred::router::responses: fred-HI2RLD29uA: Ending reader task from 127.0.0.1:7005 due to None
2024-02-05T06:18:25.831530Z ERROR test_redis: -[ERR] 0/1000 Took 55.084402084s
2024-02-05T06:18:55.851215Z ERROR test_redis: -[ERR] 0/1000 Took 30.019148125s
2024-02-05T06:19:25.866729Z ERROR test_redis: -[ERR] 0/1000 Took 30.015163875s
2024-02-05T06:19:55.890015Z ERROR test_redis: -[ERR] 0/1000 Took 30.022939083s
2024-02-05T06:20:25.912598Z ERROR test_redis: -[ERR] 0/1000 Took 30.022254958s
2024-02-05T06:20:55.944116Z ERROR test_redis: -[ERR] 0/1000 Took 30.031181583s
2024-02-05T06:21:25.961834Z ERROR test_redis: -[ERR] 0/1000 Took 30.017383417s
2024-02-05T06:21:55.904257Z ERROR test_redis: -[ERR] 0/1000 Took 30.011577542s
```
Scenario: `CLUSTER FAILOVER` to return `127.0.0.1:7006` to the master role.

What error message are you getting on those commands? I was able to repro the last one, but this works for me on `main` currently. Are you still using the same configuration values from your original post?
I'm using https://github.com/aembke/fred.rs/tree/main/bin/replica_consistency to test this. Are you able to repro the issue there by chance? It uses `foo` as the key, which always maps to `redis-cluster-3` first, then fails over to `redis-cluster-4`.

If you use that you can get a `redis-cli` session going inside the docker network like this:
```bash
cd path/to/fred
source ./tests/environ
./tests/runners/docker-bash.sh
# need this again in the container, but respond with yes to build redis-cli
source ./tests/environ
fred_redis_cli -a bar -p 30001 -h redis-cluster-3
```
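From there you can trigger the manual failover yourself (a sketch, assuming `redis-cluster-4` hosts the replica to be promoted, per the mapping above; `CLUSTER FAILOVER` must be run on the replica):

```bash
fred_redis_cli -a bar -p 30001 -h redis-cluster-4
# then, at the redis-cli prompt:
#   CLUSTER FAILOVER
# replies OK and starts the coordinated promotion of redis-cluster-4
```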
The default `config.cluster_cache_update_delay = 0` seems to fix everything.
100ms leads to a bunch of errors but still recovers; 5s leads to complete degradation of the library, as in the previous post.
What are those errors? Timeouts?
> Timeouts?

Yes, the 30s `config.default_command_timeout`.
Sounds good. Just wanted to make sure those commands weren't being dropped unexpectedly.
I'll try to get 8.0.2 published this week.
Hi, @aembke !
Any news? :)
Just published 8.0.2, so the initial thing you reported here should be fixed.
I'll need to do a bit more thinking about `cluster_cache_update_delay`. Initially in the 8.0.1 fix I mentioned that I had to tune this to 5s to make it work, but the reasoning for this change was incorrect at the time.
I set that to 5s in the 8.0.1 repro to reduce the number of write attempts, since that was tuned to be about 3 at the time. If the cluster takes ~20 sec to fail over, and we need to rely on write+timeout tricks to detect a paused container, then we'd need at least ~10 attempts with an exponential backoff starting at 100ms. The 5s cluster sync delay config change was an attempt to make this work without having to change the max write attempts to some big number. This was the wrong approach though since it requires knowing how these two config options are used internally.
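To make that arithmetic concrete (a standalone sketch of uncapped exponential backoff, not fred's internal retry code):

```rust
// Cumulative delay after `attempts` retries of an exponential backoff that
// starts at `base_ms` and doubles each time (no max-delay cap applied).
fn cumulative_backoff_ms(attempts: u32, base_ms: u64) -> u64 {
  (0..attempts).map(|i| base_ms * 2u64.pow(i)).sum()
}

fn main() {
  // 3 attempts cover only 700ms; 10 attempts cover ~102s. A ~20s failover
  // window is only crossed around the 8th attempt (~25.5s cumulative),
  // hence the need for ~10 attempts when backoff starts at 100ms.
  for n in [3u32, 8, 10] {
    println!("{n} attempts: {}ms", cumulative_backoff_ms(n, 100));
  }
}
```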
In this latest 8.0.2 patch I changed the retry logic to specifically not increment the write attempt counter when we're first testing replica connectivity after reconnecting to a cluster node. This should fix the tension between the `max_write_attempts` and `cluster_cache_update_delay` values, but as you note it still presents a potential footgun when combined with timeouts.

I don't necessarily think the potential conflict between timeouts and `cluster_cache_update_delay` is a bug at this point though, since the default is 0 and 8.0.2 addresses the conflict with `max_write_attempts`. I'll have to do a bit more thinking on this though. Let me know if you have any thoughts.
Redis version: 6.2.7
Platform: linux
Using Docker and/or Kubernetes: yes
Deployment type: cluster
Fred version: v8.0.1
Still incorrect cluster failover behavior under high load.
Initial cluster configuration
Based on the code described here https://github.com/aembke/fred.rs/issues/208:
But this time, extend the test scenario by generating load with 500 parallel read-write requests:
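(A minimal illustrative sketch of that load pattern - not the original test code; it assumes a connected fred `RedisPool` named `pool`:)

```rust
use fred::prelude::*;
use futures::future::join_all;

// Spawn 500 parallel tasks; each writes a key to the primary, immediately
// reads it back from a replica, and the results are tallied as OK/ERR.
async fn run_load(pool: &RedisPool) {
  let tasks: Vec<_> = (0..500)
    .map(|i| {
      let client = pool.next().clone();
      tokio::spawn(async move {
        let key = format!("{{KEY}}:{i}");
        client.set::<(), _, _>(key.clone(), i, None, None, false).await?;
        client.replicas().get::<Option<u64>, _>(key).await
      })
    })
    .collect();

  let results = join_all(tasks).await;
  let ok = results.iter().filter(|r| matches!(r, Ok(Ok(_)))).count();
  println!("+[OK] {}/{}", ok, 500 - ok);
}
```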
To Reproduce
Steps to reproduce the behavior:
127.0.0.1:7006
127.0.0.1:7006
The recovery procedure after a failure took more than 2 minutes, although the cluster switches to a replica almost instantly. 1000 parallel tasks make things even more dramatic (about 3 minutes).