aembke / fred.rs

An async Redis client for Rust.
Apache License 2.0

[Bug] Still incorrect cluster failover behavior under high load #212

Closed neogenie closed 9 months ago

neogenie commented 10 months ago

Redis version: 6.2.7
Platform: linux
Using Docker and/or Kubernetes: yes
Deployment type: cluster
Fred version: v8.0.1

Still incorrect cluster failover behavior under high load.

Initial cluster configuration

127.0.0.1:7001> CLUSTER NODES
fe71e451e3c7ffb2a76b8c150863ef4cee675ee5 127.0.0.1:7003@17003 slave fbcca3efe5c305b35afb230ce76d44e8ce045d17 0 1706474488609 2 connected
ac2ad38816c0f0e99e3810ccae9b8a94bb284975 127.0.0.1:7005@17005 master - 0 1706474490000 29 connected 0-5460
bd3b9aedcee33774a5559f73805e37c273c6e4d6 127.0.0.1:7002@17002 master - 0 1706474490000 3 connected 10923-16383
36e2c0e2cbb3afd843c852aa39254101117070eb 127.0.0.1:7006@17006 slave ac2ad38816c0f0e99e3810ccae9b8a94bb284975 0 1706474490590 29 connected
fbcca3efe5c305b35afb230ce76d44e8ce045d17 127.0.0.1:7001@17001 myself,master - 0 1706474489000 2 connected 5461-10922
81562a44b1332615c520bda6892aedced74f51e5 127.0.0.1:7004@17004 slave bd3b9aedcee33774a5559f73805e37c273c6e4d6 0 1706474490589 3 connected

Based on the code described in https://github.com/aembke/fred.rs/issues/208:

let config = RedisConfig {
    server: ServerConfig::Clustered { .. },
    password,
    ..Default::default()
};

let mut builder = Builder::from_config(config);

builder
    .with_performance_config(|config| {
        // Performance Config
        config.backpressure = BackpressureConfig {
            max_in_flight_commands: 50_000_000,
            ..Default::default()
        };
        config.default_command_timeout = Duration::from_secs(30);
    })
    .with_config(|config| {
        // Redis Config
        config.fail_fast = true;
    })
    .with_connection_config(|config| {
        config.connection_timeout = Duration::from_secs(5);
        config.internal_command_timeout = Duration::from_secs(1);
        config.max_command_attempts = 3;

        // TCP Connection Config
        config.tcp = TcpConfig {
            nodelay: Some(true),
            ..Default::default()
        };

        config.replica = ReplicaConfig {
            connection_error_count: 1,
            ..Default::default()
        };

        config.unresponsive = UnresponsiveConfig {
            max_timeout: Some(Duration::from_secs(5)),
            interval: Duration::from_millis(500),
        };

        config.reconnect_errors = vec![
            ReconnectError::ClusterDown,
            ReconnectError::Loading,
            ReconnectError::MasterDown,
            ReconnectError::ReadOnly,
            ReconnectError::Misconf,
            ReconnectError::Busy,
            ReconnectError::NoReplicas,
        ];
    });

// ReconnectPolicy::new_constant(max_attempts, delay_ms); 0 max attempts = retry indefinitely
builder.set_policy(ReconnectPolicy::new_constant(0, 50));

debug!("Create redis connection pool...");
let pool = builder.build_pool(6)?;
// note: `client` below is the reporter's Arc<RedisConnector> wrapper around `pool`
// (its construction is elided in this snippet)
tokio::spawn(async move {
    let mut interval = tokio::time::interval(Duration::from_millis(500));
    loop {
        interval.tick().await;
        test(&client).await;
    }
});

But this time the test scenario is expanded to generate load with 500 parallel read-write requests:

async fn test(client: &Arc<RedisConnector>) {
    let start = Instant::now();
    let mut tasks = (0..500).fold(JoinSet::new(), |mut tasks, _| {
        let redis = client.clone();
        tasks.spawn(async move {
            redis.set("{KEY}", 12345u64, None, None).await?;
            redis.get_replicas::<_, u64>("{KEY}").await
        });
        tasks
    });

    let mut success = 0;
    let mut errors = 0;

    while let Some(res) = tasks.join_next().await {
        match res.unwrap() {
            Ok(value) => {
                if value != Some(12345u64) {
                    debug!("Redis error: empty value by key!");
                    errors += 1;
                } else {
                    success += 1;
                }
            }
            Err(e) => {
                debug!("Redis error: {e}");
                errors += 1;
            }
        }
    }

    let elapsed = start.elapsed();

    if errors > 0 {
        error!("-[ERR] {success}/{errors} Took {elapsed:?}");
    } else {
        info!("+[OK] {success}/{errors} Took {elapsed:?}");
    }
}

To Reproduce

Steps to reproduce the behavior:

  1. Start the redis cluster
  2. Ensure that all is ok and fred reads keys from replica node 127.0.0.1:7006
  3. Pause the master node 127.0.0.1:7005 (the primary whose replica is 127.0.0.1:7006; the logs below show 7005 going unresponsive)
2024-01-30T19:26:28.864076Z  INFO test_redis: +[OK] 500/0 Took 26.282042ms
2024-01-30T19:26:29.369582Z  INFO test_redis: +[OK] 500/0 Took 32.477ms
2024-01-30T19:26:29.856401Z  INFO test_redis: +[OK] 500/0 Took 18.599667ms
2024-01-30T19:26:30.366189Z  INFO test_redis: +[OK] 500/0 Took 28.394291ms
2024-01-30T19:26:30.868040Z  INFO test_redis: +[OK] 500/0 Took 29.873125ms
2024-01-30T19:26:31.368004Z  INFO test_redis: +[OK] 500/0 Took 29.410084ms
2024-01-30T19:26:31.872290Z  INFO test_redis: +[OK] 500/0 Took 33.245916ms
2024-01-30T19:26:32.371476Z  INFO test_redis: +[OK] 500/0 Took 33.381875ms
2024-01-30T19:26:32.874456Z  INFO test_redis: +[OK] 500/0 Took 36.565875ms
2024-01-30T19:26:33.379553Z  INFO test_redis: +[OK] 500/0 Took 40.996459ms
2024-01-30T19:26:33.876453Z  INFO test_redis: +[OK] 500/0 Took 38.948916ms
2024-01-30T19:26:34.377370Z  INFO test_redis: +[OK] 500/0 Took 39.621917ms
2024-01-30T19:26:34.877808Z  INFO test_redis: +[OK] 500/0 Took 39.434708ms
2024-01-30T19:26:35.374201Z  INFO test_redis: +[OK] 500/0 Took 35.65625ms
2024-01-30T19:26:35.874835Z  INFO test_redis: +[OK] 500/0 Took 35.782333ms
2024-01-30T19:26:36.356337Z  INFO test_redis: +[OK] 500/0 Took 19.050084ms
2024-01-30T19:26:36.868662Z  INFO test_redis: +[OK] 500/0 Took 30.897708ms
2024-01-30T19:26:42.392516Z  WARN fred::router::utils: fred-92IMw6iozh: Unresponsive connection to 127.0.0.1:7005 after 5.021883s    
2024-01-30T19:26:42.392351Z  WARN fred::router::utils: fred-ic3402p9FZ: Unresponsive connection to 127.0.0.1:7005 after 5.021967667s    
2024-01-30T19:26:42.392437Z  WARN fred::router::utils: fred-q3fI0NzTeh: Unresponsive connection to 127.0.0.1:7005 after 5.022093958s    
2024-01-30T19:26:42.392781Z  WARN fred::router::utils: fred-UpByYedMb4: Unresponsive connection to 127.0.0.1:7005 after 5.022070375s    
2024-01-30T19:26:42.392968Z  WARN fred::router::utils: fred-hiIodBZHwE: Unresponsive connection to 127.0.0.1:7005 after 5.021939709s    
2024-01-30T19:26:42.393007Z  WARN fred::router::utils: fred-6rMVwhS9EK: Unresponsive connection to 127.0.0.1:7005 after 5.021925958s    
2024-01-30T19:26:42.393428Z  WARN fred::router::responses: fred-hiIodBZHwE: Ending replica reader task from 127.0.0.1:7005 due to Some(Redis Error - kind: IO, details: Unresponsive connection.)    
2024-01-30T19:26:42.393456Z  WARN fred::router::responses: fred-q3fI0NzTeh: Ending replica reader task from 127.0.0.1:7005 due to Some(Redis Error - kind: IO, details: Unresponsive connection.)    
2024-01-30T19:26:42.393467Z  WARN fred::router::responses: fred-92IMw6iozh: Ending replica reader task from 127.0.0.1:7005 due to Some(Redis Error - kind: IO, details: Unresponsive connection.)    
2024-01-30T19:26:42.393547Z  WARN fred::router::responses: fred-UpByYedMb4: Ending replica reader task from 127.0.0.1:7005 due to Some(Redis Error - kind: IO, details: Unresponsive connection.)    
2024-01-30T19:26:42.393598Z  WARN fred::router::responses: fred-ic3402p9FZ: Ending replica reader task from 127.0.0.1:7005 due to Some(Redis Error - kind: IO, details: Unresponsive connection.)    
2024-01-30T19:26:42.393674Z  WARN fred::router::responses: fred-6rMVwhS9EK: Ending replica reader task from 127.0.0.1:7005 due to Some(Redis Error - kind: IO, details: Unresponsive connection.)    
2024-01-30T19:27:07.381200Z ERROR test_redis: -[ERR] 0/500 Took 30.039186958s
2024-01-30T19:27:37.408210Z ERROR test_redis: -[ERR] 0/500 Took 30.02579275s
2024-01-30T19:28:07.440298Z ERROR test_redis: -[ERR] 0/500 Took 30.03138325s
2024-01-30T19:28:37.463615Z ERROR test_redis: -[ERR] 0/500 Took 30.02271225s
2024-01-30T19:29:07.487376Z ERROR test_redis: -[ERR] 0/500 Took 30.023235167s
2024-01-30T19:29:19.181563Z  INFO test_redis: +[OK] 500/0 Took 11.693724583s
2024-01-30T19:29:30.463806Z  INFO test_redis: +[OK] 500/0 Took 11.281802208s

The recovery procedure after the failure took more than 2 minutes, even though the cluster itself switches to the replica almost instantly. With 1000 parallel tasks things get even more dramatic (about 3 minutes).

neogenie commented 10 months ago

@aembke

aembke commented 10 months ago

What does redis.get_replicas::<_, u64>("{KEY}").await do? Is that just shorthand for client.replicas().get("...").await?

Maybe unrelated to the recovery issues - you may also want to look into https://redis.io/commands/wait/ if you're writing to primaries and then quickly turning around and reading those values from replicas.

Either way thanks for the repro and I'll take a look soon.

neogenie commented 10 months ago

Hi @aembke !

Yes, it's a simple wrapper:

pub async fn get_replicas<K, V>(&self, key: K) -> Result<Option<V>, RedisError>
where
    K: Into<RedisKey> + Clone + Send,
    V: 'static + FromRedis + Unpin + Send,
{
    trace!("RedisConnector::get_replicas()");
    self.pool.replicas().get(key.clone()).await
}

Maybe unrelated to the recovery issues - you may also want to look into https://redis.io/commands/wait/ if you're writing to primaries and then quickly turning around and reading those values from replicas.

Exactly, unrelated. :) This code is only an example that does both a read and a write.

Either way thanks for the repro and I'll take a look soon.

Thank You!

neogenie commented 10 months ago

Stopping (stop, not pause) the replica has more dramatic consequences. The recovery procedure takes more than 10 minutes (!!!), and it doesn't look like it will ever complete...

2024-02-01T11:38:43.927741Z  INFO test_redis: +[OK] 1000/0 Took 47.428167ms
2024-02-01T11:38:44.275736Z  WARN fred::router::responses: fred-lHDr936vur: Ending replica reader task from 127.0.0.1:7005 due to None    
2024-02-01T11:38:44.281920Z  WARN fred::router::responses: fred-LLOQom3aks: Ending replica reader task from 127.0.0.1:7005 due to None    
2024-02-01T11:38:44.281990Z  WARN fred::router::responses: fred-eQbTM2H9wT: Ending replica reader task from 127.0.0.1:7005 due to None    
2024-02-01T11:38:44.282262Z  WARN fred::router::responses: fred-9ArMyREOjA: Ending replica reader task from 127.0.0.1:7005 due to None    
2024-02-01T11:38:44.282407Z  WARN fred::router::responses: fred-B0KaZhe4jU: Ending replica reader task from 127.0.0.1:7005 due to None    
2024-02-01T11:38:44.282558Z  WARN fred::router::responses: fred-rLjpIOhB3W: Ending replica reader task from 127.0.0.1:7005 due to None    
2024-02-01T11:39:14.450113Z ERROR test_redis: -[ERR] 715/285 Took 30.069008s
2024-02-01T11:39:44.466894Z ERROR test_redis: -[ERR] 0/1000 Took 30.015474667s
2024-02-01T11:40:14.480707Z ERROR test_redis: -[ERR] 0/1000 Took 30.013087916s
2024-02-01T11:40:44.494171Z ERROR test_redis: -[ERR] 0/1000 Took 30.012756333s
2024-02-01T11:41:14.534803Z ERROR test_redis: -[ERR] 0/1000 Took 30.01477575s
2024-02-01T11:41:44.553189Z ERROR test_redis: -[ERR] 0/1000 Took 30.013585417s
2024-02-01T11:42:14.565962Z ERROR test_redis: -[ERR] 0/1000 Took 30.011364542s
2024-02-01T11:42:44.580175Z ERROR test_redis: -[ERR] 0/1000 Took 30.013245708s
2024-02-01T11:43:14.595038Z ERROR test_redis: -[ERR] 0/1000 Took 30.01393975s
2024-02-01T11:43:52.643292Z ERROR test_redis: -[ERR] 20/980 Took 38.047119417s
2024-02-01T11:44:42.708879Z ERROR test_redis: -[ERR] 163/837 Took 50.064030334s
2024-02-01T11:45:12.745757Z ERROR test_redis: -[ERR] 85/915 Took 30.035667875s
2024-02-01T11:45:42.764750Z ERROR test_redis: -[ERR] 136/864 Took 30.017902125s
2024-02-01T11:46:12.787971Z ERROR test_redis: -[ERR] 127/873 Took 30.022322584s
2024-02-01T11:46:42.807797Z ERROR test_redis: -[ERR] 110/890 Took 30.018946375s
2024-02-01T11:47:12.829260Z ERROR test_redis: -[ERR] 79/921 Took 30.020552875s
2024-02-01T11:47:42.849937Z ERROR test_redis: -[ERR] 99/901 Took 30.019773458s
2024-02-01T11:48:12.868937Z ERROR test_redis: -[ERR] 143/857 Took 30.018112042s
2024-02-01T11:48:42.890067Z ERROR test_redis: -[ERR] 113/887 Took 30.020275458s
aembke commented 10 months ago

This should be fixed in https://github.com/aembke/fred.rs/pull/213.

In my repro the cluster took about 20 seconds to notice and promote the replica node, so make sure to tune the timeout and retry attempt counts accordingly. The values I used are here: https://github.com/aembke/fred.rs/blob/62eef17be62735d48c3ec114ef219ba852f5fe45/bin/replica_consistency/src/main.rs#L89-L124

neogenie commented 10 months ago

Hi, @aembke ! Great job! Thank You!

Looks like the issue is fixed, BUT the failover procedure is broken again...

2024-02-05T06:16:49.297639Z  INFO test_redis: +[OK] 1000/0 Took 51.926417ms
2024-02-05T06:16:54.829474Z  WARN fred::router::utils: fred-YEIuoVbipl: Unresponsive connection to 127.0.0.1:7006 after 5.029120292s    
2024-02-05T06:16:54.829474Z  WARN fred::router::utils: fred-ETIAqyuCud: Unresponsive connection to 127.0.0.1:7006 after 5.028842167s    
2024-02-05T06:16:54.829716Z  WARN fred::router::utils: fred-koORDGrZKf: Unresponsive connection to 127.0.0.1:7006 after 5.029001917s    
2024-02-05T06:16:54.830613Z  WARN fred::router::responses: fred-YEIuoVbipl: Ending reader task from 127.0.0.1:7006 due to Some(Redis Error - kind: IO, details: Unresponsive connection.)    
2024-02-05T06:16:54.831580Z  WARN fred::router::responses: fred-ETIAqyuCud: Ending reader task from 127.0.0.1:7006 due to Some(Redis Error - kind: IO, details: Unresponsive connection.)    
2024-02-05T06:16:54.832018Z  WARN fred::router::utils: fred-7idWtPr5WG: Unresponsive connection to 127.0.0.1:7006 after 5.029236042s    
2024-02-05T06:16:54.832036Z  WARN fred::router::responses: fred-koORDGrZKf: Ending reader task from 127.0.0.1:7006 due to Some(Redis Error - kind: IO, details: Unresponsive connection.)    
2024-02-05T06:16:54.832751Z  WARN fred::router::responses: fred-7idWtPr5WG: Ending reader task from 127.0.0.1:7006 due to Some(Redis Error - kind: IO, details: Unresponsive connection.)    
2024-02-05T06:16:54.833551Z  WARN fred::router::utils: fred-HI2RLD29uA: Unresponsive connection to 127.0.0.1:7006 after 5.029117166s    
2024-02-05T06:16:54.833886Z  WARN fred::router::responses: fred-HI2RLD29uA: Ending reader task from 127.0.0.1:7006 due to Some(Redis Error - kind: IO, details: Unresponsive connection.)    
2024-02-05T06:16:54.841037Z  WARN fred::router::utils: fred-EONp3Bs9YS: Unresponsive connection to 127.0.0.1:7006 after 5.041559583s    
2024-02-05T06:16:54.841288Z  WARN fred::router::responses: fred-EONp3Bs9YS: Ending reader task from 127.0.0.1:7006 due to Some(Redis Error - kind: IO, details: Unresponsive connection.)    
2024-02-05T06:16:54.891370Z  WARN fred::router::responses: fred-koORDGrZKf: Ending reader task from 127.0.0.1:7002 due to None    
2024-02-05T06:16:54.892535Z  WARN fred::router::responses: fred-koORDGrZKf: Ending reader task from 127.0.0.1:7001 due to None    
2024-02-05T06:16:54.904923Z  WARN fred::router::responses: fred-EONp3Bs9YS: Ending reader task from 127.0.0.1:7001 due to None    
2024-02-05T06:16:54.905680Z  WARN fred::router::responses: fred-EONp3Bs9YS: Ending reader task from 127.0.0.1:7002 due to None    
2024-02-05T06:16:54.905824Z  WARN fred::router::responses: fred-YEIuoVbipl: Ending reader task from 127.0.0.1:7001 due to None    
2024-02-05T06:16:54.906428Z  WARN fred::router::responses: fred-YEIuoVbipl: Ending reader task from 127.0.0.1:7002 due to None    
2024-02-05T06:16:54.915022Z  WARN fred::router::responses: fred-ETIAqyuCud: Ending reader task from 127.0.0.1:7001 due to None    
2024-02-05T06:16:54.915420Z  WARN fred::router::responses: fred-ETIAqyuCud: Ending reader task from 127.0.0.1:7002 due to None    
2024-02-05T06:16:54.959773Z  WARN fred::router::responses: fred-HI2RLD29uA: Ending reader task from 127.0.0.1:7002 due to None    
2024-02-05T06:16:54.960130Z  WARN fred::router::responses: fred-HI2RLD29uA: Ending reader task from 127.0.0.1:7001 due to None    
2024-02-05T06:16:54.969813Z  WARN fred::router::responses: fred-7idWtPr5WG: Ending reader task from 127.0.0.1:7002 due to None    
2024-02-05T06:16:54.970270Z  WARN fred::router::responses: fred-7idWtPr5WG: Ending reader task from 127.0.0.1:7001 due to None    
2024-02-05T06:16:57.311220Z  INFO test_redis: +[OK] 1000/0 Took 7.563554666s
2024-02-05T06:16:57.345472Z  INFO test_redis: +[OK] 1000/0 Took 34.185125ms
2024-02-05T06:16:57.375784Z  INFO test_redis: +[OK] 1000/0 Took 30.25375ms
2024-02-05T06:16:57.406351Z  INFO test_redis: +[OK] 1000/0 Took 30.506083ms
2024-02-05T06:16:57.440941Z  INFO test_redis: +[OK] 1000/0 Took 34.528125ms
2024-02-05T06:16:57.476054Z  INFO test_redis: +[OK] 1000/0 Took 35.057291ms
2024-02-05T06:16:57.505697Z  INFO test_redis: +[OK] 1000/0 Took 29.588584ms
2024-02-05T06:16:57.536814Z  INFO test_redis: +[OK] 1000/0 Took 31.061334ms
2024-02-05T06:16:57.562554Z  INFO test_redis: +[OK] 1000/0 Took 25.682541ms
2024-02-05T06:16:57.589706Z  INFO test_redis: +[OK] 1000/0 Took 27.090875ms
2024-02-05T06:16:57.617016Z  INFO test_redis: +[OK] 1000/0 Took 27.244083ms
2024-02-05T06:16:57.645447Z  INFO test_redis: +[OK] 1000/0 Took 28.374875ms
2024-02-05T06:16:57.672728Z  INFO test_redis: +[OK] 1000/0 Took 27.205ms
2024-02-05T06:16:57.697096Z  INFO test_redis: +[OK] 1000/0 Took 24.306917ms
2024-02-05T06:16:57.722746Z  INFO test_redis: +[OK] 1000/0 Took 25.587ms
2024-02-05T06:16:57.748388Z  INFO test_redis: +[OK] 1000/0 Took 25.583375ms
2024-02-05T06:16:57.776261Z  INFO test_redis: +[OK] 1000/0 Took 27.812375ms
2024-02-05T06:16:58.292781Z  INFO test_redis: +[OK] 1000/0 Took 45.68775ms
2024-02-05T06:16:58.788668Z  INFO test_redis: +[OK] 1000/0 Took 42.563666ms
2024-02-05T06:16:59.280182Z  INFO test_redis: +[OK] 1000/0 Took 34.642209ms
2024-02-05T06:16:59.793809Z  INFO test_redis: +[OK] 1000/0 Took 48.0295ms
2024-02-05T06:17:00.293328Z  INFO test_redis: +[OK] 1000/0 Took 47.276042ms
2024-02-05T06:17:00.795847Z  INFO test_redis: +[OK] 1000/0 Took 46.226875ms
2024-02-05T06:17:01.290629Z  INFO test_redis: +[OK] 1000/0 Took 44.24875ms
2024-02-05T06:17:01.796743Z  INFO test_redis: +[OK] 1000/0 Took 51.494084ms
2024-02-05T06:17:02.298013Z  INFO test_redis: +[OK] 1000/0 Took 51.381917ms
2024-02-05T06:17:02.794077Z  INFO test_redis: +[OK] 1000/0 Took 45.53975ms
2024-02-05T06:17:03.292757Z  INFO test_redis: +[OK] 1000/0 Took 45.195833ms
2024-02-05T06:17:03.788030Z  INFO test_redis: +[OK] 1000/0 Took 41.859791ms
2024-02-05T06:17:04.297260Z  INFO test_redis: +[OK] 1000/0 Took 52.14725ms
2024-02-05T06:17:04.811522Z  INFO test_redis: +[OK] 1000/0 Took 55.847625ms
2024-02-05T06:17:05.312864Z  INFO test_redis: +[OK] 1000/0 Took 66.387709ms
2024-02-05T06:17:05.815741Z  INFO test_redis: +[OK] 1000/0 Took 64.425625ms
2024-02-05T06:17:06.300117Z  INFO test_redis: +[OK] 1000/0 Took 54.191417ms
2024-02-05T06:17:06.807931Z  INFO test_redis: +[OK] 1000/0 Took 61.80575ms
2024-02-05T06:17:07.306839Z  INFO test_redis: +[OK] 1000/0 Took 59.837208ms
2024-02-05T06:17:07.809151Z  INFO test_redis: +[OK] 1000/0 Took 60.88275ms
2024-02-05T06:17:08.300369Z  INFO test_redis: +[OK] 1000/0 Took 51.3125ms
2024-02-05T06:17:08.798675Z  INFO test_redis: +[OK] 1000/0 Took 52.130667ms
2024-02-05T06:17:09.300353Z  INFO test_redis: +[OK] 1000/0 Took 53.593417ms
2024-02-05T06:17:09.792797Z  INFO test_redis: +[OK] 1000/0 Took 47.176667ms
2024-02-05T06:17:10.298523Z  INFO test_redis: +[OK] 1000/0 Took 52.714083ms
2024-02-05T06:17:10.799380Z  INFO test_redis: +[OK] 1000/0 Took 52.777583ms
2024-02-05T06:17:11.299379Z  INFO test_redis: +[OK] 1000/0 Took 51.69ms
2024-02-05T06:17:11.800617Z  INFO test_redis: +[OK] 1000/0 Took 54.530583ms
2024-02-05T06:17:12.293402Z  INFO test_redis: +[OK] 1000/0 Took 48.393708ms
2024-02-05T06:17:12.800638Z  INFO test_redis: +[OK] 1000/0 Took 54.106834ms
2024-02-05T06:17:13.294047Z  INFO test_redis: +[OK] 1000/0 Took 47.259541ms
2024-02-05T06:17:13.799640Z  INFO test_redis: +[OK] 1000/0 Took 53.710583ms
2024-02-05T06:17:14.298008Z  INFO test_redis: +[OK] 1000/0 Took 52.323167ms
2024-02-05T06:17:14.810454Z  INFO test_redis: +[OK] 1000/0 Took 62.208584ms
2024-02-05T06:17:15.290926Z  INFO test_redis: +[OK] 1000/0 Took 44.847334ms
2024-02-05T06:17:15.794860Z  INFO test_redis: +[OK] 1000/0 Took 49.138834ms
2024-02-05T06:17:16.284327Z  INFO test_redis: +[OK] 1000/0 Took 38.358834ms
2024-02-05T06:17:16.807748Z  INFO test_redis: +[OK] 1000/0 Took 61.973375ms
2024-02-05T06:17:17.305323Z  INFO test_redis: +[OK] 1000/0 Took 58.564125ms
2024-02-05T06:17:17.792644Z  INFO test_redis: +[OK] 1000/0 Took 47.008584ms
2024-02-05T06:17:18.288805Z  INFO test_redis: +[OK] 1000/0 Took 42.868792ms
2024-02-05T06:17:18.778868Z  INFO test_redis: +[OK] 1000/0 Took 32.866958ms
2024-02-05T06:17:19.294915Z  INFO test_redis: +[OK] 1000/0 Took 48.823875ms
2024-02-05T06:17:19.794890Z  INFO test_redis: +[OK] 1000/0 Took 48.395791ms
2024-02-05T06:17:20.294193Z  INFO test_redis: +[OK] 1000/0 Took 48.277792ms
2024-02-05T06:17:20.792790Z  INFO test_redis: +[OK] 1000/0 Took 47.009542ms
2024-02-05T06:17:21.294170Z  INFO test_redis: +[OK] 1000/0 Took 49.004208ms
2024-02-05T06:17:21.792329Z  INFO test_redis: +[OK] 1000/0 Took 47.100459ms
2024-02-05T06:17:22.288217Z  INFO test_redis: +[OK] 1000/0 Took 41.982458ms
2024-02-05T06:17:22.795953Z  INFO test_redis: +[OK] 1000/0 Took 50.392041ms
2024-02-05T06:17:23.327127Z  INFO test_redis: +[OK] 1000/0 Took 78.193875ms
2024-02-05T06:17:23.789954Z  INFO test_redis: +[OK] 1000/0 Took 43.816333ms
2024-02-05T06:17:24.297945Z  INFO test_redis: +[OK] 1000/0 Took 50.638292ms
2024-02-05T06:17:24.812503Z  INFO test_redis: +[OK] 1000/0 Took 64.797084ms
2024-02-05T06:17:25.293402Z  INFO test_redis: +[OK] 1000/0 Took 46.272958ms
2024-02-05T06:17:25.791725Z  INFO test_redis: +[OK] 1000/0 Took 46.1095ms
2024-02-05T06:17:26.299810Z  INFO test_redis: +[OK] 1000/0 Took 53.743166ms
2024-02-05T06:17:26.797863Z  INFO test_redis: +[OK] 1000/0 Took 52.334166ms
2024-02-05T06:17:27.301204Z  INFO test_redis: +[OK] 1000/0 Took 53.56725ms
2024-02-05T06:17:27.795943Z  INFO test_redis: +[OK] 1000/0 Took 49.924542ms
2024-02-05T06:17:28.311294Z  INFO test_redis: +[OK] 1000/0 Took 65.154166ms
2024-02-05T06:17:28.801559Z  INFO test_redis: +[OK] 1000/0 Took 54.586166ms
2024-02-05T06:17:29.291364Z  INFO test_redis: +[OK] 1000/0 Took 45.245584ms
2024-02-05T06:17:29.807761Z  INFO test_redis: +[OK] 1000/0 Took 61.308292ms
2024-02-05T06:17:30.296979Z  INFO test_redis: +[OK] 1000/0 Took 50.437125ms
2024-02-05T06:17:35.784839Z  WARN fred::router::responses: fred-7idWtPr5WG: Ending reader task from 127.0.0.1:7005 due to None    
2024-02-05T06:17:35.785277Z  WARN fred::router::responses: fred-YEIuoVbipl: Ending reader task from 127.0.0.1:7005 due to None    
2024-02-05T06:17:35.785878Z  WARN fred::router::responses: fred-koORDGrZKf: Ending reader task from 127.0.0.1:7005 due to None    
2024-02-05T06:17:35.787715Z  WARN fred::router::responses: fred-ETIAqyuCud: Ending reader task from 127.0.0.1:7005 due to None    
2024-02-05T06:17:35.789610Z  WARN fred::router::responses: fred-EONp3Bs9YS: Ending reader task from 127.0.0.1:7005 due to None    
2024-02-05T06:17:35.794223Z  WARN fred::router::responses: fred-HI2RLD29uA: Ending reader task from 127.0.0.1:7005 due to None    
2024-02-05T06:18:25.831530Z ERROR test_redis: -[ERR] 0/1000 Took 55.084402084s
2024-02-05T06:18:55.851215Z ERROR test_redis: -[ERR] 0/1000 Took 30.019148125s
2024-02-05T06:19:25.866729Z ERROR test_redis: -[ERR] 0/1000 Took 30.015163875s
2024-02-05T06:19:55.890015Z ERROR test_redis: -[ERR] 0/1000 Took 30.022939083s
2024-02-05T06:20:25.912598Z ERROR test_redis: -[ERR] 0/1000 Took 30.022254958s
2024-02-05T06:20:55.944116Z ERROR test_redis: -[ERR] 0/1000 Took 30.031181583s
2024-02-05T06:21:25.961834Z ERROR test_redis: -[ERR] 0/1000 Took 30.017383417s
2024-02-05T06:21:55.904257Z ERROR test_redis: -[ERR] 0/1000 Took 30.011577542s

Scenario:

aembke commented 10 months ago

What error message are you getting on those commands? I was able to repro the last one but this works for me on main currently. Are you still using the same configuration values from your original post?

I'm using https://github.com/aembke/fred.rs/tree/main/bin/replica_consistency to test this. Are you able to repro the issue there by chance? It uses foo as the key, which always maps to redis-cluster-3 first, then fails over to redis-cluster-4.

If you use that you can get a redis-cli session going inside the docker network like this:

cd path/to/fred
source ./tests/environ
./tests/runners/docker-bash.sh

# need this again in the container, but respond with yes to build redis-cli
source ./tests/environ
fred_redis_cli -a bar -p 30001 -h redis-cluster-3

neogenie commented 10 months ago

Setting config.cluster_cache_update_delay back to its default of 0 seems to fix everything.

neogenie commented 10 months ago

100ms leads to a bunch of errors but things still recover; 5s leads to complete degradation of the library, as in the previous post.

aembke commented 10 months ago

What are those errors? Timeouts?

neogenie commented 10 months ago

Timeouts?

Yes, the 30s config.default_command_timeout.

aembke commented 10 months ago

Sounds good. Just wanted to make sure those commands weren't being dropped unexpectedly.

I'll try to get 8.0.2 published this week.

neogenie commented 9 months ago

Hi, @aembke !

Any news? :)

aembke commented 9 months ago

Just published 8.0.2, so the initial thing you reported here should be fixed.

I'll need to do a bit more thinking about cluster_cache_update_delay. Initially in the 8.0.1 fix I mentioned that I had to tune this to 5s to make it work, but the reasoning for this change was incorrect at the time.

I set that to 5s in the 8.0.1 repro to reduce the number of write attempts, since that was tuned to be about 3 at the time. If the cluster takes ~20 sec to fail over, and we need to rely on write+timeout tricks to detect a paused container, then we'd need at least ~10 attempts with an exponential backoff starting at 100ms. The 5s cluster sync delay config change was an attempt to make this work without having to change the max write attempts to some big number. This was the wrong approach though since it requires knowing how these two config options are used internally.

In this latest 8.0.2 patch I changed the retry logic to specifically not increment the write attempt counter when we're first testing replica connectivity after reconnecting to a cluster node. This should fix the tension between the max_write_attempts and cluster_cache_update_delay values, but as you note it still presents a potential footgun when combined with timeouts.

I don't necessarily think the potential conflict between timeouts and cluster_cache_update_delay is a bug at this point though since the default is 0 and 8.0.2 addresses the conflict with max_write_attempts. I'll have to do a bit more thinking on this though. Let me know if you have any thoughts.