ServiceStack / Issues

Issue Tracker for the commercial versions of ServiceStack
11 stars 8 forks source link

ServiceStack.Redis has inconsistent behavior with uri configuration and non-uri configuration #772

Closed ardove closed 2 years ago

ardove commented 2 years ago

Hi there - in an attempt to debug a strange issue we're observing with the redis client, I switched over from using a programatic SelectAsync to select the database to specifying the database on the connection URI.

ServiceStack.Redis version (from NuGet.org) - 5.13.2 Redis host version - 4.0.10

Here are the two abbreviated examples. The first, works as I'd expect it to, correctly connecting to the host and retrieving an empty value.

class Program
{
    static async Task Main(string[] args)
    {
        LogManager.LogFactory = new ConsoleLogFactory();

        Licensing.RegisterLicense("<removed>");

        var redisManagerPool = new RedisManagerPool("redisHost");

        for (int i = 0; i < 5; i++)
        {
            try
            {
                await using (IRedisClientAsync client = await redisManagerPool.GetClientAsync())
                {
                    await client.SelectAsync(7);

                    string value = await client.GetValueAsync("status");

                    Console.WriteLine($"Successfully retrieved value => '{value}'");
                }
            }
            catch (Exception ex)
            {
                Console.WriteLine($"Exception handled \n{ex}");
            }

            Console.WriteLine("Sleeping for 25 seconds to allow client to be garbage collected");
            Thread.Sleep(TimeSpan.FromSeconds(25));
        }
    }
}

And here's the output

DEBUG: New Redis Masters: redisHost:6379
DEBUG: New Redis Replicas:
DEBUG: #2 Attempting sync connection to 'redisHost:6379' (SEND -1, RECV -1 Timeouts)...
DEBUG: #2 Socket connected to 'redisHost:6379'
Successfully retrieved value => ''
Sleeping for 25 seconds to allow client to be garbage collected
DEBUG: #2 Attempting sync connection to 'redisHost:6379' (SEND -1, RECV -1 Timeouts)...
DEBUG: #2 Socket connected to 'redisHost:6379'
ERROR: #2 [04:19:45.462] No more data, sPort: 60403, LastCommand:
DEBUG: #2 SendReceive Exception: [04:19:45.462] No more data, sPort: 60403, LastCommand:
Successfully retrieved value => ''
Sleeping for 25 seconds to allow client to be garbage collected
DEBUG: #2 Attempting sync connection to 'redisHost:6379' (SEND -1, RECV -1 Timeouts)...
DEBUG: #2 Socket connected to 'redisHost:6379'
ERROR: #2 [04:20:10.953] No more data, sPort: 60414, LastCommand:
DEBUG: #2 SendReceive Exception: [04:20:10.953] No more data, sPort: 60414, LastCommand:
Successfully retrieved value => ''
Sleeping for 25 seconds to allow client to be garbage collected

The second, works the first time, logs an exception on the 2nd time through the loop, works on the 3rd iteration, logs another exception on the 4th iteration, and works on the 5th iteration.

class Program
{
    static void Main(string[] args)
    {
        LogManager.LogFactory = new ConsoleLogFactory();

        Licensing.RegisterLicense("<removed>");

        var redisManagerPool = new RedisManagerPool("redis://redisHost?db=7");

        for (int i = 0; i < 5; i++)
        {
            try
            {
                using (IRedisClient client = redisManagerPool.GetClient())
                {
                    string value = client.GetValue("status");

                    Console.WriteLine($"Successfully retrieved value => '{value}'");
                }
            }
            catch (Exception ex)
            {
                Console.WriteLine($"Exception handled \n{ex}");
            }

            Console.WriteLine("Sleeping for 25 seconds to allow client to be garbage collected");
            Thread.Sleep(TimeSpan.FromSeconds(25));
        }
    }
}

And here's the output

DEBUG: New Redis Masters: redisHost:6379
DEBUG: New Redis Replicas:
DEBUG: #2 Attempting async connection to 'redisHost:6379' (SEND -1, RECV -1 Timeouts)...
DEBUG: #2 Socket connected to 'redisHost:6379'
Successfully retrieved value => ''
Sleeping for 25 seconds to allow client to be garbage collected
ERROR: #2 Zero length response
DEBUG: #2 SendReceive Exception: Zero length response
Exception handled
ServiceStack.Redis.RedisResponseException: Zero length response
   at ServiceStack.Redis.RedisNativeClient.ParseSingleLine(String r) in C:\BuildAgent\work\b2a0bfe2b1c9a118\src\ServiceStack.Redis\RedisNativeClient_Utils.cs:line 1018
   at ServiceStack.Redis.RedisNativeClient.ReadData() in C:\BuildAgent\work\b2a0bfe2b1c9a118\src\ServiceStack.Redis\RedisNativeClient_Utils.cs:line 1010
   at ServiceStack.Redis.RedisNativeClient.SendReceive[T](Byte[][] cmdWithBinaryArgs, Func`1 fn, Action`1 completePipelineFn, Boolean sendWithoutRead) in C:\BuildAgent\work\b2a0bfe2b1c9a118\src\ServiceStack.Redis\RedisNativeClient_Utils.cs:line 689
   at ServiceStack.Redis.RedisNativeClient.SendExpectData(Byte[][] cmdWithBinaryArgs) in C:\BuildAgent\work\b2a0bfe2b1c9a118\src\ServiceStack.Redis\RedisNativeClient_Utils.cs:line 791
   at ServiceStack.Redis.RedisNativeClient.GetBytes(String key) in C:\BuildAgent\work\b2a0bfe2b1c9a118\src\ServiceStack.Redis\RedisNativeClient.cs:line 582
   at ServiceStack.Redis.RedisNativeClient.Get(String key) in C:\BuildAgent\work\b2a0bfe2b1c9a118\src\ServiceStack.Redis\RedisNativeClient.cs:line 553
   at ServiceStack.Redis.RedisClient.GetValue(String key) in C:\BuildAgent\work\b2a0bfe2b1c9a118\src\ServiceStack.Redis\RedisClient.cs:line 262
   at ConsoleApp3.Program.Main(String[] args) in C:\Users\user\source\repos\ConsoleApp3\ConsoleApp3\Program.cs:line 25
Sleeping for 25 seconds to allow client to be garbage collected
DEBUG: #2 SendReceive Exception: An established connection was aborted by the software in your host machine.
ERROR: #2 SocketException in SendReceive, retrying..., Exception: An established connection was aborted by the software in your host machine.
DEBUG: #2 Attempting async connection to 'redisHost:6379' (SEND -1, RECV -1 Timeouts)...
DEBUG: #2 Socket connected to 'redisHost:6379'
DEBUG: #3 Attempting async connection to 'redisHost:6379' (SEND -1, RECV -1 Timeouts)...
DEBUG: #3 Socket connected to 'redisHost:6379'
Successfully retrieved value => ''
Sleeping for 25 seconds to allow client to be garbage collected

I'd expect both of these samples to work exactly the same, but I've come to learn that this isn't the case.

In case you're curious about the original exception I'm seeing, here's an abbreviated stack trace. It's not quite clear what that means and it only happens infrequently. Yes, I've validated that all of the places where we use an IRedisClientAsync are correctly disposed, and that this particular call only fails sometimes.

We were considering turning up the number of connections allowed in the pool but I can't tell if that's going to help the problem or not.

ServiceStack.Redis.RedisException: Exceeded timeout of 00:00:10
 ---> ServiceStack.Redis.RedisRetryableException: [18:44:31.749] No more data, sPort: 59678, LastCommand: 
   at ServiceStack.Redis.RedisNativeClient.CreateRetryableResponseError(String error) in C:\BuildAgent\work\b2a0bfe2b1c9a118\src\ServiceStack.Redis\RedisNativeClient_Utils.cs:line 375
   at ServiceStack.Redis.RedisNativeClient.ExpectSuccessAsync(CancellationToken token) in C:\BuildAgent\work\b2a0bfe2b1c9a118\src\ServiceStack.Redis\RedisNativeClient_Utils.Async.cs:line 510
   at ServiceStack.Redis.RedisNativeClient.<ExpectSuccessFnAsync>g__Awaited|639_0(ValueTask pending) in C:\BuildAgent\work\b2a0bfe2b1c9a118\src\ServiceStack.Redis\RedisNativeClient_Utils.Async.cs:line 494
   at ServiceStack.Redis.RedisNativeClient.SendReceiveAsync[T](Byte[][] cmdWithBinaryArgs, Func`2 fn, CancellationToken token, Action`1 completePipelineFn, Boolean sendWithoutRead) in C:\BuildAgent\work\b2a0bfe2b1c9a118\src\ServiceStack.Redis\RedisNativeClient_Utils.Async.cs:line 210
   --- End of inner exception stack trace ---
   at ServiceStack.Redis.RedisNativeClient.SendReceiveAsync[T](Byte[][] cmdWithBinaryArgs, Func`2 fn, CancellationToken token, Action`1 completePipelineFn, Boolean sendWithoutRead) in C:\BuildAgent\work\b2a0bfe2b1c9a118\src\ServiceStack.Redis\RedisNativeClient_Utils.Async.cs:line 237
   at ServiceStack.Redis.Internal.ValueTask_Utils.<Await>g__Awaited|0_0[T](ValueTask`1 pending) in C:\BuildAgent\work\b2a0bfe2b1c9a118\src\ServiceStack.Redis\ValueTask_Utils.Async.cs:line 23
mythz commented 2 years ago

How reproducible is this, I've ran this a few times on both localhost and a remote Redis Server in a different continent and all times runs without issue:

LogManager.LogFactory = new ConsoleLogFactory();

var redisManagerPool = new RedisManagerPool("redis://web-templates.io?db=7");
for (int i = 0; i < 5; i++)
{
    try
    {
        using var client = redisManagerPool.GetClient();
        string value = client.GetValue("status");

        Console.WriteLine($"Successfully retrieved value => '{value}'");
    }
    catch (Exception ex)
    {
        Console.WriteLine($"Exception handled \n{ex}");
    }

    Console.WriteLine("Sleeping for 25 seconds to allow client to be garbage collected");
    Thread.Sleep(TimeSpan.FromSeconds(25));
}            

Output:

DEBUG: New Redis Masters: web-templates.io:6379
DEBUG: New Redis Replicas:
DEBUG: #2 Attempting sync connection to 'web-templates.io:6379' (SEND -1, RECV -1 Timeouts)...
DEBUG: #2 Socket connected to 'web-templates.io:6379'
Successfully retrieved value => ''
Sleeping for 25 seconds to allow client to be garbage collected
Successfully retrieved value => ''
Sleeping for 25 seconds to allow client to be garbage collected
Successfully retrieved value => ''
Sleeping for 25 seconds to allow client to be garbage collected
Successfully retrieved value => ''
Sleeping for 25 seconds to allow client to be garbage collected
Successfully retrieved value => ''
Sleeping for 25 seconds to allow client to be garbage collected

Can you reproduce this when connecting to web-templates.io as above? if not can you provide a reproducible example I can run that repros the issue against the remote web-templates.io instance.

The RedisException: Exceeded timeout of 00:00:10 exception suggests it failed to try to establish a TCP connection to your Redis host within the specified RedisConfig.DefaultRetryTimeout of 10,000ms but this StackTrace came from an async call, i.e. not from the sample code above.

ardove commented 2 years ago

It reproduces every single time for me against our (internal) redis host. The only thing I can think is different is the redis host version.

I don't see any documentation on host/client version compatibility, but, like I mentioned, we're running a relatively old version (4.0.10).

The original exception comes from code that is identical to the first sample I provided (which is also async and doesn't have the problem I'm describing).

mythz commented 2 years ago

We wont be able to investigate further without a reproducible issue we can run locally to repro the issue so that it's isolated from any local environment or network issues, which we'd need to be able to identify and resolve any issue.

IMO it's unlikely, but If you think it's related to redis version, try to see if you can repro the issue with a newer redis-server running in your same environment.

ardove commented 2 years ago

I think you're right that it's not specific to the redis version. As it turns out, I think this is actually related something with the way that our host and the client are handling connections.

For our host, idle connections disappear from CLIENT LIST after about 20 seconds, but I haven't seen this naturally occur with your host. I tried killing a host manually using CLIENT KILL ID <id number>, but that doesn't reproduce it on your host.

After the 20 second window on your host, I see that the connection associated with the get command is still open

web-templates.io:6379> client list
id=13 addr=73.202.100.102:54599 fd=7 name= age=222 idle=222 flags=N db=7 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=get
id=14 addr=73.202.100.102:54649 fd=8 name= age=1 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=26 qbuf-free=32742 obl=0 oll=0 omem=0 events=r cmd=client

But on ours, I don't.

I'm not a redis expert, so I'm not even sure what else to look at. Here's output of the INFO command in case that means anything to you.

# Server
redis_version:4.0.10
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:b0dc8fd5e351adf8
redis_mode:standalone
os:Linux 4.14.256-197.484.amzn2.x86_64 x86_64
arch_bits:64
multiplexing_api:epoll
atomicvar_api:atomic-builtin
gcc_version:7.3.1
process_id:2235
run_id:1ebbaf8eef36a71b2efdee3f4d7da1b6196c800b
tcp_port:6379
uptime_in_seconds:3061090
uptime_in_days:35
hz:10
lru_clock:15219296
executable:/usr/bin/redis-server
config_file:/etc/redis.conf

# Clients
connected_clients:2
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0

# Memory
used_memory:873376
used_memory_human:852.91K
used_memory_rss:5230592
used_memory_rss_human:4.99M
used_memory_peak:6857024
used_memory_peak_human:6.54M
used_memory_peak_perc:12.74%
used_memory_overhead:853680
used_memory_startup:786584
used_memory_dataset:19696
used_memory_dataset_perc:22.69%
total_system_memory:8052834304
total_system_memory_human:7.50G
used_memory_lua:37888
used_memory_lua_human:37.00K
maxmemory:0
maxmemory_human:0B
maxmemory_policy:allkeys-lru
mem_fragmentation_ratio:5.99
mem_allocator:jemalloc-4.0.3
active_defrag_running:0
lazyfree_pending_objects:0

# Persistence
loading:0
rdb_changes_since_last_save:3
rdb_bgsave_in_progress:0
rdb_last_save_time:1642608778
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:0
rdb_current_bgsave_time_sec:-1
rdb_last_cow_size:475136
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_last_cow_size:0

# Stats
total_connections_received:8562
total_commands_processed:132883
instantaneous_ops_per_sec:0
total_net_input_bytes:29741881
total_net_output_bytes:3000913
instantaneous_input_kbps:0.01
instantaneous_output_kbps:1.74
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:4334
expired_stale_perc:0.00
expired_time_cap_reached_count:0
evicted_keys:0
keyspace_hits:2361
keyspace_misses:24852
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:150
migrate_cached_sockets:0
slave_expires_tracked_keys:0
active_defrag_hits:0
active_defrag_misses:0
active_defrag_key_hits:0
active_defrag_key_misses:0

# Replication
role:master
connected_slaves:0
master_replid:fda3961fb8ec4119f9fea227bed37a39493fa36c
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:0
second_repl_offset:-1
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0

# CPU
used_cpu_sys:725.75
used_cpu_user:407.25
used_cpu_sys_children:0.23
used_cpu_user_children:0.72

# Cluster
cluster_enabled:0

# Keyspace
db0:keys=4,expires=0,avg_ttl=0
db7:keys=4,expires=4,avg_ttl=3158941

While I appreciate your need for a repro, this is obviously happening in our environment or I wouldn't have filed an issue. Like I said, I'm not a redis expert so I don't even know where to look to give you differences in our environment. You are a redis expert, so, are there any other steps that I can take to give you information that you need?

ardove commented 2 years ago

OK, I think I may have found the problem. Our redis configuration specifies a timeout property of 20. Yours specifies a timeout property of 0. I can only assume this is for security reasons (while I didn't configure this host of ours, a neverending client timeout just seems bad).

Here's the output of our config

1) dbfilename
2) dump.rdb
3) requirepass
4)
5) masterauth
6)
7) cluster-announce-ip
8)
9) unixsocket
10)
11) logfile
12) /var/log/redis/redis.log
13) pidfile
14) /var/run/redis_6379.pid
15) slave-announce-ip
16)
17) maxmemory
18) 0
19) proto-max-bulk-len
20) 536870912
21) client-query-buffer-limit
22) 1073741824
23) maxmemory-samples
24) 5
25) lfu-log-factor
26) 10
27) lfu-decay-time
28) 1
29) timeout
30) 20
31) active-defrag-threshold-lower
32) 10
33) active-defrag-threshold-upper
34) 100
35) active-defrag-ignore-bytes
36) 104857600
37) active-defrag-cycle-min
38) 25
39) active-defrag-cycle-max
40) 75
41) auto-aof-rewrite-percentage
42) 100
43) auto-aof-rewrite-min-size
44) 67108864
45) hash-max-ziplist-entries
46) 512
47) hash-max-ziplist-value
48) 64
49) list-max-ziplist-size
50) -2
51) list-compress-depth
52) 0
53) set-max-intset-entries
54) 512
55) zset-max-ziplist-entries
56) 128
57) zset-max-ziplist-value
58) 64
59) hll-sparse-max-bytes
60) 3000
61) lua-time-limit
62) 5000
63) slowlog-log-slower-than
64) 10000
65) latency-monitor-threshold
66) 0
67) slowlog-max-len
68) 128
69) port
70) 6379
71) cluster-announce-port
72) 0
73) cluster-announce-bus-port
74) 0
75) tcp-backlog
76) 511
77) databases
78) 16
79) repl-ping-slave-period
80) 10
81) repl-timeout
82) 60
83) repl-backlog-size
84) 1048576
85) repl-backlog-ttl
86) 3600
87) maxclients
88) 10000
89) watchdog-period
90) 0
91) slave-priority
92) 100
93) slave-announce-port
94) 0
95) min-slaves-to-write
96) 0
97) min-slaves-max-lag
98) 10
99) hz
100) 10
101) cluster-node-timeout
102) 15000
103) cluster-migration-barrier
104) 1
105) cluster-slave-validity-factor
106) 10
107) repl-diskless-sync-delay
108) 5
109) tcp-keepalive
110) 300
111) cluster-require-full-coverage
112) yes
113) cluster-slave-no-failover
114) no
115) no-appendfsync-on-rewrite
116) no
117) slave-serve-stale-data
118) yes
119) slave-read-only
120) yes
121) stop-writes-on-bgsave-error
122) yes
123) daemonize
124) no
125) rdbcompression
126) yes
127) rdbchecksum
128) yes
129) activerehashing
130) yes
131) activedefrag
132) no
133) protected-mode
134) no
135) repl-disable-tcp-nodelay
136) no
137) repl-diskless-sync
138) no
139) aof-rewrite-incremental-fsync
140) yes
141) aof-load-truncated
142) yes
143) aof-use-rdb-preamble
144) no
145) lazyfree-lazy-eviction
146) no
147) lazyfree-lazy-expire
148) no
149) lazyfree-lazy-server-del
150) no
151) slave-lazy-flush
152) no
153) maxmemory-policy
154) allkeys-lru
155) loglevel
156) notice
157) supervised
158) systemd
159) appendfsync
160) everysec
161) syslog-facility
162) local0
163) appendonly
164) no
165) dir
166) /var/lib/redis
167) save
168) 900 1 300 10 60 10000
169) client-output-buffer-limit
170) normal 0 0 0 slave 268435456 67108864 60 pubsub 33554432 8388608 60
171) unixsocketperm
172) 0
173) slaveof
174)
175) notify-keyspace-events
176)
177) bind
178)

I set the config to 20 on your host and it produces new debug logging, but no exception.

DEBUG: #2 Attempting sync connection to 'web-templates.io:6379' (SEND -1, RECV -1 Timeouts)...
DEBUG: #2 Socket connected to 'web-templates.io:6379'
ERROR: #2 [16:37:48.252] No more data, sPort: 52966, LastCommand:
DEBUG: #2 SendReceive Exception: [16:37:48.252] No more data, sPort: 52966, LastCommand:

I don't have another host in my environment that runs v5.x of the redis host, but I'm willing to wager this is a nuance with the way that client connections are managed by the host.

mythz commented 2 years ago

web-templates.io redis-server just uses the default configuration, which I've only modified to allow connections to remote hosts. If the defaults are going to be changed to be more restrictive then you're also going to experience consequences of having restricted them as 20 seconds for idle connections is very low for idle connections where it gets close to defeating the purpose of using a connection pool if all idle connections are going to be indiscriminately killed soon after by the server.

Whenever defaults are changed you need to also propagate the necessary changes to any of its client libraries e.g. RedisConfig.DefaultIdleTimeOutSecs (default 240s) should always be set to a value lower than the redis server timeout.

Note: We only have control over and are able to support issues with our C# libraries and why we need repro's demonstrating the issue with the client library and not the environment it's used in, e.g. the .RedisException: Exceeded timeout of 00:00:10 happens when the Redis client is unable to establish a TCP connection with the Redis Server, it can't do anything else at that point except keep auto retrying which it does up to RedisConfig.DefaultRetryTimeout (default 10s) or throw the Redis Timeout Exception seen in the stacktrace when it's exceeded.

ardove commented 2 years ago

Yeah, I understand the need for a repro, but I did give you one that repro'd for me CONSISTENTLY. There are ALWAYS going to be different environments that YOUR client are going to be used within, so I think you could both document the fact that host changes need to be propagated to clients AND be a little more supportive of people who are experiencing issues. Keep in mind that not everyone has default configurations and not everyone has the capability to modify redis configuration. We have many legacy systems that are risky to change for a variety of reasons.

There are also plenty of GOOD reasons to set a low timeout. According to the redis documentation, "there are two conditions when it makes sense to set a timeout:

We pay you money every year to use your "premium" library and to be frank, the interaction I had with you over the past few days left a really bad taste in my mouth. At one point, I even considered switching to a different library.

I realize you probably get thousands of issues per day, but you also need to realize that we are real developers dealing with real production issues using YOUR client EVERY day.

mythz commented 2 years ago

Sure set a timeout, but if it's more restrictive you'd need to check how it will affect client libraries using it, especially if you're using connection pooling which becomes less useful the lower the timeout where eventually you'll be better off not using any connection pooling by switching to use BasicRedisClientManager.

We've always required needing to have some way to be able to repro any issues, it's essentially our only requirement as trying to resolve issues you can't reproduce in isolation in environments you don't have access to is an exercise in futility that's infeasible when the turnaround time to blindly predict & creating custom library builds to test is measured in hours/days. I don't mind that your repro didn't repro the issue outside your environment, but I do need a repro I can repro in order to investigate further which is why I made our redis-server remotely available for you to have a default redis-server to test against.

Having some way to repro issues will always be a requirement, if that's unsatisfactory then I would also recommend switching to a different library or support contract with the level of support you want.