buraksezer / olric

Distributed in-memory object store. It can be used as an embedded Go library and a language-independent service.
Apache License 2.0

Potential lock contention: Coordinator is not able to publish routing table to new node, which causes it to fail bootstrapping #254

Open zhp007 opened 1 month ago

zhp007 commented 1 month ago

After further investigation of the issue in https://github.com/buraksezer/olric/issues/253, we find that when there is high traffic on the cluster and we kill one pod, the new pod is not able to get the routing table from the coordinator. It never logs the message at https://github.com/buraksezer/olric/blob/master/internal/cluster/routingtable/operations.go#L92, so it never calls markBootstrapped at https://github.com/buraksezer/olric/blob/master/internal/cluster/routingtable/operations.go#L116. DMap creation then never satisfies the IsBootstrapped condition at https://github.com/buraksezer/olric/blob/master/internal/cluster/routingtable/routingtable.go#L204 and fails with an operation timeout. This happens only when we kill a non-coordinator pod; if we kill the coordinator pod, the new pod comes up successfully.
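
To make the failure mode concrete, here is a minimal, illustrative sketch of the bootstrap flow described above (the helper names and the polling interval are simplified stand-ins, not olric's actual internals): the routing-table update handler is the only thing that flips the bootstrapped flag, and DMap creation just polls that flag until BootstrapTimeout expires, so a missed routing-table push turns into an operation timeout.

```go
// Illustrative sketch only: a simplified stand-in for the bootstrap flow
// described above, not olric's actual internal code.
package main

import (
	"context"
	"errors"
	"fmt"
	"sync/atomic"
	"time"
)

type routingTable struct {
	bootstrapped int32
}

// markBootstrapped is called by the routing-table update handler when the
// coordinator publishes a routing table to this node.
func (r *routingTable) markBootstrapped() {
	atomic.StoreInt32(&r.bootstrapped, 1)
}

func (r *routingTable) isBootstrapped() bool {
	return atomic.LoadInt32(&r.bootstrapped) == 1
}

// checkBootstrap is what DMap creation effectively waits on: it polls the
// flag until the timeout expires. If the coordinator never pushes the
// routing table, this always returns a timeout error.
func (r *routingTable) checkBootstrap(timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return errors.New("operation timeout: node could not be bootstrapped")
		case <-ticker.C:
			if r.isBootstrapped() {
				return nil
			}
		}
	}
}

func main() {
	rt := &routingTable{}
	// Simulate the coordinator never publishing the routing table:
	// markBootstrapped is never called, so DMap creation times out.
	if err := rt.checkBootstrap(2 * time.Second); err != nil {
		fmt.Println(err)
	}
}
```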

However, with low traffic the new node is able to get the routing table published by the coordinator and then creates the DMap successfully.

When there is high traffic, the coordinator is overwhelmed by LRU deletions:

{"level":"info","ts":"2024-05-21T02:31:42.587Z","caller":"leaf.go:238","msg":"2024/05/21 02:31:42 [ERROR] Failed to delete replica key/value on dmap.test_table: context canceled => delete.go:82"}
{"level":"info","ts":"2024-05-21T02:31:42.587Z","caller":"leaf.go.go:238","msg":"2024/05/21 02:31:42 [ERROR] Failed to delete expired key: test:f16e1d3b-06a4-48b6-8dfe-109915d7b1cc on DMap: dmap.test_table: canceled => eviction.go:161"}
{"level":"info","ts":"2024-05-21T02:31:42.587Z","caller":"leaf.go:238","msg":"2024/05/21 02:31:42 [ERROR] Failed to delete replica key/value on dmap.test_table: context canceled => delete.go:82"}
{"level":"info","ts":"2024-05-21T02:31:42.587Z","caller":"leaf.go:238","msg":"2024/05/21 02:31:42 [ERROR] Failed to delete expired key: test:9305b198-5863-439e-866f-17e3b54f5afd on DMap: dmap.test_table: canceled => eviction.go:161"}
{"level":"info","ts":"2024-05-21T02:31:42.587Z","caller":"leaf.go:238","msg":"2024/05/21 02:31:42 [ERROR] Failed to delete replica key/value on dmap.test_table: context canceled => delete.go:82"}
{"level":"info","ts":"2024-05-21T02:31:42.587Z","caller":"leaf.go:238","msg":"2024/05/21 02:31:42 [ERROR] Failed to delete expired key: test:63dbc9d7-0633-489f-b0b2-b9e067526d3d on DMap: dmap.test_table: canceled => eviction.go:161"}
{"level":"info","ts":"2024-05-21T02:31:42.587Z","caller":"leaf.go:238","msg":"2024/05/21 02:31:42 [ERROR] Failed to delete replica key/value on dmap.test_table: context canceled => delete.go:82"}
{"level":"info","ts":"2024-05-21T02:31:42.587Z","caller":"leaf.go:238","msg":"2024/05/21 02:31:42 [ERROR] Failed to delete expired key: test:b5f4fb9f-4154-4d0b-a280-41e910e94725 on DMap: dmap.test_table: canceled => eviction.go:161"}
{"level":"info","ts":"2024-05-21T02:31:42.587Z","caller":"leaf.go:238","msg":"2024/05/21 02:31:42 [ERROR] Failed to delete replica key/value on dmap.test_table: context canceled => delete.go:82"}
{"level":"info","ts":"2024-05-21T02:31:42.587Z","caller":"leaf.go:238","msg":"2024/05/21 02:31:42 [ERROR] Failed to delete expired key: test:12a87263-b081-4ea1-acb5-4f115b3b3219 on DMap: dmap.test_table: canceled => eviction.go:161"}
{"level":"info","ts":"2024-05-21T02:31:42.587Z","caller":"leaf.go:238","msg":"2024/05/21 02:31:42 [ERROR] Failed to delete replica key/value on dmap.test_table: context canceled => delete.go:82"}
{"level":"info","ts":"2024-05-21T02:31:42.587Z","caller":"leaf.go:238","msg":"2024/05/21 02:31:42 [ERROR] Failed to delete expired key: test:19cee89f-8260-41da-8b1d-79656004a60d on DMap: dmap.test_table: canceled => eviction.go:161"}
{"level":"info","ts":"2024-05-21T02:31:42.587Z","caller":"leaf.go:238","msg":"2024/05/21 02:31:42 [ERROR] Failed to delete replica key/value on dmap.test_table: context canceled => delete.go:82"}
{"level":"info","ts":"2024-05-21T02:31:42.587Z","caller":"leaf.go:238","msg":"2024/05/21 02:31:42 [ERROR] Failed to delete expired key: test:fa128293-2b18-4bb1-a2fe-2657e7c7043e on DMap: dmap.test_table: canceled => eviction.go:161"}
{"level":"info","ts":"2024-05-21T02:31:42.587Z","caller":"leaf.go:238","msg":"2024/05/21 02:31:42 [ERROR] Failed to delete replica key/value on dmap.test_table: context canceled => delete.go:82"}
{"level":"info","ts":"2024-05-21T02:31:42.587Z","caller":"leaf.go:238","msg":"2024/05/21 02:31:42 [ERROR] Failed to delete expired key: test:8bd69d69-dd3f-48d2-851f-0609e2554ac2 on DMap: dmap.test_table: canceled => eviction.go:161"}
{"level":"info","ts":"2024-05-21T02:31:42.587Z","caller":"leaf.go:238","msg":"2024/05/21 02:31:42 [ERROR] Failed to delete replica key/value on dmap.test_table: context canceled => delete.go:82"}

There are also a lot of logs like the following:

{"level":"info","ts":"2024-05-21T02:27:19.418Z","caller":"leaf.go:238","msg":"2024/05/21 02:27:19 [ERROR] Failed to move DMap fragment: test_table on PartID: 0 to 172.19.19.174:3320: dial tcp 172.19.19.174:3320: connect: connection refused => balancer.go:91"}
{"level":"info","ts":"2024-05-21T02:27:19.418Z","caller":"leaf.go:238","msg":"2024/05/21 02:27:19 [INFO] Moving DMap fragment: test_table (kind: Backup) on PartID: 18 to 172.19.19.174:3320 => balancer.go:86"}
{"level":"info","ts":"2024-05-21T02:27:19.533Z","caller":"leaf.go:238","msg":"2024/05/21 02:27:19 [ERROR] Failed to move DMap fragment: test_table on PartID: 18 to 172.19.19.174:3320: dial tcp 172.19.19.174:3320: connect: connection refused => balancer.go:91"}
{"level":"info","ts":"2024-05-21T02:27:19.533Z","caller":"leaf.go:238","msg":"2024/05/21 02:27:19 [INFO] Moving DMap fragment: test_table (kind: Backup) on PartID: 48 to 172.19.19.174:3320 => balancer.go:86"}
{"level":"info","ts":"2024-05-21T02:27:19.624Z","caller":"leaf.go:238","msg":"2024/05/21 02:27:19 [ERROR] Failed to move DMap fragment: test_table on PartID: 48 to 172.19.19.174:3320: dial tcp 172.19.19.174:3320: connect: connection refused => balancer.go:91"}
{"level":"info","ts":"2024-05-21T02:27:19.625Z","caller":"leaf.go:238","msg":"2024/05/21 02:27:19 [INFO] Moving DMap fragment: test_table (kind: Backup) on PartID: 81 to 172.19.19.174:3320 => balancer.go:86"}
{"level":"info","ts":"2024-05-21T02:27:19.724Z","caller":"leaf:238","msg":"2024/05/21 02:27:19 [ERROR] Failed to move DMap fragment: test_table on PartID: 81 to 172.19.19.174:3320: dial tcp 172.19.19.174:3320: connect: connection refused => balancer.go:91"}
{"level":"info","ts":"2024-05-21T02:27:19.724Z","caller":"leaf:238","msg":"2024/05/21 02:27:19 [INFO] Moving DMap fragment: test_table (kind: Backup) on PartID: 94 to 172.19.19.174:3320 => balancer.go:86"}
{"level":"info","ts":"2024-05-21T02:27:19.852Z","caller":"leaf.go:238","msg":"2024/05/21 02:27:19 [ERROR] Failed to move DMap fragment: test_table on PartID: 94 to 172.19.19.174:3320: dial tcp 172.19.19.174:3320: connect: connection refused => balancer.go:91"}
{"level":"info","ts":"2024-05-21T02:27:19.852Z","caller":"leaf.go:238","msg":"2024/05/21 02:27:19 [INFO] Moving DMap fragment: test_table (kind: Backup) on PartID: 127 to 172.19.19.174:3320 => balancer.go:86"}
{"level":"info","ts":"2024-05-21T02:27:19.913Z","caller":"leaf.go:238","msg":"2024/05/21 02:27:19 [ERROR] Failed to move DMap fragment: test_table on PartID: 127 to 172.19.19.174:3320: dial tcp 172.19.19.174:3320: connect: connection refused => balancer.go:91"}

Here 172.19.19.174 is the pod that got killed.

One potential root cause: there are a few client locks in delete.go (https://github.com/buraksezer/olric/blob/master/internal/dmap/delete.go), update.go (https://github.com/buraksezer/olric/blob/master/internal/cluster/routingtable/update.go), and put.go (https://github.com/buraksezer/olric/blob/master/internal/dmap/put.go). They all compete for the lock on the Redis client, and because there are thousands (or more) of evictions/deletions, the routing-table code cannot acquire the lock to broadcast the routing table. This looks like a lock contention issue.
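
To illustrate the kind of contention we suspect, here is a toy model (purely illustrative, not olric's actual locking code): many "eviction" goroutines repeatedly grab a shared client mutex while a lower-frequency task, standing in for the routing-table broadcast, tries to acquire the same mutex under a deadline. Depending on timing, the broadcast can miss its deadline.

```go
// Toy model of the suspected contention, not olric's actual locking code:
// many "eviction" goroutines hammer a shared mutex while one "broadcast"
// task tries to acquire it under a deadline.
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

func main() {
	var clientMu sync.Mutex

	// Simulate a flood of evictions/deletions, each briefly holding the lock.
	for i := 0; i < 64; i++ {
		go func() {
			for {
				clientMu.Lock()
				time.Sleep(5 * time.Millisecond) // pretend network call to the peer
				clientMu.Unlock()
			}
		}()
	}
	time.Sleep(50 * time.Millisecond) // let contention build up

	// The routing-table broadcast needs the same lock but has a deadline.
	ctx, cancel := context.WithTimeout(context.Background(), 250*time.Millisecond)
	defer cancel()

	acquired := make(chan struct{})
	go func() {
		clientMu.Lock()
		defer clientMu.Unlock()
		close(acquired)
	}()

	select {
	case <-acquired:
		fmt.Println("broadcast acquired the client lock in time")
	case <-ctx.Done():
		fmt.Println("broadcast starved: could not acquire client lock before deadline")
	}
}
```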

Besides, when the initial nodes come up, we also sometimes see the coordinator fail to publish the routing table to them, which then causes DMap creation to fail with a bootstrap operation timeout.

Our testing setup: Olric in embedded mode, 3 pods, each with 4 GB memory and 2 CPUs.

config.New("lan") ReplicaCount = 2 ReplicationMode = AsyncReplicationMode LRU: DMaps.MaxKeys = 1_000_000 DMaps.MaxInuse = 1_000_000_000 BootstrapTimeout = 2 * time.Minute All others are default setting

Running 1000 QPS. Each request follows the sequence: write, wait 10 ms, then read. Each key-value pair is a UUID key and a random 16-byte value, about 32 bytes in total.
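
The load generator is essentially the loop below. It is a simplified sketch meant to sit in the same program as the config sketch above; the github.com/google/uuid helper is our choice, and the DMap name test_table is the one that shows up as dmap.test_table in the logs.

```go
// Simplified sketch of the load pattern (write, wait 10 ms, then read).
// Meant to live alongside the config sketch above and be called from main
// in place of the "application code" placeholder.
package main

import (
	"context"
	"crypto/rand"
	"log"
	"time"

	"github.com/buraksezer/olric"
	"github.com/google/uuid"
)

func runLoad(ctx context.Context, db *olric.Olric) error {
	e := db.NewEmbeddedClient()
	dm, err := e.NewDMap("test_table")
	if err != nil {
		return err
	}

	ticker := time.NewTicker(time.Millisecond) // ~1000 QPS
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			go func() {
				key := uuid.NewString()   // UUID key
				value := make([]byte, 16) // random 16-byte value
				rand.Read(value)

				if err := dm.Put(ctx, key, value); err != nil {
					log.Printf("put failed: %v", err)
					return
				}
				time.Sleep(10 * time.Millisecond)
				if _, err := dm.Get(ctx, key); err != nil {
					log.Printf("get failed: %v", err)
				}
			}()
		}
	}
}
```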

buraksezer commented 1 month ago

Possibly there is an issue with your setup: the peer discovery code doesn't work properly. It may propagate non-existent nodes or fail to remove failed nodes from the member list.

I think there is no problem with the lock usage. These parts of the code are carefully written, reviewed, and tested over time.

Regarding the error messages: Olric is a highly concurrent system with many goroutines processing data concurrently. Some of these components are not synchronized, because it is not necessary. A goroutine may try to do something; if it fails, it logs the problem and continues working as usual. I accept that the error logging can be noisy sometimes.

As I said before, the problem might be in the peer discovery code. Is it possible to share it?