zhp007 opened 1 month ago
Possibly, there is an issue with your setup: the peer discovery code doesn't work properly. It may propagate non-existent nodes or fail to remove failed nodes from the member list.
I think there is no problem with the lock usage. Those parts of the code have been carefully written, reviewed, and tested over time.
Regarding the error messages: Olric is a highly concurrent system with many goroutines processing data concurrently. Some of these components are not synchronized, because it is not necessary: a goroutine may try to do something, and if it fails, it logs the problem and continues working as usual. The error logging may be noisy sometimes; I accept that. A minimal sketch of that pattern follows (not Olric's actual code; `runWorker` and its parameters are illustrative):
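```go
package worker

import (
	"log"
	"time"
)

// runWorker illustrates the "log and continue" style described above:
// a background goroutine attempts a task periodically and, instead of
// coordinating with other components on failure, simply logs the error
// and retries on the next tick. Under sustained failures this is what
// makes the log output noisy.
func runWorker(task func() error, interval time.Duration, quit <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			if err := task(); err != nil {
				log.Printf("task failed: %v", err)
			}
		case <-quit:
			return
		}
	}
}
```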
As I said before, the problem might be in the peer discovery code. Is it possible to share it?
After further investigation of the issue in https://github.com/buraksezer/olric/issues/253, we found that when there is high traffic on the cluster, after we kill one pod, the new pod is not able to get the routing table from the coordinator. It never prints the log at https://github.com/buraksezer/olric/blob/master/internal/cluster/routingtable/operations.go#L92, so it fails to call `markBootstrapped` later at https://github.com/buraksezer/olric/blob/master/internal/cluster/routingtable/operations.go#L116. DMap creation then never satisfies the `IsBootstrapped` condition at https://github.com/buraksezer/olric/blob/master/internal/cluster/routingtable/routingtable.go#L204 and fails with an operation timeout. This happens only when we kill a non-coordinator pod; if we kill the coordinator pod, the new pod can come up successfully. With low traffic, however, the new node is able to get the routing table published by the coordinator and then creates the DMap successfully.
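The failure mode above boils down to a flag that is only set when the routing table arrives, plus a timed wait in front of DMap creation. A simplified sketch of that gating (the `markBootstrapped`/`IsBootstrapped` names follow the linked code; everything else is illustrative, not Olric's actual implementation):

```go
package routingsketch

import (
	"context"
	"errors"
	"sync/atomic"
	"time"
)

var errBootstrapTimeout = errors.New("routingtable: bootstrap timed out")

type node struct {
	bootstrapped     int32
	bootstrapTimeout time.Duration
}

// markBootstrapped runs once the routing table is received from the
// coordinator (operations.go#L116 in the links above). Under high
// traffic it is never reached on the new pod.
func (n *node) markBootstrapped() {
	atomic.StoreInt32(&n.bootstrapped, 1)
}

func (n *node) isBootstrapped() bool {
	return atomic.LoadInt32(&n.bootstrapped) == 1
}

// checkBootstrap models what DMap creation waits on (routingtable.go#L204
// in the links above): poll the flag until the bootstrap timeout expires,
// then give up with a timeout error.
func (n *node) checkBootstrap(ctx context.Context) error {
	ctx, cancel := context.WithTimeout(ctx, n.bootstrapTimeout)
	defer cancel()
	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()
	for {
		if n.isBootstrapped() {
			return nil
		}
		select {
		case <-ticker.C:
		case <-ctx.Done():
			return errBootstrapTimeout
		}
	}
}
```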
When there is high traffic, the coordinator is overwhelmed by LRU deletions:
There are also a lot of logs like the following:
where 172.19.19.174 is the pod that got killed.
One potential root cause: there are a few client locks in delete.go (https://github.com/buraksezer/olric/blob/master/internal/dmap/delete.go), update.go (https://github.com/buraksezer/olric/blob/master/internal/cluster/routingtable/update.go), and put.go (https://github.com/buraksezer/olric/blob/master/internal/dmap/put.go). They all compete for the lock on the Redis client, and because there are thousands (or more) of evictions/deletions, the routing table code cannot acquire the lock to broadcast the routing table. This looks like a lock contention issue.
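As a toy model of the suspected contention (illustrative only, not Olric's code): several eviction goroutines hammer a shared client lock while a routing-table broadcast waits its turn, and the program prints how long the broadcast was starved.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// sharedClient stands in for the client whose lock is contested between
// the delete/put paths and the routing-table broadcast.
type sharedClient struct {
	mu sync.Mutex
}

func (c *sharedClient) delete() {
	c.mu.Lock()
	time.Sleep(50 * time.Microsecond) // simulate one deletion round trip
	c.mu.Unlock()
}

// broadcast reports how long it had to wait for the lock.
func (c *sharedClient) broadcast() time.Duration {
	start := time.Now()
	c.mu.Lock()
	waited := time.Since(start)
	c.mu.Unlock()
	return waited
}

func main() {
	c := &sharedClient{}
	// Simulate a high eviction rate: several goroutines deleting in a
	// tight loop, as the LRU path would under heavy traffic.
	for i := 0; i < 8; i++ {
		go func() {
			for {
				c.delete()
			}
		}()
	}
	time.Sleep(100 * time.Millisecond)
	fmt.Printf("routing table broadcast waited %v for the lock\n", c.broadcast())
}
```

Go's `sync.Mutex` does eventually hand the lock to starved waiters, so the wait in this toy stays bounded; the point is only that every broadcast queues behind in-flight deletions.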
Besides, when a node initially comes up, we also sometimes see the coordinator fail to publish the routing table to it, which then causes the node to fail to create the DMap due to a bootstrap operation timeout.
Our testing setup: Olric in embedded mode, 3 pods, each with 4 GB of memory and 2 CPUs.
```
config.New("lan")
ReplicaCount = 2
ReplicationMode = AsyncReplicationMode
LRU:
  DMaps.MaxKeys = 1_000_000
  DMaps.MaxInuse = 1_000_000_000
BootstrapTimeout = 2 * time.Minute
```

All others are the default settings.
Running 1000 QPS, where each request is the sequence: write, wait 10 ms, then read. Each key-value pair is a UUID key and a random 16-byte value, so 32 bytes in total.
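For reference, a minimal sketch of what this setup and workload might look like in embedded mode (the `bench` DMap name, the github.com/google/uuid dependency, and the driver loop are illustrative assumptions, not the actual test code):

```go
package main

import (
	"crypto/rand"
	"log"
	"time"

	"github.com/buraksezer/olric"
	"github.com/buraksezer/olric/config"
	"github.com/google/uuid" // assumed UUID library; any generator works
)

func main() {
	c := config.New("lan")
	c.ReplicaCount = 2
	c.ReplicationMode = config.AsyncReplicationMode
	if c.DMaps == nil {
		c.DMaps = &config.DMaps{}
	}
	c.DMaps.EvictionPolicy = "LRU" // the "LRU" setting above
	c.DMaps.MaxKeys = 1_000_000
	c.DMaps.MaxInuse = 1_000_000_000
	c.BootstrapTimeout = 2 * time.Minute

	// Signal when this member has joined the cluster.
	started := make(chan struct{})
	c.Started = func() { close(started) }

	db, err := olric.New(c)
	if err != nil {
		log.Fatalf("olric.New: %v", err)
	}
	go func() {
		// Start blocks; bootstrapping (and the timeout discussed above)
		// happens on this path.
		if err := db.Start(); err != nil {
			log.Fatalf("olric.Start: %v", err)
		}
	}()
	<-started

	dm, err := db.NewDMap("bench")
	if err != nil {
		// This is the call that times out when bootstrap never completes.
		log.Fatalf("NewDMap: %v", err)
	}

	// Roughly 1000 QPS: fire one write/wait/read sequence every millisecond.
	ticker := time.NewTicker(time.Millisecond)
	defer ticker.Stop()
	for range ticker.C {
		go func() {
			key := uuid.New().String() // UUID key, as in the report
			value := make([]byte, 16)  // random 16-byte value
			if _, err := rand.Read(value); err != nil {
				log.Printf("rand.Read: %v", err)
				return
			}
			if err := dm.Put(key, value); err != nil {
				log.Printf("Put: %v", err)
				return
			}
			time.Sleep(10 * time.Millisecond)
			if _, err := dm.Get(key); err != nil {
				log.Printf("Get: %v", err)
			}
		}()
	}
}
```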