Entire cluster fail to serve requests when one pod fails to restart with DMap creation operation timeout

zhp007 commented 1 month ago

This is a follow-up to https://github.com/buraksezer/olric/issues/251

We are using Olric in embedded mode to build a cache service with the following config:

Config env: "wan"
PartitionCount: 271
ReplicaCount: 2
ReplicationMode: AsyncReplicationMode

We create only one DMap.

We are testing with 3 pods. While 1000 QPS of traffic was ongoing, we killed one pod to test a failure scenario.

The new pod fails to create the DMap with NewDMap and returns the following error:

operation timeout

If DMap creation fails, we fail the pod creation, so creation of the new pod kept failing with the above error.

The other nodes see different kinds of errors. Here are some samples:

1.

[ERROR] Failed to delete replica key/value on dmap.test_table: dial tcp 172.19.46.26:3320: connect: no route to host => delete.go:82"}

2.

[INFO] Moving DMap fragment: test_table (kind: Backup) on PartID: 270 to 172.19.21.125:3320 => balancer.go:86"}
[ERROR] Failed to move DMap fragment: test_table on PartID: 270 to 172.19.21.125:3320: dial tcp 172.19.21.125:3320: connect: connection refused => balancer.go:91"}

Then the entire 3-node cluster failed to serve incoming requests.

We also sometimes observed the operation timeout on DMap creation with the same setup even when there was no traffic.

buraksezer commented 1 month ago

That's interesting. I have never seen such a problem before. I'll try to reproduce it, but it is possibly related to your network environment. I assume you are trying to deploy an Olric cluster on Kubernetes. How do the nodes discover each other? There is a plug-in to discover nodes in a Kubernetes environment, but it is not properly maintained.

Config env: "wan"

People generally use "lan" as the network environment for memberlist configuration. memberlist configuration can be tricky.

Are you using the cluster client to connect to the cluster? I guess there is a subtle issue in your network setup.

zhp007 commented 1 month ago

We are deploying Olric in embedded mode and use the approach in https://github.com/buraksezer/olric/issues/195 for service discovery.

We also tried https://github.com/buraksezer/olric-cloud-plugin as well as setting a static member list. With all of these approaches, we see the DMap creation operation timeout in two cases:

1. During cluster setup: it happens sometimes, but the pod can come up later and form the cluster.
2. After killing one pod (in a 3-pod cluster) while there is traffic: the new pod can never come up and fails with the DMap operation timeout.

If there were a network problem, I assume the 1st case would also fail, but it always succeeds.

For the 2nd case, all traffic to the other running nodes failed with errors like tcp 172.19.22.75:3320: connect: connection refused or deadline exceeded/canceled, and the entire cluster could not serve traffic anymore.

But as soon as we stop the traffic, the restarted pods succeed and receive the routing table and DMap fragments:

[INFO] Routing table has been pushed by 172.18.140.44:3320 => operations.go:92"}
[INFO] Received DMap (kind: Primary): realtime-leaf on PartID: 7 => balance.go:128"}
[INFO] Received DMap (kind: Backup): realtime-leaf on PartID: 270 => balance.go:128"}

But when there is traffic, we do not even see the 1st of these log lines; DMap creation just fails with operation timeout.

Also, markBootstrap seems to be a prerequisite for the DMap CheckBootstrap. We didn't see the log line from https://github.com/buraksezer/olric/blob/master/internal/cluster/routingtable/operations.go#L92 in the logs, which means startup never reaches that point.

We are using an EmbeddedClient on each node to connect to Olric.

buraksezer commented 1 month ago

This is too difficult for me to analyze because I cannot reproduce the problem here. Possibly, the peer discovery code fails to propagate or remove dead nodes from the system.

olric-cloud-plugin is abandonware. I last tested it on Kubernetes a long time ago.
The static peer list is just for playing with Olric on localhost.
The error logs are normal. Olric can be too chatty about network problems. You can try to decrease the verbosity level. Check out https://github.com/buraksezer/olric/blob/81e12546eb39f906efdc4afbb0fb13b61a4ea64d/pkg/flog/flog.go#L27

derekperkins commented 1 month ago

@zhp007 We are still using the same code from the gist in https://github.com/buraksezer/olric/issues/195, and I haven't ever seen indications that there have been problems with it, with some very aggressive autoscaling set up. Here's the config we use:

// create a new Olric configuration
cfg := config.New("lan") // default configuration
cfg.ServiceDiscovery = map[string]any{
	"plugin": k8sDisc,
}
cfg.ReplicationMode = config.AsyncReplicationMode
cfg.LogLevel = "WARN"
cfg.LogVerbosity = 1

buraksezer / olric

Entire cluster fail to serve requests when one pod fails to restart with DMap creation operation timeout #253