Closed: cobexer closed this issue 3 years ago.
Thank you for reporting. The problem is very likely unrelated to Kubernetes. Because of this I tried to reproduce it in a multi-container Docker environment, and it seems to work fine there. Which version of Olric do you use? I tried with Olric v0.3.0-beta.1.
If you can provide some sample code and configuration to reproduce the issue, that would be awesome. The Kubernetes configuration in particular is critical.
Kubernetes StatefulSet:
```yaml
kind: StatefulSet
apiVersion: apps/v1
metadata:
  name: test-container-id-to-image-cache
  namespace: playground
  labels:
    app.kubernetes.io/instance: test
    app.kubernetes.io/name: container-id-to-image-cache
spec:
  replicas: 3
  selector:
    matchLabels:
      app.kubernetes.io/instance: test
      app.kubernetes.io/name: container-id-to-image-cache
  template:
    metadata:
      labels:
        app.kubernetes.io/instance: test
        app.kubernetes.io/name: container-id-to-image-cache
    spec:
      restartPolicy: Always
      serviceAccountName: test-container-id-to-image-cache
      imagePullSecrets:
        - name: regcred
      schedulerName: default-scheduler
      terminationGracePeriodSeconds: 30
      securityContext: {}
      containers:
        - name: test-container-id-to-image-cache
          image: myregistry/container-id-to-image-cache:1.0.3-git9949bea
          command:
            - /app/container-id-to-image-cache
          args:
            - '--namespace'
            - playground
            - '--name'
            - container-id-to-image-cache
            - '--instance'
            - test
            - '--loglevel'
            - DEBUG
          resources:
            limits:
              cpu: 1200m
            requests:
              cpu: 100m
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
      serviceAccount: test-container-id-to-image-cache
      dnsPolicy: ClusterFirst
  serviceName: test-container-id-to-image-cache
  podManagementPolicy: Parallel
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 0
  revisionHistoryLimit: 10
```
The relevant Go code, I hope:
```go
type KVCache struct {
	Log      *zap.SugaredLogger
	LogLevel zapcore.Level
	db       *olric.Olric
	kvmap    *olric.DMap
}

func (kvc *KVCache) Start(namespace, name, instance string) error {
	c := config.New("lan")
	c.MaxJoinAttempts = 3600
	c.JoinRetryInterval = time.Second * 5
	c.ReadRepair = true
	c.ReplicationMode = config.AsyncReplicationMode
	c.BindPort = 4220
	c.MemberlistConfig.BindPort = 3322
	c.MemberlistConfig.AdvertisePort = 3322
	c.MemberlistConfig.DeadNodeReclaimTime = time.Second * 5
	c.LogLevel = kvc.LogLevel.String()
	c.ReplicaCount = 2
	c.Logger = zap.NewStdLog(kvc.Log.Desugar())
	c.KeepAlivePeriod = time.Second * 20
	c.DialTimeout = time.Second * 300
	c.RequestTimeout = time.Second * 30
	c.WriteQuorum = 1

	cd := &discovery.CloudDiscovery{}
	labelSelector := fmt.Sprintf("app.kubernetes.io/name=%s,app.kubernetes.io/instance=%s", name, instance)
	c.ServiceDiscovery = map[string]interface{}{
		"plugin":   cd,
		"provider": "k8s",
		"args":     fmt.Sprintf("namespace=%s label_selector=\"%s\"", namespace, labelSelector),
	}

	// Callback function. It's called when this node is ready to accept connections.
	olricCtx, cancel := context.WithCancel(context.Background())
	c.Started = func() {
		kvc.Log.Info("KVCache: Olric is ready to accept connections")
		cancel()
	}

	var err error
	kvc.db, err = olric.New(c)
	if err != nil {
		return errors.Wrap(err, "KVCache: failed to create Olric instance")
	}
	go func() {
		// Call Start in the background. It's a blocking call.
		err = kvc.db.Start()
	}()
	<-olricCtx.Done() // startup finished
	if err != nil {
		return errors.Wrap(err, "KVCache: olric.Start returned an error")
	}
	kvc.kvmap, err = kvc.db.NewDMap("kvcache")
	if err != nil {
		return errors.Wrap(err, "KVCache: failed to create DMap")
	}
	return nil
}
```
The Kubernetes headless Service used for peer discovery, I think:
```yaml
kind: Service
apiVersion: v1
metadata:
  name: test-container-id-to-image-cache
  namespace: playground
  labels:
    app.kubernetes.io/instance: test
    app.kubernetes.io/name: container-id-to-image-cache
spec:
  ports:
    - name: olric
      protocol: TCP
      port: 4220
      targetPort: 4220
    - name: memberlist
      protocol: TCP
      port: 3322
      targetPort: 3322
  selector:
    app.kubernetes.io/instance: test
    app.kubernetes.io/name: container-id-to-image-cache
  clusterIP: None
  type: ClusterIP
  sessionAffinity: None
```
Please let me know if you need more!
Thank you @cobexer. I'm going to try to reproduce the problem. The underlying problem may be a redirect loop. I implemented a fix for this in v0.3.0-beta.4; you may want to try that version.
> writes often fail with network I/O timeouts
This may indicate a problem with Kubernetes networking. Olric nodes can discover each other (via memberlist), but they cannot work together. I saw this problem before. Take a look at my Kubernetes configuration: https://github.com/buraksezer/olric-kubernetes/blob/master/olricd.yaml#L78
```yaml
apiVersion: v1
kind: Service
metadata:
  name: memberlist
spec:
  selector:
    run: olricd
  clusterIP: None
  ports:
    - port: 3322
      protocol: TCP
      targetPort: 3322
---
apiVersion: v1
kind: Service
metadata:
  name: olricd
spec:
  selector:
    run: olricd
  ports:
    - port: 3320
      protocol: TCP
      targetPort: 3320
```
> one member uses ~100% of a CPU all the time; memory use seems very high
An infinite loop and a memory/goroutine leak. This can be triggered by a redirect loop. We need to find the root cause of the problem.
> This may indicate a problem with Kubernetes networking. Olric nodes can discover each other (via memberlist), but they cannot work together.
Olric is the only workload on the cluster that experiences network issues, and since your update I have not seen the network issues mentioned above (as I also mentioned in the other issue)
From a pod with IP 10.130.11.113:

```
(From a DMap.PutEx): Member names are the same: 10.130.11.113:4220 OpCode: 2: redirection cycle detected
[ERROR] There are members in the consistent hash ring with the same name and different IDs. Name: 10.130.11.113:4220, ID of this node: 9086992081852953244, ID of the target node: 3792512234746104751. Please report this.
[ERROR] Failed to get host by id: 3792512234746104751: host not found
```
This pod had restarted, so it rejoined the cluster with the same hostname and IP. I then deleted the StatefulSet Pod (so it joined the cluster with the same name and a new IP) to test read repair, and after 6 seconds of runtime the Pod crashed and restarted (with the same hostname and the same IP).
Considering my sample code above: what can I do to allow Olric nodes to both rejoin with new IPs and reuse existing IPs?
> This pod had a restart, so it rejoined the cluster with the same Hostname and IP. I deleted the stateful set Pod (so it joined the cluster with the same name and new IP) to test read repair, and after 6 seconds runtime the Pod crashed and restarted (with the same hostname and same IP)
If you can control the full life-cycle of the process, you must call the Shutdown method before restarting. It broadcasts a leave message to the cluster, and the other nodes remove the node gracefully. If you don't call that method, memberlist will detect the node-departure event via gossiping and regular state updates. This takes some time, but it's configurable.
I haven't tried to reproduce the problem yet. But I have an idea what the problem is. Here is an evolving solution: https://github.com/buraksezer/olric/tree/fix/issue-54
Could you try to reproduce the problem on fix/issue-54?
So I tried this with the current version of the fix/issue-54 branch, but a node crashed again (with the read-repair crash reported in #61).
The crashed node's logs (even after ~25 min):

```
[ERROR] Failed to call get on a replica owner: 10.131.6.116:4220: key not found
[ERROR] Failed to call get on a replica owner: 10.131.6.116:4220: key not found
[INFO] Routing table has been pushed by 10.129.4.64:4220
[INFO] Moving DMap: kvcache (backup: false) on PartID: 2 to 10.128.4.112:4220
[ERROR] Received DMap: kvcache on PartID: 2 (backup: false) doesn't belong to me
[ERROR] Failed to move DMap: kvcache on PartID: 2 to 10.128.4.112:4220: invalid argument
[INFO] Moving DMap: kvcache (backup: false) on PartID: 3 to 10.128.4.112:4220
[ERROR] Received DMap: kvcache on PartID: 3 (backup: false) doesn't belong to me
[ERROR] Failed to move DMap: kvcache on PartID: 3 to 10.128.4.112:4220: invalid argument
```
Also, in case of a crash I obviously don't control the full life-cycle ;) Under normal circumstances I call Shutdown to properly leave the cluster.
I have just fixed #61 and merged the fixes for #54. So v0.3.0-beta.6 includes all the fixes and improvements. Upgrading is highly recommended.
I'm still iterating on these StatefulSet deployment scenarios. Some of the lines in the logs you shared caught my attention:
```
[ERROR] Received DMap: kvcache on PartID: 3 (backup: false) doesn't belong to me
[ERROR] Failed to move DMap: kvcache on PartID: 3 to 10.128.4.112:4220: invalid argument
```

The `invalid argument` error sounds interesting. I'm on it.
So far v0.3.0-beta.6 is working great! Even manually crashing my microservice multiple times didn't lead to a broken olric cluster any more! Thank you!
Not sure if there are multiple call paths, but this is one of them for `[ERROR] Failed to move DMap: kvcache on PartID: 3 to 10.128.4.112:4220: invalid argument`:
```
github.com/buraksezer/olric/internal/flog.Verbose.Printf
    /go/pkg/mod/github.com/buraksezer/olric@v0.3.0-beta.6/internal/flog/flog.go:112
github.com/buraksezer/olric.(*Olric).rebalancePrimaryPartitions.func1
    /go/pkg/mod/github.com/buraksezer/olric@v0.3.0-beta.6/rebalancer.go:177
sync.(*Map).Range
    /usr/local/go/src/sync/map.go:345
github.com/buraksezer/olric.(*Olric).rebalancePrimaryPartitions
    /go/pkg/mod/github.com/buraksezer/olric@v0.3.0-beta.6/rebalancer.go:172
github.com/buraksezer/olric.(*Olric).rebalancer
    /go/pkg/mod/github.com/buraksezer/olric@v0.3.0-beta.6/rebalancer.go:256
github.com/buraksezer/olric.(*Olric).updateRoutingOperation.func1
    /go/pkg/mod/github.com/buraksezer/olric@v0.3.0-beta.6/routing.go:501
```
And this one for `Received DMap: kvcache on PartID: 3 (backup: false) doesn't belong to me`:
```
github.com/buraksezer/olric/internal/flog.Verbose.Printf
    /go/pkg/mod/github.com/buraksezer/olric@v0.3.0-beta.6/internal/flog/flog.go:112
github.com/buraksezer/olric.(*Olric).moveDMapOperation
    /go/pkg/mod/github.com/buraksezer/olric@v0.3.0-beta.6/rebalancer.go:296
github.com/buraksezer/olric.(*Olric).requestDispatcher
    /go/pkg/mod/github.com/buraksezer/olric@v0.3.0-beta.6/olric.go:264
github.com/buraksezer/olric/internal/transport.(*Server).processMessage
    /go/pkg/mod/github.com/buraksezer/olric@v0.3.0-beta.6/internal/transport/server.go:193
github.com/buraksezer/olric/internal/transport.(*Server).processConn
    /go/pkg/mod/github.com/buraksezer/olric@v0.3.0-beta.6/internal/transport/server.go:217
```
> So far v0.3.0-beta.6 is working great! Even manually crashing my microservice multiple times didn't lead to a broken olric cluster any more! Thank you!
It's great to hear that!
> Not sure if there are multiple callpaths but this is one of them for the `[ERROR] Failed to move DMap: kvcache on PartID: 3 to 10.128.4.112:4220: invalid argument`:
The crashed node tries to move partitions to itself. I have a possible fix for this but I need to investigate further to be sure.
@cobexer Olric v0.3.0-beta.7 is out. Now Olric properly handles `memberlist.NodeUpdate` events. I think it will be useful for your deployment scenario.
If you encounter a problem, please let me know.
I will not be able to test this before the end of November; maybe @sll552 can test the updated library?
I'm closing this issue. Feel free to re-open it if you have any problems on this topic.
The intended use case for Olric in my application is a replicated cache that doesn't need transactions or disk persistence, but needs to keep values through node/datacenter failures and upgrades.
To achieve this I embedded Olric in my application, which is deployed using a Kubernetes StatefulSet with 3 replicas.
To discover the other replicas of my application for Olric I use the olric-cloud-plugin, and after some guessing and trial and error I got it to discover the other peers, and Olric seems to form a cluster.
However, with 3 replicas of my application, MinimumReplicaCount = 2 and ReplicationMode = AsyncReplicationMode, Olric seems to be unhappy:
I still need to try this use case with olricd to see if I can reproduce the problem there.
Now I know that my application (because it is still in its early stages) has a dumb behavior: the data that needs to be cached comes from the Kubernetes API server, all nodes receive the same data, and they all write the same values to the Olric DMap. Your documentation leads me to believe this should be fine, because the last write will win.
I guess I'll have to write a sample application to demonstrate the problem... but I don't have access to the application's code right now, so I will add that later.