buraksezer / olric

Distributed, in-memory key/value store and cache. It can be used as an embedded Go library and a language-independent service.
Apache License 2.0

Olric embedded in a Kubernetes StatefulSet throws network errors, rejects writes and uses too much CPU #54

Closed cobexer closed 3 years ago

cobexer commented 4 years ago

The intended use case for Olric in my application is to act as a replicated cache that doesn't need transactions or disk persistence, but does need to keep values across node/datacenter failures and through upgrades.

To achieve this I embedded Olric in my application, which is deployed as a Kubernetes StatefulSet with 3 replicas.

To discover the other replicas of my application for Olric I use the olric-cloud-plugin; after some guesswork and trial and error it discovers the other peers, and Olric appears to form a cluster.

However, with 3 replicas of my application, MinimumReplicaCount = 2 and ReplicationMode = AsyncReplicationMode, Olric seems to be unhappy:

I still need to try this use case with olricd to see if I can reproduce the problem there.

Now, I know that my application (because it is still in its early stages) has a dumb behavior: the data that needs to be cached comes from the Kubernetes API server, all nodes receive the same data, and they all write the same values to the Olric DMap. Your documentation leads me to believe this should be fine, because the last write wins.
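
For illustration, a minimal sketch of that write pattern, assuming the embedded v0.3 DMap API that appears later in this thread (DMap.Put, DMap.Get); the helper and key names here are hypothetical:

// cacheSameValue is a hypothetical helper: every replica writes the same
// key/value pair derived from the Kubernetes API server. Put overwrites any
// existing entry, so identical concurrent writes converge (last write wins).
func cacheSameValue(dm *olric.DMap, key string, value interface{}) error {
    if err := dm.Put(key, value); err != nil {
        return err
    }
    // Any node can read the value back; olric.ErrKeyNotFound signals a miss.
    _, err := dm.Get(key)
    return err
}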

I guess I'll have to write a sample application to demonstrate the problem... but I don't have access to the application's code right now, so I will add that later.

buraksezer commented 4 years ago

Thank you for reporting. The problem is very likely unrelated to Kubernetes, so I tried to reproduce it in a multi-container Docker environment, but it seems to work fine there. What version of Olric do you use? I tried with Olric v0.3.0-beta.1.

If you can provide some sample code and configuration to reproduce the issue, that would be awesome. The Kubernetes configuration in particular is critical.

cobexer commented 3 years ago

Kubernetes StatefulSet:

kind: StatefulSet
apiVersion: apps/v1
metadata:
  name: test-container-id-to-image-cache
  namespace: playground
  labels:
    app.kubernetes.io/instance: test
    app.kubernetes.io/name: container-id-to-image-cache
spec:
  replicas: 3
  selector:
    matchLabels:
      app.kubernetes.io/instance: test
      app.kubernetes.io/name: container-id-to-image-cache
  template:
    metadata:
      labels:
        app.kubernetes.io/instance: test
        app.kubernetes.io/name: container-id-to-image-cache
    spec:
      restartPolicy: Always
      serviceAccountName: test-container-id-to-image-cache
      imagePullSecrets:
        - name: regcred
      schedulerName: default-scheduler
      terminationGracePeriodSeconds: 30
      securityContext: {}
      containers:
        - name: test-container-id-to-image-cache
          image: myregistry/container-id-to-image-cache:1.0.3-git9949bea
          command:
            - /app/container-id-to-image-cache
          args:
            - '--namespace'
            - playground
            - '--name'
            - container-id-to-image-cache
            - '--instance'
            - test
            - '--loglevel'
            - DEBUG
          resources:
            limits:
              cpu: 1200m
            requests:
              cpu: 100m
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
      serviceAccount: test-container-id-to-image-cache
      dnsPolicy: ClusterFirst
  serviceName: test-container-id-to-image-cache
  podManagementPolicy: Parallel
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 0
  revisionHistoryLimit: 10

cobexer commented 3 years ago

The relevant Go code, I hope:

import (
    "context"
    "fmt"
    "time"

    "github.com/buraksezer/olric"
    "github.com/buraksezer/olric/config"
    // The reporter imports the olric-cloud-plugin package under the name
    // "discovery"; the import path below is assumed from the plugin repository.
    discovery "github.com/buraksezer/olric-cloud-plugin/lib"
    "github.com/pkg/errors"
    "go.uber.org/zap"
    "go.uber.org/zap/zapcore"
)

type KVCache struct {
    Log      *zap.SugaredLogger
    LogLevel zapcore.Level
    db       *olric.Olric
    kvmap    *olric.DMap
}

func (kvc *KVCache) Start(namespace, name, instance string) error {
    c := config.New("lan")
    c.MaxJoinAttempts = 3600
    c.JoinRetryInterval = time.Second * 5
    c.ReadRepair = true
    c.ReplicationMode = config.AsyncReplicationMode
    c.BindPort = 4220
    c.MemberlistConfig.BindPort = 3322
    c.MemberlistConfig.AdvertisePort = 3322
    c.MemberlistConfig.DeadNodeReclaimTime = time.Second * 5
    c.LogLevel = kvc.LogLevel.String()
    c.ReplicaCount = 2
    c.Logger = zap.NewStdLog(kvc.Log.Desugar())
    c.KeepAlivePeriod = time.Second * 20
    c.DialTimeout = time.Second * 300
    c.RequestTimeout = time.Second * 30
    c.WriteQuorum = 1
    cd := &discovery.CloudDiscovery{}
    labelSelector := fmt.Sprintf("app.kubernetes.io/name=%s,app.kubernetes.io/instance=%s", name, instance)
    c.ServiceDiscovery = map[string]interface{}{
        "plugin":   cd,
        "provider": "k8s",
        "args":     fmt.Sprintf("namespace=%s label_selector=\"%s\"", namespace, labelSelector),
    }

    // Callback function. It's called when this node is ready to accept connections.
    olricCtx, cancel := context.WithCancel(context.Background())
    c.Started = func() {
        kvc.Log.Info("KVCache: Olric is ready to accept connections")
        cancel()
    }

    var err error
    kvc.db, err = olric.New(c)

    if err != nil {
        return errors.Wrap(err, "KVCache: failed to create Olric instance")
    }

    go func() {
        // Call Start at background. It's a blocker call.
        err = kvc.db.Start()
    }()

    <-olricCtx.Done() // startup finished

    if err != nil {
        return errors.Wrap(err, "KVCache: olric.Start returned an error")
    }

    kvc.kvmap, err = kvc.db.NewDMap("kvcache")

    if err != nil {
        return errors.Wrap(err, "KVCache: failed to create DMap")
    }

    return nil
}
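
For context, a couple of hypothetical accessor methods on KVCache (not part of the code above) illustrating how the DMap created in Start would be used, assuming the v0.3 DMap API (Put, Get, olric.ErrKeyNotFound):

// Put stores a value in the replicated cache; an existing entry for the key
// is overwritten.
func (kvc *KVCache) Put(key string, value interface{}) error {
    return kvc.kvmap.Put(key, value)
}

// Get reads a value back; Olric returns olric.ErrKeyNotFound for missing keys.
func (kvc *KVCache) Get(key string) (interface{}, error) {
    return kvc.kvmap.Get(key)
}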

cobexer commented 3 years ago

The Kubernetes headless Service used for peer discovery, I think:

kind: Service
apiVersion: v1
metadata:
  name: test-container-id-to-image-cache
  namespace: playground
  labels:
    app.kubernetes.io/instance: test
    app.kubernetes.io/name: container-id-to-image-cache
spec:
  ports:
    - name: olric
      protocol: TCP
      port: 4220
      targetPort: 4220
    - name: memberlist
      protocol: TCP
      port: 3322
      targetPort: 3322
  selector:
    app.kubernetes.io/instance: test
    app.kubernetes.io/name: container-id-to-image-cache
  clusterIP: None
  type: ClusterIP
  sessionAffinity: None

cobexer commented 3 years ago

Please let me know if you need more!

buraksezer commented 3 years ago

Thank you @cobexer. I'm going to try to reproduce the problem. The underlying cause may be a redirect loop; I implemented a fix for that in v0.3.0-beta.4. You may want to try this version.

writes often fail with network I/O timeouts

This may indicate a problem with Kubernetes networking. Olric nodes can discover each other (via memberlist), but they cannot work together. I saw this problem before. Take a look at my Kubernetes configuration: https://github.com/buraksezer/olric-kubernetes/blob/master/olricd.yaml#L78

apiVersion: v1
kind: Service
metadata:
  name: memberlist
spec:
  selector:
    run: olricd
  clusterIP: None
  ports:
  - port: 3322
    protocol: TCP
    targetPort: 3322

---

apiVersion: v1
kind: Service
metadata:
  name: olricd
spec:
  selector:
    run: olricd
  ports:
  - port: 3320
    protocol: TCP
    targetPort: 3320

one member uses ~100% of a CPU all the time
memory use seems very high

An infinite loop and a memory/goroutine leak. This can be triggered by a redirect loop. We need to find the root cause of the problem.

cobexer commented 3 years ago

This may indicate a problem with Kubernetes networking. Olric nodes can discover each other (via memberlist), but they cannot work together.

Olric is the only workload on the cluster that experiences network issues, and since your update I have not seen the network issues mentioned above (as I also mentioned in the other issue).

From a pod with IP 10.130.11.113:

(From a DMap.PutEx): Member names are the same: 10.130.11.113:4220 OpCode: 2: redirection cycle detected
[ERROR] There are members in the consistent hash ring with the same name and different IDs. Name: 10.130.11.113:4220, ID of this node: 9086992081852953244, ID of the target node: 3792512234746104751. Please report this.
[ERROR] Failed to get host by id: 3792512234746104751: host not found

This pod had restarted, so it rejoined the cluster with the same hostname and IP. I then deleted the StatefulSet Pod (so it joined the cluster with the same name and a new IP) to test read repair, and after about 6 seconds of runtime the Pod crashed and restarted (with the same hostname and the same IP).

Considering my sample code above: what can I do to allow Olric nodes both to rejoin with new IPs and to reuse existing IPs?

buraksezer commented 3 years ago

This pod had a restart, so it rejoined the cluster with the same Hostname and IP. I deleted the stateful set Pod (so it joined the cluster with the same name and new IP) to test read repair, and after 6 seconds runtime the Pod crashed and restarted (with the same hostname and same IP)

If you can control the full life-cycle of the process, you should call the Shutdown method before restarting. It broadcasts a leave message to the cluster, and the other nodes remove the departing node gracefully. If you don't call that method, memberlist will detect the node's departure via gossip and regular state updates. This takes some time, but it's configurable.
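
A minimal sketch of how that could be wired up in a pod, assuming the embedded v0.3 API (Olric.Shutdown(ctx)); the signal handling around it is illustrative, not taken from this thread:

// Assumes imports: context, log, os, os/signal, syscall, time, and
// github.com/buraksezer/olric.
//
// handleSIGTERM traps the signal the kubelet sends before
// terminationGracePeriodSeconds expires and calls Shutdown, so the node
// broadcasts a leave message instead of waiting for memberlist's failure
// detection to notice the departure.
func handleSIGTERM(db *olric.Olric) {
    sigCh := make(chan os.Signal, 1)
    signal.Notify(sigCh, syscall.SIGTERM, os.Interrupt)
    <-sigCh

    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()
    if err := db.Shutdown(ctx); err != nil {
        log.Printf("olric shutdown returned an error: %v", err)
    }
}

When Shutdown cannot run (for example after a crash), how quickly the rest of the cluster notices the departure is governed by the memberlist settings exposed through c.MemberlistConfig (for example ProbeInterval and SuspicionMult).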

I haven't tried to reproduce the problem yet, but I have an idea of what it is. Here is an evolving fix: https://github.com/buraksezer/olric/tree/fix/issue-54

Could you try to reproduce the problem on fix/issue-54?

cobexer commented 3 years ago

So I tried this with the current version of the fix/issue-54 branch, but a node crashed again (with the readRepair crash reported in #61).

The crashed node's logs (even after ~25 min):

[ERROR] Failed to call get on a replica owner: 10.131.6.116:4220: key not found
[ERROR] Failed to call get on a replica owner: 10.131.6.116:4220: key not found
[INFO] Routing table has been pushed by 10.129.4.64:4220
[INFO] Moving DMap: kvcache (backup: false) on PartID: 2 to 10.128.4.112:4220
[ERROR] Received DMap: kvcache on PartID: 2 (backup: false) doesn't belong to me
[ERROR] Failed to move DMap: kvcache on PartID: 2 to 10.128.4.112:4220: invalid argument
[INFO] Moving DMap: kvcache (backup: false) on PartID: 3 to 10.128.4.112:4220
[ERROR] Received DMap: kvcache on PartID: 3 (backup: false) doesn't belong to me
[ERROR] Failed to move DMap: kvcache on PartID: 3 to 10.128.4.112:4220: invalid argument

Also, in the case of a crash I obviously don't control the full life-cycle ;) Under normal circumstances I do call Shutdown to leave the cluster properly.

buraksezer commented 3 years ago

I have just fixed #61 and merged the fixes for this issue. v0.3.0-beta.6 includes all the fixes and improvements, so upgrading is highly recommended.

I'm still iterating over these StatefulSet deployment scenarios. Some of the lines in the logs you shared caught my attention:

[ERROR] Received DMap: kvcache on PartID: 3 (backup: false) doesn't belong to me
[ERROR] Failed to move DMap: kvcache on PartID: 3 to 10.128.4.112:4220: invalid argument

The invalid argument error sounds interesting. I'm on it.

cobexer commented 3 years ago

So far v0.3.0-beta.6 is working great! Even manually crashing my microservice multiple times didn't lead to a broken olric cluster any more! Thank you!

Not sure if there are multiple call paths, but this is one of them for the [ERROR] Failed to move DMap: kvcache on PartID: 3 to 10.128.4.112:4220: invalid argument:

github.com/buraksezer/olric/internal/flog.Verbose.Printf
    /go/pkg/mod/github.com/buraksezer/olric@v0.3.0-beta.6/internal/flog/flog.go:112
github.com/buraksezer/olric.(*Olric).rebalancePrimaryPartitions.func1
    /go/pkg/mod/github.com/buraksezer/olric@v0.3.0-beta.6/rebalancer.go:177
sync.(*Map).Range
    /usr/local/go/src/sync/map.go:345
github.com/buraksezer/olric.(*Olric).rebalancePrimaryPartitions
    /go/pkg/mod/github.com/buraksezer/olric@v0.3.0-beta.6/rebalancer.go:172
github.com/buraksezer/olric.(*Olric).rebalancer
    /go/pkg/mod/github.com/buraksezer/olric@v0.3.0-beta.6/rebalancer.go:256
github.com/buraksezer/olric.(*Olric).updateRoutingOperation.func1
    /go/pkg/mod/github.com/buraksezer/olric@v0.3.0-beta.6/routing.go:501

And this one for Received DMap: kvcache on PartID: 3 (backup: false) doesn't belong to me:

github.com/buraksezer/olric/internal/flog.Verbose.Printf
    /go/pkg/mod/github.com/buraksezer/olric@v0.3.0-beta.6/internal/flog/flog.go:112
github.com/buraksezer/olric.(*Olric).moveDMapOperation
    /go/pkg/mod/github.com/buraksezer/olric@v0.3.0-beta.6/rebalancer.go:296
github.com/buraksezer/olric.(*Olric).requestDispatcher
    /go/pkg/mod/github.com/buraksezer/olric@v0.3.0-beta.6/olric.go:264
github.com/buraksezer/olric/internal/transport.(*Server).processMessage
    /go/pkg/mod/github.com/buraksezer/olric@v0.3.0-beta.6/internal/transport/server.go:193
github.com/buraksezer/olric/internal/transport.(*Server).processConn
    /go/pkg/mod/github.com/buraksezer/olric@v0.3.0-beta.6/internal/transport/server.go:217

buraksezer commented 3 years ago

So far v0.3.0-beta.6 is working great! Even manually crashing my microservice multiple times didn't lead to a broken olric cluster any more! Thank you!

It's great to hear that!

Not sure if there are multiple call paths, but this is one of them for the [ERROR] Failed to move DMap: kvcache on PartID: 3 to 10.128.4.112:4220: invalid argument:

The crashed node tries to move partitions to itself. I have a possible fix for this but I need to investigate further to be sure.

buraksezer commented 3 years ago

@cobexer Olric v0.3.0-beta.7 is out. Olric now properly handles memberlist.NodeUpdate events. I think this will be useful for your deployment scenario.

If you encounter a problem, please let me know.

cobexer commented 3 years ago

I will not be able to test this before the end of November; maybe @sll552 can test the updated library?

buraksezer commented 3 years ago

I'm closing this issue. Feel free to re-open it if you have any further problems on this topic.