Hostname is incorrect after deploying to k8s

bitwalker / libcluster

Automatic cluster formation/healing for Elixir applications

MIT License

1.96k stars 187 forks source link

Hostname is incorrect after deploying to k8s #111

Open gabrielrinaldi opened 5 years ago

gabrielrinaldi commented 5 years ago

I am using Cluster.Strategy.Kubernetes.DNSSRV for the strategy. Everything works fine on the initial deploy. But after a few deploys I get the following error:

[backend-elixir-0] {"context":{"runtime":{"application":"libcluster","file":"lib/logger.ex","function":"error/2","line":17,"module_name":"Cluster.Logger","vm_pid":"<0.3785.0>"},"system":{"hostname":"backend-elixir-0","pid":1}},"dt":"2019-10-06T23:46:44.601190Z","event":null,"level":"error","message":"[libcluster:k8s] 'backend-elixir.backend-elixir-staging.svc.cluster.local.' : lookup against backend-elixir failed: :nxdomain"}
[backend-elixir-0] {"context":{"runtime":{"application":"libcluster","file":"lib/logger.ex","function":"error/2","line":17,"module_name":"Cluster.Logger","vm_pid":"<0.3785.0>"},"system":{"hostname":"backend-elixir-0","pid":1}},"dt":"2019-10-06T23:46:44.604630Z","event":null,"level":"error","message":"[libcluster:k8s] 'backend-elixir.backend-elixir-staging.svc.cluster.local.' : lookup against backend-elixir failed: :nxdomain"}

The only way to recover from it is delete the mnesia cache disk for each of the pods and restart everything.

The only thing using mnesia so far is pow.

I am new to Elixir, but I am happy to gather more information if someone points me in the right direction.

danschultzer commented 5 years ago

Does it work if you restart without deleting the Mnesia disk files?

gabrielrinaldi commented 5 years ago

No, I get stuck with a CrashLoop. I did more testing, and here is what I found:

When one pod has the master node tables set and the other doesn't
- I can deploy without breaking anything
- But if I delete both pods I get the same error
When of the pods have the master node tables set
- Any deploy will break it
If both pods have the master node tables set
- I haven't been able to document this yet

danschultzer commented 5 years ago

What is the log for the app when the CrashLoop occurs? The same as above with :nxdomain error?

Also when you say master node tables, what do you refer to? In the Mnesia cluster all nodes are master nodes.

bitwalker commented 5 years ago

The initial error there is pretty clear, though I imagine the confusion is in why it happens - the lookup for the backend-elixir service fails because it is unable to resolve any records for the domain, and there are no records because there are no active pods for that service.

Without additional context it isn't clear when this error is occuring relative to the startup of the node, but I'm assuming it happens right away with the first pod after the previous instances were terminated, and before the new pods have been marked as live and thus made available as instances of the service from k8s perspective.

Mnesia really requires something akin to StatefulSets - it is not friendly to environments where nodes come and go freely, it expects a certain set of nodes to be part of a cluster, and for membership changes in that cluster to be relatively rare, and those changes require coordination. Furthermore, it expects that the cluster is formed by bringing up all of the nodes, creating a schema with the set of nodes participating in the Mnesia cluster, then starting Mnesia on all of those nodes (including the one where the schema is first created). Once the schema is replicated, then all of the nodes can be terminated safely and brought back up, but only if the schema data is persistent, otherwise you either have to start from scratch again, or bring up new nodes, use add_table_copy to introduce the new nodes to the cluster, then start Mnesia on those nodes. You may also see issues if you create a schema on more than one node, and then connect those nodes, since naturally the schemas will conflict. You cannot create a schema until the cluster is formed, unless you do so on only one node, then use add_table_copy like I mentioned above.

I don't believe this is a libcluster issue per se, but we'll need more information for sure in order to diagnose the root cause. The initial :nxdomain error is pretty clear as I explained above, but even if you address that specific error, the bigger issue here is that new pods are not able to connect to the old cluster to sync the Mnesia schema before the old pods are shutdown. This presumably causes the new pods to try and create the schema themselves, which will occur independently on all of them since the new pods won't see each other until they have SRV records, which doesn't happen until k8s marks them as live. Ultimately I think you will need to use something like StatefulSets to ensure that you can bring a cluster up with a fixed set of nodes properly; and manually introduce new nodes to that cluster as needed (as well as remove them).

danschultzer commented 5 years ago

Thanks for the detailed description @bitwalker!

@gabrielrinaldi has mentioned to me in the elixir slack channel that https://github.com/Shopify/kubernetes-deploy is used for deploys .

The Mnesia cache in Pow is a GenServer that automatically handles initialization and replication in Mnesia as long as a list of existing cluster nodes are set for the init callback. This is usually done by providing the value of Node.list() when starting the Mnesia cache. If any of the provided nodes has Mnesia running then it'll start replicating from the cluster, otherwise it'll load from what's persisted to disk (or create a new empty schema).

I've been helping with a guide on how to integrate libcluster with Pow/Mnesia.

The following setup with Cluster.Strategy.Kubernetes is what has worked for me, to ensure that a list of cluster nodes is passed on to the Mnesia cache:

  def start(_type, _args) do
    topologies = [
      example: [
        strategy: Cluster.Strategy.Kubernetes,
        config: [
          # ...
        ]
      ]
    ]

    # List all child processes to be supervised
    children = [
      {Cluster.Supervisor, [topologies, [name: MyApp.ClusterSupervisor]]},
      MyApp.MnesiaClusterSupervisor,
      # Start the Ecto repository
      MyApp.Repo,
      # Start the endpoint when the application starts
      MyAppWeb.Endpoint
      # Starts a worker by calling: MyApp.Worker.start_link(arg)
      # {MyApp.Worker, arg},
    ]

    # See https://hexdocs.pm/elixir/Supervisor.html
    # for other strategies and supported options
    opts = [strategy: :one_for_one, name: MyApp.Supervisor]
    Supervisor.start_link(children, opts)
  end

defmodule MyApp.MnesiaClusterSupervisor do
  use Supervisor

  def start_link(init_arg) do
    Supervisor.start_link(__MODULE__, init_arg, name: __MODULE__)
  end

  @impl true
  def init(_init_arg) do
    children = [
      {Pow.Store.Backend.MnesiaCache, extra_db_nodes: Node.list()},
      Pow.Store.Backend.MnesiaCache.Unsplit
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end
end

But I don't know if this is robust? Would it be better to restart the Mnesia cache in the :connect opt callback with the DNS SRV setup?

There is support for netsplit recovery btw. The thing I want to make sure of is that the Mnesia cache GenServer starts with at least one active cluster node (if a cluster exists) to init replication.

bitwalker commented 5 years ago

@danschultzer Yeah that's more or less what I was expecting. The problem(s) I described in my last comment are definitely possible with that setup.

There are few factors at play:

A service requires pods to be started and pass their readiness/liveness checks before being added as active pods to that service. Until that happens, there will be no SRV records for that service domain
Each pod can't finish starting without starting Mnesia, which requires either initializing a new cluster with an empty set of members, or joining an existing cluster, which can only happen if there are active pods in the service pool.
Since the list of active pods is fetched almost immediately during startup, it is very likely that it will be incomplete or empty; this heavily depends on the deployment configuration in k8s
With the way Pow works, the cluster is never truly given a chance to form first before Mnesia is initialized, it either joins a cluster, or immediately assumes it is forming a new cluster and initializes/starts Mnesia without a node list. There is an inherent race condition here in that two nodes that haven't found each other yet will both assume they are the first node in the cluster, and subsequently fail once they try to connect to each other. This can even happen between old nodes and new ones, where the new node initializes Mnesia, then connects to the pre-existing cluster.

NOTE: It is not clear from the kubernetes-deploy docs what all it does, at least at a glance, so what would be particularly helpful is a dump of the configuration for the Deployment/ReplicaSets involved.

The best setup here in my opinion is to be able to boot a node, have it fully start without starting Mnesia, but only be marked ready, not live; and once started, the node tries to join up with a pre-existing cluster for some period of time before deciding to form a new cluster. Once Mnesia is started, the components which depend on Mnesia are then able to be started as well (and depending on how you approach it, may be able to automatically start on their own once they see that Mnesia is ready). This technique is known to me as "stacking theory", and you can read more about the idea here. The key is that you don't necessarily try to accomplish absolutely everything in the initialization of the supervisor tree - some things should necessarily be deferred and lazily instantiated once they are ready, and as that happens, the system approaches operational readiness. Your liveness checks then just evaluate whether the system has fully started or not (or has at least obtained enough operational capability to be allowed to start servicing requests).

The bottom line though, is that the current setup you've outlined is racy, and I believe that is the main culprit here. I could be more precise given more information about the exact sequence in which events occur (i.e. how deploys are rolled out, step by step, pod by pod); but hopefully what I've outlined here clarifies the issue well enough to allow you to troubleshoot. I really do want to stress that I don't think Mnesia is compatible with cattle-style nodes, it is designed for pet-style/persistent nodes. This also simplifies a lot of things (such as which node is responsible for initializing the schema/cluster), whereas trying to make Mnesia work with a shifting set of anonymous nodes is very fragile.

danschultzer commented 5 years ago

Thanks for the excellent response @bitwalker! That makes a lot of sense.

I found this blog post that goes into StatefulSet with Cassandra cluster which I think is what is needed here: https://medium.com/velotio-perspectives/exploring-upgrade-strategies-for-stateful-sets-in-kubernetes-c02b8286f251

@gabrielrinaldi what is your deploy strategy? Can you share the YAML?

Also any other info would help.

gabrielrinaldi commented 5 years ago

@bitwalker I am almost convinced that redis would be a better option on a Kubernetes environment. Imagine that you have multiple deploys a day, the chances of something breaking is just too high.

@danschultzer here is our yml file:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: backend-elixir
  namespace: backend-elixir-staging
  annotations:
    fluentbit.io/parser: docker
spec:
  serviceName: backend-elixir
  replicas: 2
  revisionHistoryLimit: 5
  selector:
    matchLabels:
      app: backend-elixir
  template:
    metadata:
      labels:
        app: backend-elixir
      annotations:
        fluentbit.io/parser: docker
    spec:
      containers:
      - name: backend-elixir
        image: us.gcr.io/aero-falcon/backend-elixir:<%= current_sha %>
        resources:
          requests:
            memory: "1000Mi"
            cpu: "500m"
          limits:
            memory: "1500Mi"
            cpu: "1000m"
        livenessProbe:
          httpGet:
            path: /healthz
            port: 4001
            httpHeaders:
              - name: X-Custom-Header
                value: Awesome
          timeoutSeconds: 5
          periodSeconds: 10
          initialDelaySeconds: 15
        readinessProbe:
          httpGet:
            path: /healthz
            port: 4001
            httpHeaders:
              - name: X-Custom-Header
                value: Awesome
          timeoutSeconds: 10
          periodSeconds: 10
          initialDelaySeconds: 20
        imagePullPolicy: Always
        ports:
        - containerPort: 4000
          name: agent
        - containerPort: 4001
          name: api
        - containerPort: 4369
          name: remote
        - containerPort: 9001
          name: remote1
        envFrom:
        - secretRef:
            name: backend-elixir
        - configMapRef:
            name: backend-elixir
        securityContext:
          privileged: true
          allowPrivilegeEscalation: true
        volumeMounts:
        - name: cache
          mountPath: /mnesia_cache
  volumeClaimTemplates:
  - metadata:
      name: cache
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: blazing
      resources:
        requests:
          storage: 1Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: db-proxy
  namespace: backend-elixir-staging
spec:
  replicas: 1
  revisionHistoryLimit: 5
  selector:
    matchLabels:
      app: db-proxy
  template:
    metadata:
      labels:
        app: db-proxy
    spec:
      affinity:
          podAntiAffinity:
            preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 50
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                  - key: app
                    operator: In
                    values:
                    - db-proxy
                topologyKey: kubernetes.io/hostname
      volumes:
      - name: cloudsql-instance-credentials
        secret:
          secretName: cloudsql-instance-credentials
      containers:
      - name: cloudsql-proxy
        image: gcr.io/cloudsql-docker/gce-proxy:1.11
        command: ["/cloud_sql_proxy",
                  "-instances=aero-falcon:us-west1:aero-staging=tcp:0.0.0.0:5432",
                  "-credential_file=/secrets/cloudsql/cloud-proxy-staging.json"]
        securityContext:
          runAsUser: 2  # non-root user
          allowPrivilegeEscalation: false
        volumeMounts:
          - name: cloudsql-instance-credentials
            mountPath: /secrets/cloudsql
            readOnly: true

gabrielrinaldi commented 5 years ago

@danschultzer our update strategy is to pull one node at a time, put the new one back in and wait for it to pass the health check. Then repeat with the rest of the pods.

bitwalker commented 5 years ago

@gabrielrinaldi In a cluster with dynamic membership, I would agree with you that Redis is a better choice, though they aren't really 1:1, as Mnesia is a distributed key/value store while Redis is not. If you are able to use Redis to solve the same problem, then Mnesia is definitely not worth the operational overhead in my opinion.

That said, in a cluster with static membership, Mnesia is able to provide some nice benefits if your use case is a good fit, notably for read-heavy workloads.

Since you are using StatefulSet here, you have the foundation needed for the static cluster. I think aside from the issues I've already mentioned, part of the problem here may be that libcluster is resolving an IP and using that to connect nodes, and what happens is that at some point the IP for a given node (say backend-elixir-0) changes, and this throws things out of wack. This is fixable in libcluster; and in your case, you can test whether it addresses the issue for you:

Change from using the DNSSRV libcluster strategy to the ErlangHosts strategy. You should probably be using that anyway right now until you get some code in place to handle dealing with adding new replicas to the cluster properly
Add a .hosts.erlang file to the $HOME directory of the container, with the following entries:

'app@backend-elixir-0.backend-elixir.backend-elixir-staging.svc.cluster.local'
'app@backend-elixir-1.backend-elixir.backend-elixir-staging.svc.cluster.local'

The above assumes a cluster of two pods, and since you haven't shared the Service definition, I'm guessing at the name (which is the second component of each DNS name). Likewise, the Erlang node name I've just filled in as app, but should be changed to whatever your node names are. Add more entries for the number of replicas you have.

Restart your cluster from scratch (i.e. delete the Mnesia disk copies - back them up if you need to save the data though), and then try and replicate the issue by performing a few rolling updates. If you are able to reproduce the problem again, then we know it is due to one of the other issues and not how libcluster is connecting the nodes in the DNSSRV strategy.

NOTE: If you are new to Elixir, trying to dive into using Mnesia right out of the gate may not be worth the effort. If it is the right solution to your problem, then that is one thing, just be aware that like many distributed systems, it takes additional preparation and maintenance to keep things running smoothly, it is not a "set it and forget it" component of OTP. That said, this is a great opportunity to learn more about how it works, so if you have some time to apply to figuring it out now, it is worth trying to at least get to the bottom of this particular problem.

bitwalker commented 5 years ago

Turns out I was mistaken about how libcluster resolves the IPs when using the DNSSRV strategy - it actually does use the DNS name, not the IP. Nevertheless, may still be worth switching to ErlangHosts for simplicity. However, it may also be that the nodes themselves are using their IP as their node hostname, and not the DNS, i.e. backend-elixir@x.x.x.x, rather than backend-elixir@backend-elixir-0.backend-elixir.backend-elixir-staging.cluster.svc.local. I'm not 100% on whether that matters to Mnesia, but keeping those linked up would eliminate one more possible issue from the list.

I'm still not 100% sure whether the fact that a node starts up without a DNS record is part of the issue here or not. If it is, then that complicates things one way or another, unless k8s has an easy way to assign static IPs to pods that I don't know about.