Open gabrielrinaldi opened 5 years ago
Does it work if you restart without deleting the Mnesia disk files?
No, I get stuck with a CrashLoop. I did more testing, and here is what I found:
When one pod has the master node tables set and the other doesn't
When none of the pods have the master node tables set
If both pods have the master node tables set
What is the log for the app when the CrashLoop occurs? The same as above with :nxdomain error?
Also when you say master node tables, what do you refer to? In the Mnesia cluster all nodes are master nodes.
The initial error there is pretty clear, though I imagine the confusion is in why it happens - the lookup for the backend-elixir service fails because it is unable to resolve any records for the domain, and there are no records because there are no active pods for that service.
Without additional context it isn't clear when this error is occurring relative to the startup of the node, but I'm assuming it happens right away with the first pod after the previous instances were terminated, and before the new pods have been marked as live and thus made available as instances of the service from the k8s perspective.
Mnesia really requires something akin to StatefulSets - it is not friendly to environments where nodes come and go freely. It expects a certain set of nodes to be part of a cluster, membership changes in that cluster to be relatively rare, and those changes to be coordinated. Furthermore, it expects that the cluster is formed by bringing up all of the nodes, creating a schema with the set of nodes participating in the Mnesia cluster, then starting Mnesia on all of those nodes (including the one where the schema is first created). Once the schema is replicated, all of the nodes can be terminated safely and brought back up, but only if the schema data is persistent; otherwise you either have to start from scratch again, or bring up new nodes, use add_table_copy to introduce the new nodes to the cluster, then start Mnesia on those nodes. You may also see issues if you create a schema on more than one node, and then connect those nodes, since naturally the schemas will conflict. You cannot create a schema until the cluster is formed, unless you do so on only one node, then use add_table_copy like I mentioned above.
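The two bootstrap paths described above can be sketched roughly as follows. This is a minimal illustration, not a tested production procedure: the module and function names are hypothetical, it assumes the nodes are already connected via distributed Erlang, and it assumes disc-backed copies are wanted everywhere.

```elixir
defmodule MnesiaBootstrapSketch do
  @moduledoc """
  Hypothetical helper illustrating the two bootstrap paths.
  Module and function names are illustrative only.
  """

  # Path 1: run once, on ONE node, after all `nodes` are connected and while
  # Mnesia is stopped everywhere. Creates the schema on every listed node,
  # then starts Mnesia on all of them.
  def create_cluster(nodes) do
    case :mnesia.create_schema(nodes) do
      :ok -> start_everywhere(nodes)
      {:error, reason} -> {:error, reason}
    end
  end

  defp start_everywhere(nodes) do
    {_results, bad_nodes} = :rpc.multicall(nodes, :mnesia, :start, [])
    if bad_nodes == [], do: :ok, else: {:error, {:unreachable, bad_nodes}}
  end

  # Path 2: run on a NEW node that has no schema on disk yet, pointing it at
  # nodes that are already running Mnesia, then copy the listed tables over.
  def join_cluster(existing_nodes, tables) do
    :ok = :mnesia.start()

    with {:ok, [_ | _]} <- :mnesia.change_config(:extra_db_nodes, existing_nodes),
         {:atomic, :ok} <- :mnesia.change_table_copy_type(:schema, node(), :disc_copies) do
      Enum.each(tables, fn table ->
        :mnesia.add_table_copy(table, node(), :disc_copies)
      end)

      :ok
    else
      {:ok, []} -> {:error, :no_reachable_nodes}
      other -> {:error, other}
    end
  end
end
```

The important property is that only one node ever calls create_schema; every other node either is included in that call or joins later through the second path.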
I don't believe this is a libcluster issue per se, but we'll need more information for sure in order to diagnose the root cause. The initial :nxdomain error is pretty clear as I explained above, but even if you address that specific error, the bigger issue here is that new pods are not able to connect to the old cluster to sync the Mnesia schema before the old pods are shut down. This presumably causes the new pods to try and create the schema themselves, which will occur independently on all of them since the new pods won't see each other until they have SRV records, which doesn't happen until k8s marks them as live. Ultimately I think you will need to use something like StatefulSets to ensure that you can bring a cluster up with a fixed set of nodes properly; and manually introduce new nodes to that cluster as needed (as well as remove them).
Thanks for the detailed description @bitwalker!
@gabrielrinaldi has mentioned to me in the Elixir Slack channel that https://github.com/Shopify/kubernetes-deploy is used for deploys.
The Mnesia cache in Pow is a GenServer that automatically handles initialization and replication in Mnesia as long as a list of existing cluster nodes is set for the init callback. This is usually done by providing the value of Node.list() when starting the Mnesia cache. If any of the provided nodes has Mnesia running then it'll start replicating from the cluster, otherwise it'll load from what's persisted to disk (or create a new empty schema).
I've been helping with a guide on how to integrate libcluster with Pow/Mnesia.
The following setup with Cluster.Strategy.Kubernetes is what has worked for me, to ensure that a list of cluster nodes is passed on to the Mnesia cache:
def start(_type, _args) do
  topologies = [
    example: [
      strategy: Cluster.Strategy.Kubernetes,
      config: [
        # ...
      ]
    ]
  ]

  # List all child processes to be supervised
  children = [
    {Cluster.Supervisor, [topologies, [name: MyApp.ClusterSupervisor]]},
    MyApp.MnesiaClusterSupervisor,
    # Start the Ecto repository
    MyApp.Repo,
    # Start the endpoint when the application starts
    MyAppWeb.Endpoint
    # Starts a worker by calling: MyApp.Worker.start_link(arg)
    # {MyApp.Worker, arg},
  ]

  # See https://hexdocs.pm/elixir/Supervisor.html
  # for other strategies and supported options
  opts = [strategy: :one_for_one, name: MyApp.Supervisor]
  Supervisor.start_link(children, opts)
end

defmodule MyApp.MnesiaClusterSupervisor do
  use Supervisor

  def start_link(init_arg) do
    Supervisor.start_link(__MODULE__, init_arg, name: __MODULE__)
  end

  @impl true
  def init(_init_arg) do
    children = [
      {Pow.Store.Backend.MnesiaCache, extra_db_nodes: Node.list()},
      Pow.Store.Backend.MnesiaCache.Unsplit
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end
end
But I don't know if this is robust. Would it be better to restart the Mnesia cache in the :connect opt callback with the DNS SRV setup?
There is support for netsplit recovery btw. The thing I want to make sure of is that the Mnesia cache GenServer starts with at least one active cluster node (if a cluster exists) to init replication.
@danschultzer Yeah that's more or less what I was expecting. The problem(s) I described in my last comment are definitely possible with that setup.
There are a few factors at play:
NOTE: It is not clear from the kubernetes-deploy docs what all it does, at least at a glance, so what would be particularly helpful is a dump of the configuration for the Deployment/ReplicaSets involved.
The best setup here in my opinion is to be able to boot a node, have it fully start without starting Mnesia, but only be marked ready, not live; and once started, the node tries to join up with a pre-existing cluster for some period of time before deciding to form a new cluster. Once Mnesia is started, the components which depend on Mnesia are then able to be started as well (and depending on how you approach it, may be able to automatically start on their own once they see that Mnesia is ready). This technique is known to me as "stacking theory", and you can read more about the idea here. The key is that you don't necessarily try to accomplish absolutely everything in the initialization of the supervisor tree - some things should necessarily be deferred and lazily instantiated once they are ready, and as that happens, the system approaches operational readiness. Your liveness checks then just evaluate whether the system has fully started or not (or has at least obtained enough operational capability to be allowed to start servicing requests).
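The "try to join for a while, then decide to form a new cluster" step could look something like the sketch below. It is only an illustration under assumptions: the module name, return values, and polling interval are hypothetical, and it relies on something else (for example libcluster) connecting the nodes in the background so that Node.list() eventually becomes non-empty.

```elixir
defmodule JoinOrFormSketch do
  # Polls Node.list/0 until at least one peer is connected or the deadline
  # passes. Returns {:join, peers} or :form_new_cluster.
  def await_peers(timeout_ms, interval_ms \\ 500) do
    deadline = System.monotonic_time(:millisecond) + timeout_ms
    poll(deadline, interval_ms)
  end

  defp poll(deadline, interval_ms) do
    case Node.list() do
      [_ | _] = peers ->
        {:join, peers}

      [] ->
        if System.monotonic_time(:millisecond) >= deadline do
          :form_new_cluster
        else
          Process.sleep(interval_ms)
          poll(deadline, interval_ms)
        end
    end
  end
end
```

A process started late in the supervision tree could call JoinOrFormSketch.await_peers(30_000) and only then start the Mnesia-dependent children, passing the returned peers as extra_db_nodes when the result is {:join, peers}.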
The bottom line though, is that the current setup you've outlined is racy, and I believe that is the main culprit here. I could be more precise given more information about the exact sequence in which events occur (i.e. how deploys are rolled out, step by step, pod by pod); but hopefully what I've outlined here clarifies the issue well enough to allow you to troubleshoot. I really do want to stress that I don't think Mnesia is compatible with cattle-style nodes, it is designed for pet-style/persistent nodes. This also simplifies a lot of things (such as which node is responsible for initializing the schema/cluster), whereas trying to make Mnesia work with a shifting set of anonymous nodes is very fragile.
Thanks for the excellent response @bitwalker! That makes a lot of sense.
I found this blog post that goes into StatefulSet with Cassandra cluster which I think is what is needed here: https://medium.com/velotio-perspectives/exploring-upgrade-strategies-for-stateful-sets-in-kubernetes-c02b8286f251
@gabrielrinaldi what is your deploy strategy? Can you share the YAML?
Also any other info would help.
@bitwalker I am almost convinced that Redis would be a better option in a Kubernetes environment. Imagine that you have multiple deploys a day; the chance of something breaking is just too high.
@danschultzer here is our yml file:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: backend-elixir
  namespace: backend-elixir-staging
  annotations:
    fluentbit.io/parser: docker
spec:
  serviceName: backend-elixir
  replicas: 2
  revisionHistoryLimit: 5
  selector:
    matchLabels:
      app: backend-elixir
  template:
    metadata:
      labels:
        app: backend-elixir
      annotations:
        fluentbit.io/parser: docker
    spec:
      containers:
        - name: backend-elixir
          image: us.gcr.io/aero-falcon/backend-elixir:<%= current_sha %>
          resources:
            requests:
              memory: "1000Mi"
              cpu: "500m"
            limits:
              memory: "1500Mi"
              cpu: "1000m"
          livenessProbe:
            httpGet:
              path: /healthz
              port: 4001
              httpHeaders:
                - name: X-Custom-Header
                  value: Awesome
            timeoutSeconds: 5
            periodSeconds: 10
            initialDelaySeconds: 15
          readinessProbe:
            httpGet:
              path: /healthz
              port: 4001
              httpHeaders:
                - name: X-Custom-Header
                  value: Awesome
            timeoutSeconds: 10
            periodSeconds: 10
            initialDelaySeconds: 20
          imagePullPolicy: Always
          ports:
            - containerPort: 4000
              name: agent
            - containerPort: 4001
              name: api
            - containerPort: 4369
              name: remote
            - containerPort: 9001
              name: remote1
          envFrom:
            - secretRef:
                name: backend-elixir
            - configMapRef:
                name: backend-elixir
          securityContext:
            privileged: true
            allowPrivilegeEscalation: true
          volumeMounts:
            - name: cache
              mountPath: /mnesia_cache
  volumeClaimTemplates:
    - metadata:
        name: cache
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: blazing
        resources:
          requests:
            storage: 1Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: db-proxy
  namespace: backend-elixir-staging
spec:
  replicas: 1
  revisionHistoryLimit: 5
  selector:
    matchLabels:
      app: db-proxy
  template:
    metadata:
      labels:
        app: db-proxy
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 50
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - db-proxy
                topologyKey: kubernetes.io/hostname
      volumes:
        - name: cloudsql-instance-credentials
          secret:
            secretName: cloudsql-instance-credentials
      containers:
        - name: cloudsql-proxy
          image: gcr.io/cloudsql-docker/gce-proxy:1.11
          command: ["/cloud_sql_proxy",
                    "-instances=aero-falcon:us-west1:aero-staging=tcp:0.0.0.0:5432",
                    "-credential_file=/secrets/cloudsql/cloud-proxy-staging.json"]
          securityContext:
            runAsUser: 2 # non-root user
            allowPrivilegeEscalation: false
          volumeMounts:
            - name: cloudsql-instance-credentials
              mountPath: /secrets/cloudsql
              readOnly: true
@danschultzer our update strategy is to pull one node at a time, put the new one back in and wait for it to pass the health check. Then repeat with the rest of the pods.
@gabrielrinaldi In a cluster with dynamic membership, I would agree with you that Redis is a better choice, though they aren't really 1:1, as Mnesia is a distributed key/value store while Redis is not. If you are able to use Redis to solve the same problem, then Mnesia is definitely not worth the operational overhead in my opinion.
That said, in a cluster with static membership, Mnesia is able to provide some nice benefits if your use case is a good fit, notably for read-heavy workloads.
Since you are using StatefulSet here, you have the foundation needed for the static cluster. I think aside from the issues I've already mentioned, part of the problem here may be that libcluster is resolving an IP and using that to connect nodes, and what happens is that at some point the IP for a given node (say backend-elixir-0) changes, and this throws things out of whack. This is fixable in libcluster; and in your case, you can test whether it addresses the issue for you:
Switch from the DNSSRV libcluster strategy to the ErlangHosts strategy. You should probably be using that anyway right now until you get some code in place to handle adding new replicas to the cluster properly.
Add a .hosts.erlang file to the $HOME directory of the container, with the following entries:
'app@backend-elixir-0.backend-elixir.backend-elixir-staging.svc.cluster.local'
'app@backend-elixir-1.backend-elixir.backend-elixir-staging.svc.cluster.local'
The above assumes a cluster of two pods, and since you haven't shared the Service definition, I'm guessing at the name (which is the second component of each DNS name). Likewise, the Erlang node name I've just filled in as app, but should be changed to whatever your node names are. Add more entries for the number of replicas you have.
Restart your cluster from scratch (i.e. delete the Mnesia disk copies - back them up if you need to save the data though), and then try and replicate the issue by performing a few rolling updates. If you are able to reproduce the problem again, then we know it is due to one of the other issues and not how libcluster is connecting the nodes in the DNSSRV strategy.
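For the strategy switch suggested above, the topology passed to Cluster.Supervisor might look like the following. This is a sketch based on my reading of the libcluster docs: the topology name example is arbitrary, and the timeout value is an assumption to tune or drop.

```elixir
# Topology for Cluster.Supervisor using the ErlangHosts strategy.
# Discovery is driven by the .hosts.erlang file rather than DNS SRV lookups.
topologies = [
  example: [
    strategy: Cluster.Strategy.ErlangHosts,
    # Optional timeout in milliseconds (assumed value; adjust or omit)
    config: [timeout: 30_000]
  ]
]
```

One thing worth verifying against the net_adm documentation for your OTP version: it describes .hosts.erlang entries as quoted host names, each terminated by a period (for example 'backend-elixir-0.backend-elixir.backend-elixir-staging.svc.cluster.local'.), so the entries may need to be host names only rather than full node names.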
NOTE: If you are new to Elixir, trying to dive into using Mnesia right out of the gate may not be worth the effort. If it is the right solution to your problem, then that is one thing, just be aware that like many distributed systems, it takes additional preparation and maintenance to keep things running smoothly, it is not a "set it and forget it" component of OTP. That said, this is a great opportunity to learn more about how it works, so if you have some time to apply to figuring it out now, it is worth trying to at least get to the bottom of this particular problem.
Turns out I was mistaken about how libcluster resolves the IPs when using the DNSSRV strategy - it actually does use the DNS name, not the IP. Nevertheless, it may still be worth switching to ErlangHosts for simplicity. However, it may also be that the nodes themselves are using their IP as their node hostname, and not the DNS name, i.e. backend-elixir@x.x.x.x, rather than backend-elixir@backend-elixir-0.backend-elixir.backend-elixir-staging.svc.cluster.local. I'm not 100% on whether that matters to Mnesia, but keeping those linked up would eliminate one more possible issue from the list.
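A quick way to check which form a running node is using is to inspect the host part of its node name. The helper below is a hypothetical sketch (the module name is made up); it only reports whether the part after the @ parses as an IP address.

```elixir
defmodule NodeNameCheckSketch do
  # Returns true when the host part of a node name parses as an IPv4 or
  # IPv6 address, false when it looks like a DNS name (or has no host part).
  def ip_host?(node_name \\ node()) do
    case node_name |> Atom.to_string() |> String.split("@", parts: 2) do
      [_name, host] ->
        match?({:ok, _}, :inet.parse_address(String.to_charlist(host)))

      _ ->
        false
    end
  end
end
```

Calling NodeNameCheckSketch.ip_host?() from a remote console on each pod shows whether that pod started with an IP-based name, in which case the release's node name configuration would be the place to look.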
I'm still not 100% sure whether the fact that a node starts up without a DNS record is part of the issue here or not. If it is, then that complicates things one way or another, unless k8s has an easy way to assign static IPs to pods that I don't know about.
I am using Cluster.Strategy.Kubernetes.DNSSRV for the strategy. Everything works fine on the initial deploy, but after a few deploys I get the following error:
The only way to recover from it is to delete the Mnesia cache disk for each of the pods and restart everything.
The only thing using Mnesia so far is Pow.
I am new to Elixir, but I am happy to gather more information if someone points me in the right direction.