Closed: jhutchins closed this issue 1 year ago
(copying over discussions from the cockroa.ch/slack)
Someone else can chime in here (and do wait for a few other opinions), but all the building blocks for this kind of setup already exist. I’m not aware of existing users doing this, but theoretically you could mount an in-memory filesystem and point the CRDB node to blindly use that as the backing store.
On restart, that node would behave as if its entire directory was wiped out, so it would simply start over (essentially a fresh node). You'd have to up your replication factor accordingly like you mentioned, though I'm not familiar with how one would go about setting this up through k8s, or whether anything is fundamentally broken with this setup. My understanding is that any failure of this purely in-memory node would/should be treated/modeled/understood as a permanent failure, and a restart of the node would be treated/modeled/understood as the addition of a fresh node.
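Concretely, a minimal sketch of that approach outside of k8s (the in-memory store syntax and the zone-configuration statement are standard; the store size and replication factor here are only illustrative, and <existing-node> is a placeholder):

# Start a node whose only store lives in memory; on restart its data is gone,
# so the process comes back as a brand-new, empty node.
cockroach start --insecure --store=type=mem,size=50% --join=<existing-node>:26257

# Bump the default replication factor so the cluster can tolerate losing one
# such ephemeral node without losing quorum on any range.
cockroach sql --insecure --host=<existing-node> -e "ALTER RANGE default CONFIGURE ZONE USING num_replicas = 5;"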
I've experimented with doing this in Kubernetes using EmptyDir volumes with a StatefulSet, but nodes coming back with the same host address and no data seem to greatly confuse the cluster.
This should already work. What are the errors you're seeing?
The issue I see is that when the node restarts it joins the cluster as a new node and receives a new node ID; however, since it has the same hostname as before its restart, it gets messages for both the old and new node IDs. This seems to cause some level of confusion in the cluster, and as a result I see some ranges not getting fully replicated and lots of error messages on the restarted node regarding its old ID.
You can set up a minimal example to reproduce this issue in Kubernetes with the following YAML and commands.
apiVersion: v1
items:
- apiVersion: apps/v1
  kind: StatefulSet
  metadata:
    labels:
      app: cockroachdb
    name: cockroachdb
  spec:
    podManagementPolicy: Parallel
    replicas: 3
    selector:
      matchLabels:
        app: cockroachdb
    serviceName: cockroachdb
    template:
      metadata:
        labels:
          app: cockroachdb
      spec:
        affinity:
          podAntiAffinity:
            preferredDuringSchedulingIgnoredDuringExecution:
            - podAffinityTerm:
                labelSelector:
                  matchExpressions:
                  - key: component
                    operator: In
                    values:
                    - roach-test-cockroachdb
                topologyKey: kubernetes.io/hostname
              weight: 100
        containers:
        - command:
          - /bin/bash
          - -ecx
          - exec /cockroach/cockroach start --logtostderr --insecure --advertise-host
            $(hostname).${STATEFULSET_FQDN} --http-host 0.0.0.0 --http-port 8080 --port
            26257 --cache 25% --max-sql-memory 25% --join ${STATEFULSET_NAME}-0.${STATEFULSET_FQDN}:26257,${STATEFULSET_NAME}-1.${STATEFULSET_FQDN}:26257,${STATEFULSET_NAME}-2.${STATEFULSET_FQDN}:26257
          env:
          - name: STATEFULSET_NAME
            value: cockroachdb
          - name: STATEFULSET_FQDN
            value: cockroachdb
          image: cockroachdb/cockroach:v19.1.5
          imagePullPolicy: Always
          livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /health
              port: http
              scheme: HTTP
            initialDelaySeconds: 30
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 1
          name: cockroachdb
          ports:
          - containerPort: 26257
            name: grpc
            protocol: TCP
          - containerPort: 8080
            name: http
            protocol: TCP
          readinessProbe:
            failureThreshold: 2
            httpGet:
              path: /health?ready=1
              port: http
              scheme: HTTP
            initialDelaySeconds: 10
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 1
          volumeMounts:
          - mountPath: /cockroach/cockroach-data
            name: datadir
        volumes:
        - name: datadir
          emptyDir: {}
    updateStrategy:
      type: RollingUpdate
- apiVersion: v1
  kind: Service
  metadata:
    labels:
      app: cockroachdb
    name: cockroachdb
  spec:
    clusterIP: None
    ports:
    - name: grpc
      port: 26257
      protocol: TCP
      targetPort: 26257
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
    publishNotReadyAddresses: true
    selector:
      app: cockroachdb
    type: ClusterIP
- apiVersion: v1
  kind: Service
  metadata:
    labels:
      app: cockroachdb
    name: cockroachdb-public
  spec:
    ports:
    - name: grpc
      port: 26257
      protocol: TCP
      targetPort: 26257
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
    selector:
      app: cockroachdb
    type: ClusterIP
kind: List
metadata: {}
Create the Kubernetes resources.
kubectl create -f template.yaml
Wait for all the pods to be running, then init the cluster.
kubectl exec -ti cockroachdb-0 -- /cockroach/cockroach init --insecure
Delete one of the nodes.
kubectl delete po cockroachdb-2
Wait for the cluster to become fully replicated again, then delete another node.
kubectl delete po cockroachdb-1
The cluster will never finish replicating.
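One way to see the stuck state (and also to check for full replication between the delete steps above) is to look at node status and the under-replicated-ranges metric. This is a rough sketch; the exact output columns vary by version:

# The restarted pod shows up under a brand-new node ID, while its old ID
# lingers in the membership list and never becomes live again.
kubectl exec -ti cockroachdb-0 -- /cockroach/cockroach node status --insecure

# The under-replicated ranges counter on the surviving node stays non-zero.
kubectl port-forward cockroachdb-0 8080:8080 &
curl -s localhost:8080/_status/vars | grep ranges_underreplicated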
On cockroachdb-0 you see:
W191128 05:24:41.429983 81610 storage/store.go:3587 [n1,s1] raft error: node 2 claims to not contain store 2 for replica (n2,s2):?: store 2 was not found
W191128 05:24:41.430009 81608 storage/raft_transport.go:583 [n1] while processing outgoing Raft queue to node 2: store 2 was not found:
W191128 05:24:41.630093 81566 storage/store.go:3587 [n1,s1] raft error: node 2 claims to not contain store 2 for replica (n2,s2):?: store 2 was not found
W191128 05:24:41.630120 81439 storage/raft_transport.go:583 [n1] while processing outgoing Raft queue to node 2: store 2 was not found:
W191128 05:24:41.830098 81597 storage/store.go:3587 [n1,s1] raft error: node 2 claims to not contain store 2 for replica (n2,s2):?: store 2 was not found
W191128 05:24:41.830124 81595 storage/raft_transport.go:583 [n1] while processing outgoing Raft queue to node 2: store 2 was not found:
On cockroachdb-1 you see:
W191128 05:27:01.949250 53171 storage/raft_transport.go:281 unable to accept Raft message from (n1,s1):?: no handler registered for (n2,s2):?
W191128 05:27:02.106902 53161 storage/raft_transport.go:281 unable to accept Raft message from (n4,s4):?: no handler registered for (n2,s2):?
W191128 05:27:02.549532 53173 storage/raft_transport.go:281 unable to accept Raft message from (n1,s1):?: no handler registered for (n2,s2):?
W191128 05:27:02.707177 53163 storage/raft_transport.go:281 unable to accept Raft message from (n4,s4):?: no handler registered for (n2,s2):?
W191128 05:27:02.749343 53175 storage/raft_transport.go:281 unable to accept Raft message from (n1,s1):?: no handler registered for (n2,s2):?
On cockroachdb-2 you see:
W191128 05:28:20.053595 43358 storage/store.go:3587 [n4,s4] raft error: node 2 claims to not contain store 2 for replica (n2,s2):?: store 2 was not found
W191128 05:28:20.053640 43356 storage/raft_transport.go:583 [n4] while processing outgoing Raft queue to node 2: store 2 was not found:
W191128 05:28:20.653540 43412 storage/store.go:3587 [n4,s4] raft error: node 2 claims to not contain store 2 for replica (n2,s2):?: store 2 was not found
W191128 05:28:20.653576 43346 storage/raft_transport.go:583 [n4] while processing outgoing Raft queue to node 2: store 2 was not found:
W191128 05:28:21.053452 43310 storage/store.go:3587 [n4,s4] raft error: node 2 claims to not contain store 2 for replica (n2,s2):?: store 2 was not found
W191128 05:28:21.053482 43416 storage/raft_transport.go:583 [n4] while processing outgoing Raft queue to node 2: store 2 was not found:
I've also tried the exact same experiment with version 19.2.1 with similar results.
Beyond the issues with my ephemeral-storage use case, I think this might have implications for how the cluster handles nodes with disk failures in a more traditional setup.
Ideally, when the node restarted, instead of getting a new node ID it would somehow be informed of its old node ID and would continue to function using its previous identity.
Most of the team here is off due to holidays. I'm not too familiar with node ID re-use issues, or k8s more broadly (+cc @jseldess perhaps?), so I've handed this off elsewhere.
when the node restarted instead of getting a new node ID it would be informed somehow of its old node ID and would continue to function using its previous identity
I assume you're asking this assuming full disk failure? I don't believe we tie node IDs to hostnames durably, and with full disk failure the store is completely lost. The CRDB "node" abstraction is tightly coupled to the "store" (read: disk) abstraction. Restarting with a fresh disk effectively makes it a new node. One reason for this is Raft: it requires participants to durably persist records. With a temporary failure, it can catch the node up from where it left off. If the durability guarantee is no longer provided (with ephemeral storage, for example), each node failure has to be treated as a permanent one. Restarting the new node with the "same identity" would not work, as it no longer has access to records it was required to have (this would not be a problem if the node process simply crashed but the persisted data was still present).
To help me understand the problem you're having better: Why is it ideal to reuse node IDs?
@ajwerner: We should be robust to a new node at the same IP as of 19.2 (maybe even 19.1, I don't remember) at the rpc context level, but I don't think we tested it all that well. I wouldn't be stunned to find out we're caching an address or some other already-dialed connection and don't close it. We should write a roachtest, or at least a test, for this scenario.
Why is it ideal to reuse node IDs?
Two reasons: 1) The source of the issue appears to be multiple identities for the same hostname, so maintaining identity would fix that issue. 2) Obtaining a new identity means that node restarts create failed node records that need to be cleaned up.
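For reference, the cleanup I'm referring to looks roughly like this today (a sketch; node ID 2 is just the dead pre-restart incarnation from the logs above, and the flags are what 19.x accepts as far as I know):

# Find the node ID of the dead incarnation, then decommission it so the
# cluster stops tracking it as a failed node that might come back.
kubectl exec -ti cockroachdb-0 -- /cockroach/cockroach node status --insecure
kubectl exec -ti cockroachdb-0 -- /cockroach/cockroach node decommission 2 --insecure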
Zendesk ticket #4179 has been linked to this issue.
We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!
Non-persistent storage doesn't seem like something CockroachDB supports at the moment, but it seems like something it should be able to support. With a sufficiently high replication factor, a single node should be able to go down, come back up with an entirely empty data directory, and replicate back the data that it should have.
I've experimented with doing this in Kubernetes using EmptyDir volumes with a StatefulSet, but nodes coming back with the same host address and no data seem to greatly confuse the cluster. Is there any interest in this kind of feature? Or am I the only person who wants this?
Jira issue: CRDB-5325