cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

Non-persistent Storage #42769

Closed: jhutchins closed this issue 1 year ago

jhutchins commented 4 years ago

Non-persistent storage doesn't seem like something CockroachDB supports at the moment, but it seems like something it should be able to. With a sufficiently high replication factor, a single node should be able to go down, come back up with an entirely empty data directory, and replicate back the data that it should have.
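
For example, I would expect to bump the replication factor with something along these lines (the zone and the number here are just an illustration on my part):

cockroach sql --insecure -e "ALTER RANGE default CONFIGURE ZONE USING num_replicas = 5;"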

I've experimented with doing this in Kubernetes using emptyDir volumes with a StatefulSet, but nodes coming back with the same host address and no data seem to greatly confuse the cluster. Is there any interest in this kind of feature, or am I the only person who wants this?

Jira issue: CRDB-5325

irfansharif commented 4 years ago

(copying over discussion from cockroa.ch/slack)

Someone else can chime in here (and do wait for a few other opinions), but all the building blocks for this kind of setup already exist. I’m not aware of existing users doing this, but theoretically you could mount an in-memory filesystem and point the CRDB node to blindly use that as the backing store.

On restart, that node would behave as if its entire directory was wiped out, so it would simply start over (essentially a fresh node). You'd have to up your replication factor accordingly, like you mentioned, though I'm not familiar with how one would go about setting this up through k8s, or whether anything is fundamentally broken with this setup. My understanding is that any failure of this purely in-memory node would/should be treated/modeled/understood as a permanent failure, and a restart of the node would be treated/modeled/understood as the addition of a fresh node.
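
A rough sketch of what I mean (my illustration only, with placeholder sizes, paths, and join addresses): mount a tmpfs and point the store at it, or skip the mount entirely with the in-memory store spec that cockroach start accepts.

# mount an in-memory filesystem and use it as the node's only store
mkdir -p /mnt/crdb-mem
mount -t tmpfs -o size=4g tmpfs /mnt/crdb-mem
cockroach start --insecure --store=path=/mnt/crdb-mem --join=<existing-node>:26257

# or let cockroach keep the store in memory itself
cockroach start --insecure --store=type=mem,size=50% --join=<existing-node>:26257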

irfansharif commented 4 years ago

I've experimented with doing this in Kubernetes using emptyDir volumes with a StatefulSet, but nodes coming back with the same host address and no data seem to greatly confuse the cluster.

This should already work. What are the errors you're seeing?

jhutchins commented 4 years ago

The issue I see is that when the node restarts it joins the cluster as a new node and receives a new node ID; however, since it has the same hostname as before its restart, it gets messages for both its old and new node IDs. This seems to cause some level of confusion in the cluster, and as a result I see some ranges not getting fully replicated and lots of error messages on the restarted node with regard to its old ID.

jhutchins commented 4 years ago

You can set up a minimal example that shows this issue in Kubernetes with the following YAML and commands.

template.yaml

apiVersion: v1
items:
- apiVersion: apps/v1
  kind: StatefulSet
  metadata:
    labels:
      app: cockroachdb
    name: cockroachdb
  spec:
    podManagementPolicy: Parallel
    replicas: 3
    selector:
      matchLabels:
        app: cockroachdb
    serviceName: cockroachdb
    template:
      metadata:
        labels:
          app: cockroachdb
      spec:
        affinity:
          podAntiAffinity:
            preferredDuringSchedulingIgnoredDuringExecution:
            - podAffinityTerm:
                labelSelector:
                  matchExpressions:
                  - key: component
                    operator: In
                    values:
                    - roach-test-cockroachdb
                topologyKey: kubernetes.io/hostname
              weight: 100
        containers:
        - command:
          - /bin/bash
          - -ecx
          - exec /cockroach/cockroach start --logtostderr --insecure --advertise-host
            $(hostname).${STATEFULSET_FQDN} --http-host 0.0.0.0 --http-port 8080 --port
            26257 --cache 25% --max-sql-memory 25%  --join ${STATEFULSET_NAME}-0.${STATEFULSET_FQDN}:26257,${STATEFULSET_NAME}-1.${STATEFULSET_FQDN}:26257,${STATEFULSET_NAME}-2.${STATEFULSET_FQDN}:26257
          env:
          - name: STATEFULSET_NAME
            value: cockroachdb
          - name: STATEFULSET_FQDN
            value: cockroachdb
          image: cockroachdb/cockroach:v19.1.5
          imagePullPolicy: Always
          livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /health
              port: http
              scheme: HTTP
            initialDelaySeconds: 30
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 1
          name: cockroachdb
          ports:
          - containerPort: 26257
            name: grpc
            protocol: TCP
          - containerPort: 8080
            name: http
            protocol: TCP
          readinessProbe:
            failureThreshold: 2
            httpGet:
              path: /health?ready=1
              port: http
              scheme: HTTP
            initialDelaySeconds: 10
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 1
          volumeMounts:
          - mountPath: /cockroach/cockroach-data
            name: datadir
        volumes:
        # emptyDir is ephemeral: the data directory is wiped whenever the pod is deleted
        - name: datadir
          emptyDir: {}
    updateStrategy:
      type: RollingUpdate
- apiVersion: v1
  kind: Service
  metadata:
    labels:
      app: cockroachdb
    name: cockroachdb
  spec:
    clusterIP: None
    ports:
    - name: grpc
      port: 26257
      protocol: TCP
      targetPort: 26257
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
    publishNotReadyAddresses: true
    selector:
      app: cockroachdb
    type: ClusterIP
- apiVersion: v1
  kind: Service
  metadata:
    labels:
      app: cockroachdb
    name: cockroachdb-public
  spec:
    ports:
    - name: grpc
      port: 26257
      protocol: TCP
      targetPort: 26257
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
    selector:
      app: cockroachdb
    type: ClusterIP
kind: List
metadata: {}

Instructions

Create the Kubernetes resources.

kubectl create -f template.yaml

Wait for all the pods to be running, then init the cluster.
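
One way to watch for that (an illustrative command, not part of the original steps):

kubectl get pods -l app=cockroachdb -w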

kubectl exec -ti cockroachdb-0 -- /cockroach/cockroach init --insecure

Delete one of the nodes.

kubectl delete po cockroachdb-2

Wait for the cluster to become fully replicated again, then delete another node.

kubectl delete po cockroachdb-1

The cluster will never finish replicating.
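
One way to see the stuck replication (my own suggestion, not part of the original report) is the per-node range summary, which includes under-replicated range counts:

kubectl exec -ti cockroachdb-0 -- /cockroach/cockroach node status --ranges --insecure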

On cockroachdb-0 you see

W191128 05:24:41.429983 81610 storage/store.go:3587  [n1,s1] raft error: node 2 claims to not contain store 2 for replica (n2,s2):?: store 2 was not found
W191128 05:24:41.430009 81608 storage/raft_transport.go:583  [n1] while processing outgoing Raft queue to node 2: store 2 was not found:
W191128 05:24:41.630093 81566 storage/store.go:3587  [n1,s1] raft error: node 2 claims to not contain store 2 for replica (n2,s2):?: store 2 was not found
W191128 05:24:41.630120 81439 storage/raft_transport.go:583  [n1] while processing outgoing Raft queue to node 2: store 2 was not found:
W191128 05:24:41.830098 81597 storage/store.go:3587  [n1,s1] raft error: node 2 claims to not contain store 2 for replica (n2,s2):?: store 2 was not found
W191128 05:24:41.830124 81595 storage/raft_transport.go:583  [n1] while processing outgoing Raft queue to node 2: store 2 was not found:

On cockroachdb-1 you see

W191128 05:27:01.949250 53171 storage/raft_transport.go:281  unable to accept Raft message from (n1,s1):?: no handler registered for (n2,s2):?
W191128 05:27:02.106902 53161 storage/raft_transport.go:281  unable to accept Raft message from (n4,s4):?: no handler registered for (n2,s2):?
W191128 05:27:02.549532 53173 storage/raft_transport.go:281  unable to accept Raft message from (n1,s1):?: no handler registered for (n2,s2):?
W191128 05:27:02.707177 53163 storage/raft_transport.go:281  unable to accept Raft message from (n4,s4):?: no handler registered for (n2,s2):?
W191128 05:27:02.749343 53175 storage/raft_transport.go:281  unable to accept Raft message from (n1,s1):?: no handler registered for (n2,s2):?

On cockroachdb-2 you see

W191128 05:28:20.053595 43358 storage/store.go:3587  [n4,s4] raft error: node 2 claims to not contain store 2 for replica (n2,s2):?: store 2 was not found
W191128 05:28:20.053640 43356 storage/raft_transport.go:583  [n4] while processing outgoing Raft queue to node 2: store 2 was not found:
W191128 05:28:20.653540 43412 storage/store.go:3587  [n4,s4] raft error: node 2 claims to not contain store 2 for replica (n2,s2):?: store 2 was not found
W191128 05:28:20.653576 43346 storage/raft_transport.go:583  [n4] while processing outgoing Raft queue to node 2: store 2 was not found:
W191128 05:28:21.053452 43310 storage/store.go:3587  [n4,s4] raft error: node 2 claims to not contain store 2 for replica (n2,s2):?: store 2 was not found
W191128 05:28:21.053482 43416 storage/raft_transport.go:583  [n4] while processing outgoing Raft queue to node 2: store 2 was not found:
jhutchins commented 4 years ago

I've also tried the exact same experiment with version 19.2.1 with similar results.

Beyond the issues with my ephemeral-storage use case, I think this might have implications for how the cluster handles nodes with disk failures in a more traditional setup.

jhutchins commented 4 years ago

Ideally, for me, when the node restarted it would somehow be informed of its old node ID instead of getting a new one, and would continue to function using its previous identity.

irfansharif commented 4 years ago

Most of the team here is off due to holidays. I'm not too familiar with node ID re-use issues, or k8s more broadly (+cc @jseldess perhaps?), so I've handed this off elsewhere.

when the node restarted it would somehow be informed of its old node ID instead of getting a new one, and would continue to function using its previous identity

I assume you're asking this in the context of a full disk failure? I don't believe we durably tie node IDs to hostnames, and with full disk failure the store is completely lost. The CRDB "node" abstraction is tightly coupled to the "store" (read: disk) abstraction. Restarting with a fresh disk is effectively a new node. One reason for this is Raft: it requires participants to durably persist records. With a temporary failure, it can catch the node up from where it left off. If that durability guarantee is no longer provided (with ephemeral storage, for example), each node failure has to be treated as a permanent one. Restarting the new node with the "same identity" would not work, as it no longer has access to records it was required to have (this would not be a problem if the node process simply crashed but the persisted data was still present).

irfansharif commented 4 years ago

To help me understand the problem you're having better: Why is it ideal to reuse node IDs?

irfansharif commented 4 years ago

@ajwerner: We should be robust to a new node at the same IP as of 19.2 (maybe even 19.1, I don't remember) at the RPC context level, but I don't think we tested it all that well. I wouldn't be stunned to find out we're caching an address or some other already-dialed connection and don't close it. We should write a roachtest, or at least a test, for this scenario.

jhutchins commented 4 years ago

Why is it ideal to reuse node IDs?

Two reasons:

1. The source of the issue appears to be multiple identities for the same hostname, so maintaining identity would fix that issue.
2. Obtaining a new identity means that node restarts create failed node records that need to be cleaned up (see the sketch below).
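
Cleaning up those records today seems to mean decommissioning the dead node; a hedged sketch, assuming the dead node had been assigned node ID 2:

kubectl exec -ti cockroachdb-0 -- /cockroach/cockroach node decommission 2 --insecure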

RoachietheSupportRoach commented 4 years ago

Zendesk ticket #4179 has been linked to this issue.

github-actions[bot] commented 1 year ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!