atomix / consensus-storage

Atomix operator for multi-raft storage

ConsensusStore does not maintain status in relation to managed StatefulSet #1

Open ghost opened 1 year ago

ghost commented 1 year ago

My initial comment needs improvement; I'll post an update with better details shortly.

ghost commented 1 year ago

Overview

When a new ConsensusStore resource is created, it appears that no information from the managed StatefulSet is reflected in the ConsensusStore's status.

As an example, create this ConsensusStore:

cat <<EOF | kubectl apply -f -
apiVersion: consensus.atomix.io/v1beta1
kind: ConsensusStore
metadata:
  name: my-consensus-store
spec:
  replicas: 3
  groups: 30
  volumeClaimTemplate:
    spec:
      accessModes:
      - ReadWriteOnce
      storageClass: "standard"
      resources:
        requests:
          storage: 2Gi
EOF

The beginning of its life looks something like this:

% kubectl get ConsensusStores,StatefulSets
NAME                                                    STATUS
consensusstore.consensus.atomix.io/my-consensus-store   

NAME                                  READY   AGE
statefulset.apps/my-consensus-store   0/3     1s

% kubectl get ConsensusStores,StatefulSets        
NAME                                                    STATUS
consensusstore.consensus.atomix.io/my-consensus-store   

NAME                                  READY   AGE
statefulset.apps/my-consensus-store   2/3     15s

% kubectl get ConsensusStores,StatefulSets
NAME                                                    STATUS
consensusstore.consensus.atomix.io/my-consensus-store   NotReady

NAME                                  READY   AGE
statefulset.apps/my-consensus-store   3/3     21s

% kubectl get ConsensusStores,StatefulSets
NAME                                                    STATUS
consensusstore.consensus.atomix.io/my-consensus-store   Ready

NAME                                  READY   AGE
statefulset.apps/my-consensus-store   3/3     2m58s
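The progression above suggests a straightforward mapping from StatefulSet readiness to a store status. A minimal sketch of such a mapping, assuming the controller can read the StatefulSet's replica counts (the function name and the Pending phase are my suggestions, not part of the current controller):

```go
package main

import "fmt"

// storeStatus derives a ConsensusStore status from the managed
// StatefulSet's replica counts. "Pending" is a suggested new phase
// for the window where no replicas are ready yet.
func storeStatus(readyReplicas, desiredReplicas int32) string {
	switch {
	case readyReplicas == 0:
		return "Pending"
	case readyReplicas < desiredReplicas:
		return "NotReady"
	default:
		return "Ready"
	}
}

func main() {
	fmt.Println(storeStatus(0, 3)) // Pending
	fmt.Println(storeStatus(2, 3)) // NotReady
	fmt.Println(storeStatus(3, 3)) // Ready
}
```

With this mapping, the blank STATUS column in the transcripts above would show Pending from the moment the StatefulSet exists.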

When I submitted https://github.com/atomix/atomix.github.io/pull/26, I saw what appear to be unhandled conditions during startup as well:

2023-01-24T14:47:17.815Z        INFO    github.com/atomix/consensus-storage/controller/pkg/controller/consensus/v1beta1 v1beta1/cluster.go:440  Reconcile raft protocol service
2023-01-24T14:47:17.815Z        INFO    github.com/atomix/consensus-storage/controller/pkg/controller/consensus/v1beta1 v1beta1/cluster.go:485  Reconcile raft protocol headless service
2023-01-24T14:47:17.815Z        ERROR   github.com/atomix/consensus-storage/controller/pkg/controller/consensus/v1beta1 v1beta1/cluster.go:184  Pod "my-consensus-store-0" not foundReconcile MultiRaftCluster
github.com/atomix/consensus-storage/controller/pkg/controller/consensus/v1beta1.(*MultiRaftClusterReconciler).Reconcile
        github.com/atomix/consensus-storage/controller/pkg/controller/consensus/v1beta1/cluster.go:184
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
        sigs.k8s.io/controller-runtime@v0.12.1/pkg/internal/controller/controller.go:121
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        sigs.k8s.io/controller-runtime@v0.12.1/pkg/internal/controller/controller.go:320
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        sigs.k8s.io/controller-runtime@v0.12.1/pkg/internal/controller/controller.go:273
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        sigs.k8s.io/controller-runtime@v0.12.1/pkg/internal/controller/controller.go:234

Also, the controller does not seem to update its status when it can't apply changes to the StatefulSet, as in this example:

% kubectl get ConsensusStores,StatefulSets
NAME                                                    STATUS
consensusstore.consensus.atomix.io/my-consensus-store   Ready

NAME                                  READY   AGE
statefulset.apps/my-consensus-store   3/3     13m

% cat <<EOF | kubectl apply -f -
apiVersion: consensus.atomix.io/v1beta1
kind: ConsensusStore
metadata:
  name: my-consensus-store
spec:
  replicas: 3
  groups: 30
  volumeClaimTemplate:
    spec:
      accessModes:
      - ReadWriteOnce
      storageClass: "standard"
      resources:
        requests:
          storage: 2Gi
EOF
consensusstore.consensus.atomix.io/my-consensus-store configured

% kubectl get ConsensusStores
NAME                 STATUS
my-consensus-store   Ready

The controller actually can't apply this change to the StatefulSet, but I don't see any errors in the controller logs from it attempting the update. Only this is produced:

2023-01-24T17:45:50.169Z    INFO    github.com/atomix/consensus-storage/controller/pkg/controller/consensus/v1beta1 v1beta1/store.go:88 Reconcile ConsensusStore
2023-01-24T17:45:50.169Z    INFO    github.com/atomix/consensus-storage/controller/pkg/controller/consensus/v1beta1 v1beta1/store.go:99 Reconcile raft protocol stateful set

Trying to change the volumeClaimTemplate on a StatefulSet manually with kubectl produces this error:

The StatefulSet "web" is invalid: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', 'updateStrategy', 'persistentVolumeClaimRetentionPolicy' and 'minReadySeconds' are forbidden
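Since only a handful of StatefulSet spec fields accept in-place updates, the controller could check up front whether a requested change can succeed at all, and report it rather than silently reconciling. A sketch using the allow-list from the error message above (the helper name is hypothetical):

```go
package main

import "fmt"

// mutableStatefulSetFields lists the only StatefulSet spec fields the
// API server allows to change in place, per the error message above.
var mutableStatefulSetFields = map[string]bool{
	"replicas":                             true,
	"template":                             true,
	"updateStrategy":                       true,
	"persistentVolumeClaimRetentionPolicy": true,
	"minReadySeconds":                      true,
}

// isMutableStatefulSetField reports whether changing the given spec
// field could succeed as an in-place StatefulSet update.
func isMutableStatefulSetField(field string) bool {
	return mutableStatefulSetFields[field]
}

func main() {
	fmt.Println(isMutableStatefulSetField("replicas"))             // true
	fmt.Println(isMutableStatefulSetField("volumeClaimTemplates")) // false
}
```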

Possible Improvements

  1. In the following output, the StatefulSet is not ready because desired replicas != available replicas. The ConsensusStore could at least report a Pending status:

    % kubectl get ConsensusStore,StatefulSet
    NAME                                                    STATUS
    consensusstore.consensus.atomix.io/my-consensus-store   

    NAME                                  READY   AGE
    statefulset.apps/my-consensus-store   0/3     83m

  2. Surface an error somewhere when the controller fails to update the StatefulSet. Updating the status, creating an event, and writing an error log entry is probably enough to help users out.