Orange-OpenSource / casskop

This Kubernetes operator automates the Cassandra operations such as deploying a new rack aware cluster, adding/removing nodes, configuring the C* and JVM parameters, upgrading JVM and C* versions, and many more...
https://orange-opensource.github.io/casskop/
Apache License 2.0

Unattached block storages created #209

Closed sheerun closed 4 years ago

sheerun commented 4 years ago

I've created a Cassandra cluster with the configuration below, but the operator ended up provisioning a few extra unattached volumes. I think it might be because of a delay between the operator requesting a volume and the volume actually being created. Or maybe it is because, on Scaleway, requesting a PVC with a given name is not idempotent and duplicate names are allowed (they generate their own unique IDs for volumes).

Hosting provider: Scaleway kapsule

Cluster configuration:

apiVersion: "db.orange.com/v1alpha1"
kind: "CassandraCluster"
metadata:
  name: cassandra-cluster
  labels:
    cluster: k8s.kaas
spec:
  cassandraImage: cassandra:3.11.6
  bootstrapImage: orangeopensource/cassandra-bootstrap:0.1.4
  configMapName: cassandra-configmap-v1
  dataCapacity: "10Gi"
  dataStorageClass: ""
  imagepullpolicy: IfNotPresent  
  hardAntiAffinity: false
  deletePVC: true
  autoPilot: true
  gcStdout: true
  autoUpdateSeedList: true
  maxPodUnavailable: 1
  resources:         
    requests:
      cpu: '500m'
      memory: '512Mi'
    limits:
      cpu: '1000m'
      memory: '1024Mi'
  topology:
    dc:
      - name: dc1
        nodesPerRacks: 1
        rack:
          - name: rack1
          - name: rack2
          - name: rack3

Screenshot of extra volumes: https://imgur.com/a/CvA3Hqz

At the Kubernetes level there seem to be only 3 PVCs:

kubectl get pvc
NAME                                 STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
data-cassandra-cluster-dc1-rack1-0   Bound    pvc-983c3a5f-3ff8-458f-9912-1d23e0c5acf7   10Gi       RWO            scw-bssd       15m
data-cassandra-cluster-dc1-rack2-0   Bound    pvc-9e4a6819-0b65-4d3b-9f89-12494bda12e3   10Gi       RWO            scw-bssd       12m
data-cassandra-cluster-dc1-rack3-0   Bound    pvc-c2752aa0-8e8f-4cdc-b334-fcae5bfb7c07   10Gi       RWO            scw-bssd       10m

All pods are in ready state:

kubectl get pod
NAME                                          READY   STATUS    RESTARTS   AGE
cassandra-cluster-dc1-rack1-0                 1/1     Running   0          28m
cassandra-cluster-dc1-rack2-0                 1/1     Running   0          25m
cassandra-cluster-dc1-rack3-0                 1/1     Running   0          24m
casskop-cassandra-operator-5856b56ccd-bjb85   1/1     Running   0          35h

What is interesting is that all the extra unattached volumes have the same name as the first attached volume: 1f7c1873-9f5b-46e4-934d-1ff3a6b41209_pvc-983c3a5f-3ff8-458f-9912-1d23e0c5acf7
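For anyone who wants to check for the same symptom, here is a rough sketch of how the duplicates could be spotted from a volume listing. It assumes the listing has already been exported from the provider into plain records with `id`, `name` and `attached` fields; that record shape, the function name and the `vol-*` IDs are all made up for illustration and are not Scaleway's API.

```python
from collections import defaultdict


def find_unattached_duplicates(volumes):
    """Map each volume name used more than once to the IDs of its unattached copies.

    `volumes` is a list of dicts with the (hypothetical) keys
    "id", "name" and "attached".
    """
    by_name = defaultdict(list)
    for vol in volumes:
        by_name[vol["name"]].append(vol)

    leaked = {}
    for name, group in by_name.items():
        if len(group) > 1:
            orphans = [v["id"] for v in group if not v["attached"]]
            if orphans:
                leaked[name] = orphans
    return leaked


if __name__ == "__main__":
    # Illustrative data only: names are the PV names from `kubectl get pvc`,
    # the IDs are placeholders.
    sample = [
        {"id": "vol-1", "name": "pvc-983c3a5f-3ff8-458f-9912-1d23e0c5acf7", "attached": True},
        {"id": "vol-2", "name": "pvc-983c3a5f-3ff8-458f-9912-1d23e0c5acf7", "attached": False},
        {"id": "vol-3", "name": "pvc-9e4a6819-0b65-4d3b-9f89-12494bda12e3", "attached": True},
    ]
    print(find_unattached_duplicates(sample))
```

Volumes whose name is unique are ignored, so a healthy cluster should give an empty result.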

Logs: https://pastebin.com/0zgSqP81

ping @scaleway

sheerun commented 4 years ago

Also, here are the events from the first PVC:

Events:
  Type     Reason                 Age                From                                                                                  Message
  ----     ------                 ----               ----                                                                                  -------
  Warning  ProvisioningFailed     31m (x5 over 32m)  csi.scaleway.com_csi-controller-df8ffbf57-5rckt_29f03a79-7ca1-47d3-af60-eb51413ac1ac  failed to provision volume with StorageClass "scw-bssd": rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Normal   ExternalProvisioning   31m (x7 over 32m)  persistentvolume-controller                                                           waiting for a volume to be created, either by external provisioner "csi.scaleway.com" or manually created by system administrator
  Normal   Provisioning           31m (x6 over 32m)  csi.scaleway.com_csi-controller-df8ffbf57-5rckt_29f03a79-7ca1-47d3-af60-eb51413ac1ac  External provisioner is provisioning volume for claim "default/data-cassandra-cluster-dc1-rack1-0"
  Normal   ProvisioningSucceeded  31m                csi.scaleway.com_csi-controller-df8ffbf57-5rckt_29f03a79-7ca1-47d3-af60-eb51413ac1ac  Successfully provisioned volume pvc-983c3a5f-3ff8-458f-9912-1d23e0c5acf7

The number of timeouts seems to match the number of extra volumes.
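To illustrate why timeouts and extra volumes could line up like this, here is a small simulation of the hypothesis. It is only a sketch and not a description of the actual CSI driver: `FakeBackend`, `provision` and the lookup-by-name step are hypothetical. The assumed behaviour is that each create call really allocates a volume on the backend even when the caller gives up with a deadline error, and that the backend allows duplicate names.

```python
import itertools


class FakeBackend:
    """Hypothetical provider API where volume names are not unique."""

    def __init__(self, timeouts_before_success):
        self.volumes = []  # list of (id, name) tuples
        self._ids = itertools.count(1)
        self._timeouts_left = timeouts_before_success

    def create(self, name):
        # The volume is allocated first; the caller may still see a timeout.
        vol_id = next(self._ids)
        self.volumes.append((vol_id, name))
        if self._timeouts_left > 0:
            self._timeouts_left -= 1
            raise TimeoutError("context deadline exceeded")
        return vol_id

    def find_by_name(self, name):
        for vol_id, vol_name in self.volumes:
            if vol_name == name:
                return vol_id
        return None


def provision(backend, name, idempotent, max_attempts=10):
    """Retry creation on timeout; optionally reuse an existing volume by name."""
    for _ in range(max_attempts):
        try:
            if idempotent:
                existing = backend.find_by_name(name)
                if existing is not None:
                    return existing
            return backend.create(name)
        except TimeoutError:
            continue
    raise RuntimeError("provisioning did not succeed")


if __name__ == "__main__":
    naive = FakeBackend(timeouts_before_success=5)
    provision(naive, "pvc-983c3a5f", idempotent=False)
    print("volumes without name check:", len(naive.volumes))

    careful = FakeBackend(timeouts_before_success=5)
    provision(careful, "pvc-983c3a5f", idempotent=True)
    print("volumes with name check:", len(careful.volumes))
```

In the naive run, every timed-out attempt leaves a volume behind, so the number of leaked volumes equals the number of timeouts. With the name lookup before each retry, the first volume is reused and nothing is leaked.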

sheerun commented 4 years ago

I've notified Scaleway support, as this seems to be an issue on Scaleway's side rather than Casskop's.

Sh4d1 commented 4 years ago

@sheerun I saw the ticket. It was a problem with an old version of our CSI and should be fixed now. Please open an issue in https://github.com/scaleway/scaleway-csi/ if it happens again!

fdehay commented 4 years ago

Thanks @Sh4d1 for commenting and welcome to Casskop :) Please @sheerun could you close the issue if you confirm it works now.

sheerun commented 4 years ago

Indeed it seems to work now :) Thank you!