IBM / ibm-storage-odf-console

ibm-storage-odf-console provides the IBM storage-specific console pages that the ODF console loads when end users access IBM storage. It is designed to display IBM-specific storage attributes to customers. The current scope covers IBM FlashSystem only.
Apache License 2.0

Deleting StorageSystem resource hangs indefinitely #34

Closed: stewartad closed this issue 2 years ago

stewartad commented 3 years ago
(Screenshot attached: Screen Shot 2021-09-21 at 11.57.53 AM)

I came across this on cluster s04-mc154 after following the ODF install guide provided by the team. I had put in the incorrect user/password for the FlashSystem storage, and couldn't immediately find if there was a place to change them, so I decided to delete the created StorageSystems and start over. Deleting the "ibm-flashsystem-storage-storagesystem" resource worked fine, but "ocs-storagecluster-storagesystem" has been stuck in Terminating status since yesterday.

It appears to be waiting on a CephObjectStoreUser to be deleted. I tried manually running oc delete CephObjectStoreUser ocs-storagecluster-cephobjectstoreuser but it hangs too, even with the --force option. Then, I tried the cleanup script, which also hangs.
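
Roughly the commands involved, for reference (the --grace-period=0 flag is an assumption about how --force was applied; the cleanup-script step is omitted since it comes from the install guide):

# deleting the StorageSystem hangs waiting on the CephObjectStoreUser
oc delete storagesystem ocs-storagecluster-storagesystem -n openshift-storage

# deleting the CephObjectStoreUser directly also hangs, even when forced
oc delete cephobjectstoreuser ocs-storagecluster-cephobjectstoreuser -n openshift-storage
oc delete cephobjectstoreuser ocs-storagecluster-cephobjectstoreuser -n openshift-storage --force --grace-period=0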

[adstew@mc154 ~]$ oc describe storagesystem ocs-storagecluster-storagesystem
Name:         ocs-storagecluster-storagesystem
Namespace:    openshift-storage
Labels:       <none>
Annotations:  <none>
API Version:  odf.openshift.io/v1alpha1
Kind:         StorageSystem
Metadata:
  Creation Timestamp:             2021-09-20T19:10:30Z
  Deletion Grace Period Seconds:  0
  Deletion Timestamp:             2021-09-20T19:33:14Z
  Finalizers:
    storagesystem.odf.openshift.io
  Generation:  2
  Managed Fields:
    API Version:  odf.openshift.io/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
          .:
          v:"storagesystem.odf.openshift.io":
      f:spec:
        .:
        f:kind:
        f:name:
        f:namespace:
      f:status:
        .:
        f:conditions:
        f:relatedObjects:
    Manager:         manager
    Operation:       Update
    Time:            2021-09-20T19:10:31Z
  Resource Version:  210396127
  UID:               d6b7de0b-fab5-42e8-9cd1-e34822f63882
Spec:
  Kind:       storagecluster.ocs.openshift.io/v1
  Name:       ocs-storagecluster
  Namespace:  openshift-storage
Status:
  Conditions:
    Last Heartbeat Time:   2021-09-21T19:31:53Z
    Last Transition Time:  2021-09-20T19:33:14Z
    Message:               Deletion is in progress
    Reason:                Deleting
    Status:                False
    Type:                  Available
    Last Heartbeat Time:   2021-09-21T19:31:53Z
    Last Transition Time:  2021-09-20T19:33:14Z
    Message:               Deletion is in progress
    Reason:                Deleting
    Status:                True
    Type:                  Progressing
    Last Heartbeat Time:   2021-09-21T19:31:53Z
    Last Transition Time:  2021-09-20T19:10:30Z
    Message:               StorageSystem CR is valid
    Reason:                Valid
    Status:                False
    Type:                  StorageSystemInvalid
    Last Heartbeat Time:   2021-09-20T19:27:51Z
    Last Transition Time:  2021-09-20T19:10:31Z
    Reason:                Ready
    Status:                True
    Type:                  VendorCsvReady
    Last Heartbeat Time:   2021-09-20T19:27:51Z
    Last Transition Time:  2021-09-20T19:10:31Z
    Reason:                Found
    Status:                True
    Type:                  VendorSystemPresent
  Related Objects:
    API Version:       operators.coreos.com/v1alpha1
    Kind:              Subscription
    Name:              ocs-operator-alpha-odf-catalogsource-openshift-storage
    Namespace:         openshift-storage
    Resource Version:  208586016
    UID:               a9b29a81-59c2-4082-8fa7-727565468962
    API Version:       operators.coreos.com/v1alpha1
    Kind:              ClusterServiceVersion
    Name:              ocs-operator.v4.9.0
    Namespace:         openshift-storage
    Resource Version:  208634547
    UID:               7bfb5445-1322-4aa6-9089-02a02556d21d
    API Version:       ocs.openshift.io/v1
    Kind:              StorageCluster
    Name:              ocs-storagecluster
    Namespace:         openshift-storage
    Resource Version:  208632033
    UID:               c143af84-e06a-48e4-b44a-9d4587d7c96b
Events:
  Type     Reason           Age                 From                      Message
  ----     ------           ----                ----                      -------
  Warning  ReconcileFailed  35m (x22 over 24h)  StorageSystem controller  Waiting for storagecluster.ocs.openshift.io/v1 ocs-storagecluster to be deleted
[adstew@mc154 ~]$ oc describe storagecluster ocs-storagecluster
Name:         ocs-storagecluster
Namespace:    openshift-storage
Labels:       <none>
Annotations:  storagesystem.odf.openshift.io/watched-by: ocs-storagecluster-storagesystem
              uninstall.ocs.openshift.io/cleanup-policy: delete
              uninstall.ocs.openshift.io/mode: graceful
API Version:  ocs.openshift.io/v1
Kind:         StorageCluster
Metadata:
  Creation Timestamp:             2021-09-20T19:10:30Z
  Deletion Grace Period Seconds:  0
  Deletion Timestamp:             2021-09-20T19:33:14Z
  Finalizers:
    storagecluster.ocs.openshift.io
  Generation:  4
  Managed Fields:
    API Version:  ocs.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        .:
        f:arbiter:
        f:encryption:
          .:
          f:kms:
        f:nodeTopologies:
        f:resources:
          .:
          f:mds:
            .:
            f:limits:
              .:
              f:cpu:
              f:memory:
            f:requests:
              .:
              f:cpu:
              f:memory:
          f:rgw:
            .:
            f:limits:
              .:
              f:cpu:
              f:memory:
            f:requests:
              .:
              f:cpu:
              f:memory:
    Manager:      Mozilla
    Operation:    Update
    Time:         2021-09-20T19:10:30Z
    API Version:  ocs.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          f:storagesystem.odf.openshift.io/watched-by:
    Manager:      manager
    Operation:    Update
    Time:         2021-09-20T19:10:31Z
    API Version:  ocs.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:uninstall.ocs.openshift.io/cleanup-policy:
          f:uninstall.ocs.openshift.io/mode:
        f:finalizers:
          .:
          v:"storagecluster.ocs.openshift.io":
      f:spec:
        f:externalStorage:
        f:managedResources:
          .:
          f:cephBlockPools:
          f:cephConfig:
          f:cephDashboard:
          f:cephFilesystems:
          f:cephObjectStoreUsers:
          f:cephObjectStores:
        f:storageDeviceSets:
        f:version:
      f:status:
        .:
        f:conditions:
        f:failureDomain:
        f:failureDomainKey:
        f:failureDomainValues:
        f:images:
          .:
          f:ceph:
            .:
            f:actualImage:
            f:desiredImage:
          f:noobaaCore:
            .:
            f:desiredImage:
          f:noobaaDB:
            .:
            f:desiredImage:
        f:nodeTopologies:
          .:
          f:labels:
            .:
            f:kubernetes.io/hostname:
            f:topology.rook.io/rack:
        f:phase:
        f:relatedObjects:
    Manager:         ocs-operator
    Operation:       Update
    Time:            2021-09-20T19:21:28Z
  Resource Version:  208642636
  UID:               c143af84-e06a-48e4-b44a-9d4587d7c96b
Spec:
  Arbiter:
  Encryption:
    Kms:
  External Storage:
  Managed Resources:
    Ceph Block Pools:
    Ceph Config:
    Ceph Dashboard:
    Ceph Filesystems:
    Ceph Object Store Users:
    Ceph Object Stores:
  Node Topologies:
  Resources:
    Mds:
      Limits:
        Cpu:     3
        Memory:  8Gi
      Requests:
        Cpu:     1
        Memory:  8Gi
    Rgw:
      Limits:
        Cpu:     2
        Memory:  4Gi
      Requests:
        Cpu:     1
        Memory:  4Gi
  Storage Device Sets:
    Config:
    Count:  1
    Data PVC Template:
      Metadata:
      Spec:
        Access Modes:
          ReadWriteOnce
        Resources:
          Requests:
            Storage:         512Gi
        Storage Class Name:  flashsystem-sc
        Volume Mode:         Block
      Status:
    Name:  ocs-deviceset-flashsystem-sc
    Placement:
    Portable:  true
    Prepare Placement:
    Replica:  3
    Resources:
      Limits:
        Cpu:     2
        Memory:  5Gi
      Requests:
        Cpu:     1
        Memory:  5Gi
  Version:       1.16
Status:
  Conditions:
    Last Heartbeat Time:   2021-09-20T19:30:41Z
    Last Transition Time:  2021-09-20T19:10:31Z
    Message:               Error while reconciling: some StorageClasses [ocs-storagecluster-cephfs,ocs-storagecluster-ceph-rbd,ocs-storagecluster-ceph-rbd-thick] were skipped while waiting for pre-requisites to be met
    Reason:                ReconcileFailed
    Status:                False
    Type:                  ReconcileComplete
    Last Heartbeat Time:   2021-09-20T19:10:31Z
    Last Transition Time:  2021-09-20T19:10:31Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                False
    Type:                  Available
    Last Heartbeat Time:   2021-09-20T19:10:31Z
    Last Transition Time:  2021-09-20T19:10:31Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                True
    Type:                  Progressing
    Last Heartbeat Time:   2021-09-20T19:10:31Z
    Last Transition Time:  2021-09-20T19:10:31Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                False
    Type:                  Degraded
    Last Heartbeat Time:   2021-09-20T19:10:31Z
    Last Transition Time:  2021-09-20T19:10:31Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                Unknown
    Type:                  Upgradeable
  Failure Domain:          rack
  Failure Domain Key:      topology.rook.io/rack
  Failure Domain Values:
    rack0
    rack1
    rack2
  Images:
    Ceph:
      Actual Image:   ceph/daemon-base:latest-pacific
      Desired Image:  ceph/daemon-base:latest-pacific
    Noobaa Core:
      Desired Image:  noobaa/noobaa-core:master-20210609
    Noobaa DB:
      Desired Image:  centos/postgresql-12-centos7
  Node Topologies:
    Labels:
      kubernetes.io/hostname:
        mc159
        mc160
        mc155
        mc156
        mc157
        mc158
      topology.rook.io/rack:
        rack0
        rack1
        rack2
  Phase:  Deleting
  Related Objects:
    API Version:       ceph.rook.io/v1
    Kind:              CephCluster
    Name:              ocs-storagecluster-cephcluster
    Namespace:         openshift-storage
    Resource Version:  208633745
    UID:               54bc7e77-fda7-41e3-94c2-2858b9118570
Events:
  Type     Reason            Age                 From                       Message
  ----     ------            ----                ----                       -------
  Warning  UninstallPending  83s (x25 over 24h)  controller_storagecluster  uninstall: Waiting for CephObjectStoreUser ocs-storagecluster-cephobjectstoreuser to be deleted
[adstew@mc154 ~]$ oc get CephObjectStoreUser ocs-storagecluster-cephobjectstoreuser
NAME                                     AGE
ocs-storagecluster-cephobjectstoreuser   24h
[adstew@mc154 ~]$
stewartad commented 3 years ago

I was able to get around this by running oc edit on each stuck resource and deleting any finalizers.
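
For anyone else who hits this, a rough sketch of that workaround (resource names are taken from the output above; clearing metadata.finalizers with a merge patch is equivalent to deleting the entries in oc edit):

# Option 1: open each stuck resource and delete the metadata.finalizers entries by hand
oc edit cephobjectstoreuser ocs-storagecluster-cephobjectstoreuser -n openshift-storage
oc edit storagecluster ocs-storagecluster -n openshift-storage
oc edit storagesystem ocs-storagecluster-storagesystem -n openshift-storage

# Option 2: clear the finalizers non-interactively with a merge patch
oc patch cephobjectstoreuser ocs-storagecluster-cephobjectstoreuser -n openshift-storage --type=merge -p '{"metadata":{"finalizers":null}}'
oc patch storagecluster ocs-storagecluster -n openshift-storage --type=merge -p '{"metadata":{"finalizers":null}}'
oc patch storagesystem ocs-storagecluster-storagesystem -n openshift-storage --type=merge -p '{"metadata":{"finalizers":null}}'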

shdn-ibm commented 3 years ago

Removing the finalizer is the workaround. Follow the uninstall guide below: https://ibm.ent.box.com/notes/860252064802

MonicaLemay commented 3 years ago

Is this a temporary workaround, or is it something customers will have to do and that will be documented in the official docs?

Also, I just looked at the Box note mentioned above and searched for the word "finalizer"; I could not find it. I also can't find an "uninstall guide". Can you paste an image of the uninstall guide, or point more specifically to where it is documented?

stewartad commented 3 years ago

As Monica mentioned, I do not see an uninstall guide, only the link to the script, which got stuck when I tried it.

shdn-ibm commented 3 years ago

The reason it hangs at the finalizer is that the installation did not complete successfully, and the OCS operator is still trying to remove resources or is hitting failures. In a real customer case, users need to wait until the OCS operator completes the cleanup, or, if there is a failure, report a bug against OCS. Removing the finalizer is not a good approach in a real customer case, as it may leave resources behind in the cluster. In any case, let's keep this issue open; hold the system the next time this occurs so I can investigate in depth.
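
Before removing anything, something like the following should show what the operator is still waiting on (assuming the OCS operator deployment is named ocs-operator, which may differ by release):

# recent events in the namespace, newest last
oc get events -n openshift-storage --sort-by=.lastTimestamp

# operator logs, which report what the uninstall is blocked on
oc logs deployment/ocs-operator -n openshift-storage --tail=100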

shdn-ibm commented 3 years ago

The cleanup guide has been updated. If the same issue is not captured again, we can close this.