apecloud / kubeblocks

KubeBlocks is an open-source control plane software that runs and manages databases, message queues and other stateful applications on K8s.
https://kubeblocks.io
GNU Affero General Public License v3.0

[BUG] ggml cluster is stuck in Deleting after upgrading kb from 0.8.3 to 0.9.0 #7687

Closed JashBook closed 3 months ago

JashBook commented 3 months ago

Describe the bug

kbcli version
Kubernetes: v1.27.13-eks-3af4770
KubeBlocks: 0.9.0-beta.42
kbcli: 0.9.0-beta.1

To Reproduce

Steps to reproduce the behavior:

1. install kb 0.8.3

curl -fsSL https://kubeblocks.io/installer/install_cli.sh | bash -s v0.8.4-beta.1

kbcli kubeblocks install --create-namespace --version 0.8.3 --set image.registry=docker.io --set dataProtection.image.registry=docker.io --set addonChartsImage.registry=docker.io --set dataProtection.image.datasafed.tag=0.1.0 --namespace kb-ohfmhs

2. create ggml cluster

kbcli addon enable llm

kbcli cluster create ggml-ohfmhs --termination-policy=Halt --monitoring-interval=0 --cluster-definition=ggml --enable-all-logs=false --cluster-version=ggml-baichuan2-13b-q4 --set cpu=1000m,memory=6Gi,replicas=1,storage=20Gi --namespace ns-ohfmhs

3. upgrade kb to 0.9.0

curl -fsSL https://kubeblocks.io/installer/install_cli.sh | bash -s v0.9.0-beta.2

kbcli kubeblocks upgrade --auto-approve --set upgradeAddons=true --version 0.9.0-beta.42 --set image.registry=docker.io --set dataProtection.image.registry=docker.io --set addonChartsImage.registry=docker.io --set dataProtection.image.datasafed.tag=0.2.0 --namespace kb-ohfmhs

kbcli addon list llm
NAME   VERSION   PROVIDER    STATUS    AUTO-INSTALL
llm    0.9.0     community   Enabled   false

4. stop, start, then delete the cluster

kbcli cluster stop ggml-ohfmhs --auto-approve --force=true --namespace ns-ohfmhs

kbcli cluster start ggml-ohfmhs --force=true --namespace ns-ohfmhs

kbcli cluster delete ggml-ohfmhs --auto-approve --namespace ns-ohfmhs

5. See error

kubectl get cluster -n ns-ohfmhs ggml-ohfmhs
NAME          CLUSTER-DEFINITION   VERSION                 TERMINATION-POLICY   STATUS     AGE
ggml-ohfmhs   ggml                 ggml-baichuan2-13b-q4   WipeOut              Deleting   3h15m

kubectl get pod -l app.kubernetes.io/instance=ggml-ohfmhs -n ns-ohfmhs
NAME                 READY   STATUS    RESTARTS   AGE
ggml-ohfmhs-ggml-0   1/1     Running   0          106m

kubectl get cmp -l app.kubernetes.io/instance=ggml-ohfmhs -n ns-ohfmhs
NAME               DEFINITION   SERVICE-VERSION   STATUS     AGE
ggml-ohfmhs-ggml                                  Updating   3h17m

kubectl get its -l app.kubernetes.io/instance=ggml-ohfmhs -n ns-ohfmhs
NAME               LEADER   READY   REPLICAS   AGE
ggml-ohfmhs-ggml            1       1          124m
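
The lookups above all target the same instance, so they can be collapsed into one loop. This is a minimal sketch, assuming the Cluster object carries the same `app.kubernetes.io/instance` label as its child objects (the `describe` output in this report suggests it does). The function name `list_cluster_resources` and the `KUBECTL` override are illustrative helpers, not part of kbcli or KubeBlocks.

```shell
# List every object that belongs to one cluster instance.
# KUBECTL can be overridden (for example KUBECTL=echo) to preview the commands.
list_cluster_resources() {
  instance="$1"
  namespace="$2"
  for kind in cluster cmp its pod; do
    echo "== $kind =="
    "${KUBECTL:-kubectl}" get "$kind" \
      -l "app.kubernetes.io/instance=$instance" \
      -n "$namespace"
  done
}

# Example: list_cluster_resources ggml-ohfmhs ns-ohfmhs
```

Running it with `KUBECTL=echo` prints the four `get` commands without contacting the API server.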

describe cluster

kubectl describe cluster -n ns-ohfmhs ggml-ohfmhs
Name:         ggml-ohfmhs
Namespace:    ns-ohfmhs
Labels:       app.kubernetes.io/instance=ggml-ohfmhs
              clusterdefinition.kubeblocks.io/name=ggml
              clusterversion.kubeblocks.io/name=ggml-baichuan2-13b-q4
Annotations:
API Version:  apps.kubeblocks.io/v1alpha1
Kind:         Cluster
Metadata:
  Creation Timestamp:             2024-07-01T08:03:31Z
  Deletion Grace Period Seconds:  0
  Deletion Timestamp:             2024-07-01T09:59:33Z
  Finalizers:
    cluster.kubeblocks.io/finalizer
  Generation:  7
  Managed Fields:
    API Version:  apps.kubeblocks.io/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          f:app.kubernetes.io/instance:
      f:spec:
        .:
        f:affinity:
          .:
          f:podAntiAffinity:
          f:tenancy:
        f:clusterDefinitionRef:
        f:clusterVersionRef:
        f:monitor:
        f:resources:
          .:
          f:cpu:
          f:memory:
        f:storage:
          .:
          f:size:
        f:terminationPolicy:
    Manager:      kbcli
    Operation:    Update
    Time:         2024-07-01T09:25:36Z
    API Version:  apps.kubeblocks.io/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
          .:
          v:"cluster.kubeblocks.io/finalizer":
        f:labels:
          .:
          f:clusterdefinition.kubeblocks.io/name:
          f:clusterversion.kubeblocks.io/name:
      f:spec:
        f:componentSpecs:
    Manager:      manager
    Operation:    Update
    Time:         2024-07-01T09:32:20Z
    API Version:  apps.kubeblocks.io/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:clusterDefGeneration:
        f:components:
          .:
          f:ggml:
            .:
            f:phase:
            f:podsReady:
            f:podsReadyTime:
        f:conditions:
        f:observedGeneration:
        f:phase:
    Manager:         manager
    Operation:       Update
    Subresource:     status
    Time:            2024-07-01T09:59:36Z
  Resource Version:  245454
  UID:               84df1843-1efa-4d5d-9888-a172146d4c99
Spec:
  Affinity:
    Pod Anti Affinity:     Preferred
    Tenancy:               SharedNode
  Cluster Definition Ref:  ggml
  Cluster Version Ref:     ggml-baichuan2-13b-q4
  Component Specs:
    Component Def Ref:  ggml
    Monitor:            false
    Name:               ggml
    Replicas:           1
    Resources:
      Limits:
        Cpu:     1
        Memory:  6Gi
      Requests:
        Cpu:     1
        Memory:  6Gi
    Service Account Name:  kb-ggml-ohfmhs
    Volume Claim Templates:
      Name:  data
      Spec:
        Access Modes:
          ReadWriteOnce
        Resources:
          Requests:
            Storage:  20Gi
  Resources:
    Cpu:     0
    Memory:  0
  Storage:
    Size:              0
  Termination Policy:  WipeOut
Status:
  Cluster Def Generation:  2
  Components:
    Ggml:
      Phase:            Updating
      Pods Ready:       false
      Pods Ready Time:  2024-07-01T09:31:03Z
  Conditions:
    Last Transition Time:  2024-07-01T09:43:10Z
    Message:               the referenced ClusterDefinition is not up to date: ggml
    Reason:                PreCheckFailed
    Status:                False
    Type:                  ProvisioningStarted
    Last Transition Time:  2024-07-01T08:03:31Z
    Message:               Successfully applied for resources
    Observed Generation:   6
    Reason:                ApplyResourcesSucceed
    Status:                True
    Type:                  ApplyResources
    Last Transition Time:  2024-07-01T09:32:27Z
    Message:               pods are not ready in Components: [ggml], refer to related component message in Cluster.status.components
    Reason:                ReplicasNotReady
    Status:                False
    Type:                  ReplicasReady
    Last Transition Time:  2024-07-01T09:32:27Z
    Message:               pods are unavailable in Components: [ggml], refer to related component message in Cluster.status.components
    Reason:                ComponentsNotReady
    Status:                False
    Type:                  Ready
  Observed Generation:     6
  Phase:                   Deleting
Events:
  Type     Reason      Age                   From                Message
  Normal   DeletingCR  2m55s (x23 over 80m)  cluster-controller  Deleting : ggml-ohfmhs
  Warning  Warning     2m55s (x23 over 80m)  cluster-controller  the referenced ClusterDefinition is not up to date: ggml
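
The full `describe` output is long, and the part that matters here is the condition list. A minimal sketch that prints one line per condition using kubectl's JSONPath output follows. The function name `show_conditions` and the `KUBECTL` override are illustrative helpers, not part of kbcli.

```shell
# Print type, status, reason and message for each status condition of a Cluster.
# KUBECTL can be overridden (for example KUBECTL=echo) to preview the command.
show_conditions() {
  name="$1"
  namespace="$2"
  "${KUBECTL:-kubectl}" get cluster "$name" -n "$namespace" \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\t"}{.message}{"\n"}{end}'
}

# Example: show_conditions ggml-ohfmhs ns-ohfmhs
```

For the cluster in this report, the `ProvisioningStarted` line would be expected to carry the `PreCheckFailed` reason shown above.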


logs kubeblocks pod

2024-07-01T10:44:10.292Z INFO the referenced ClusterDefinition is not up to date: ggml {"controller": "cluster", "controllerGroup": "apps.kubeblocks.io", "controllerKind": "Cluster", "Cluster": {"name":"ggml-ohfmhs","namespace":"ns-ohfmhs"}, "namespace": "ns-ohfmhs", "name": "ggml-ohfmhs", "reconcileID": "2881659b-df9a-4af4-abb6-b0802dfade55", "cluster": {"name":"ggml-ohfmhs","namespace":"ns-ohfmhs"}}
2024-07-01T10:44:10.292Z ERROR Reconciler error {"controller": "cluster", "controllerGroup": "apps.kubeblocks.io", "controllerKind": "Cluster", "Cluster": {"name":"ggml-ohfmhs","namespace":"ns-ohfmhs"}, "namespace": "ns-ohfmhs", "name": "ggml-ohfmhs", "reconcileID": "2881659b-df9a-4af4-abb6-b0802dfade55", "error": "the referenced ClusterDefinition is not up to date: ggml"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:329
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:227
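
The controller log mixes INFO and ERROR entries for every cluster it reconciles. A minimal sketch of a filter that keeps only ERROR-level lines for one cluster is shown here, assuming the log format above (a timestamp ending in `Z`, then the level, then a JSON payload containing `"name":"<cluster>"`). The function name `filter_cluster_errors` is hypothetical, and the deployment name in the usage comment is an assumption about the install.

```shell
# Read controller log text on stdin and keep ERROR lines for one cluster.
filter_cluster_errors() {
  cluster="$1"
  # First grep: lines that start with a timestamp followed by the ERROR level.
  # Second grep: fixed-string match on the cluster name inside the JSON payload.
  grep -E '^[0-9T:.-]+Z[[:space:]]+ERROR[[:space:]]' \
    | grep -F "\"name\":\"$cluster\""
}

# Example (deploy/kubeblocks is an assumed name; adjust to your install):
#   kubectl logs -n kb-ohfmhs deploy/kubeblocks | filter_cluster_errors ggml-ohfmhs
```

Stack-trace lines are dropped by the first `grep` because they do not start with a timestamp.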

get cd

kubectl get cd
NAME   TOPOLOGIES   SERVICEREFS   STATUS      AGE
ggml                              Available   3h15m

get cd yaml

kubectl get cd ggml -oyaml
apiVersion: apps.kubeblocks.io/v1alpha1
kind: ClusterDefinition
metadata:
  annotations:
    meta.helm.sh/release-name: kb-addon-llm
    meta.helm.sh/release-namespace: kb-ohfmhs
  creationTimestamp: "2024-07-01T08:03:15Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2024-07-01T09:33:43Z"
  finalizers:
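
The YAML above shows that the ClusterDefinition itself carries a `deletionTimestamp` (09:33:43Z), which is earlier than the Cluster's (09:59:33Z in the `describe` output). Whether that ordering explains the "not up to date" message is not established in this report. The sketch below only shows one way to read and compare the two timestamps. The function names and the `KUBECTL` override are illustrative helpers, not part of kbcli.

```shell
# Print deletionTimestamp and finalizers of any object.
# Extra arguments (for example -n <namespace>) are passed through to kubectl.
show_deletion_state() {
  kind="$1"; name="$2"; shift 2
  "${KUBECTL:-kubectl}" get "$kind" "$name" "$@" \
    -o jsonpath='{.metadata.deletionTimestamp}{"\t"}{.metadata.finalizers}{"\n"}'
}

# Succeed if timestamp $1 is strictly earlier than $2.
# Same-format RFC 3339 UTC ("Z") timestamps sort correctly as plain strings.
is_earlier() {
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | LC_ALL=C sort | head -n 1)" = "$1" ]
}

# Example:
#   show_deletion_state cd ggml
#   show_deletion_state cluster ggml-ohfmhs -n ns-ohfmhs
#   is_earlier 2024-07-01T09:33:43Z 2024-07-01T09:59:33Z && echo "cd was deleted first"
```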



leon-inf commented 3 months ago

Duplicate of #7686.