GoogleCloudPlatform / kubeflow-distribution

Blueprints for Deploying Kubeflow on Google Cloud Platform and Anthos
Apache License 2.0

Failed to create Kubeflow cluster nodes as preemptible or spot instances; timeout when minMasterVersion is given #401

Closed: ghost closed this issue 1 year ago

ghost commented 1 year ago

Hello 👋 This issue reports two separate problems with the Kubeflow distribution.

Problem Statements

First, I'm deploying the management cluster (Kubernetes 1.24.X, no issue there) and Kubeflow (1.6.1) on GCP preemptible or spot instances via an automated GitLab pipeline. I can deploy Kubernetes 1.24 for the management cluster and 1.23 for Kubeflow without a problem, but when I add preemptible or spot under the spec, the nodes are not created as preemptible or spot instances.

In the nodepool.yaml below I also tested spot: true, as well as preemptible without the taints.

common/cnrm/nodepool.yaml

# Wrapped as a complete ContainerNodePool (apiVersion/kind/metadata and the
# clusterRef at the end are filled in from the full nodepool.yaml later in
# this thread) so the fragment stands alone.
apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerNodePool
metadata:
  name: KUBEFLOW-NAME-example # kpt-set: ${name}-example
spec:
  location: LOCATION
  initialNodeCount: 2
  autoscaling:
    minNodeCount: 2
    maxNodeCount: 8
  nodeConfig:
    machineType: n1-standard-8
    minCpuPlatform: Intel Broadwell
    diskSizeGb: 150
    diskType: pd-standard
    preemptible: true
    taint:
    - effect: NO_SCHEDULE
      key: preemptible
      value: "true"
    oauthScopes:
    - "https://www.googleapis.com/auth/logging.write"
    - "https://www.googleapis.com/auth/monitoring"
    - "https://www.googleapis.com/auth/devstorage.read_only"

Second, when I add minMasterVersion, the pipeline gets stuck at the step below and times out after 60 minutes.

common/cnrm/cluster-patch.yaml

# Completed with apiVersion/kind/metadata (matching the full cluster-patch.yaml
# later in this thread) so the strategic-merge patch can match its target.
apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerCluster
metadata:
  name: KUBEFLOW-NAME # kpt-set: ${name}
spec:
  location: LOCATION
  minMasterVersion: "1.22"
  nodeVersion: "1.22"

Pipeline output:

iampolicymember.iam.cnrm.cloud.google.com/kubeflow-vm-logging condition met
iampolicymember.iam.cnrm.cloud.google.com/kubeflow-vm-policy-cloudtrace condition met
iampolicymember.iam.cnrm.cloud.google.com/kubeflow-vm-policy-meshtelemetry condition met
iampolicymember.iam.cnrm.cloud.google.com/kubeflow-vm-policy-monitoring condition met
iampolicymember.iam.cnrm.cloud.google.com/kubeflow-vm-policy-monitoring-viewer condition met
iampolicymember.iam.cnrm.cloud.google.com/kubeflow-vm-policy-storage condition met
Waiting for computeaddress resources...
computeaddress.compute.cnrm.cloud.google.com/kubeflow-ip condition met
Waiting for containercluster resources...
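
My (unconfirmed) suspicion about the timeout: the upstream cluster.yaml enrolls the cluster in the STABLE release channel, and GKE only accepts master/node versions that are available in the enrolled channel, so a pinned 1.22 that the channel does not offer may leave Config Connector reconciling until the wait step times out. A sketch of a patch that takes the cluster off the channel before pinning; whether UNSPECIFIED is accepted depends on your Config Connector version, so treat it as an assumption:

apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerCluster
metadata:
  name: KUBEFLOW-NAME # kpt-set: ${name}
spec:
  location: LOCATION # kpt-set: ${location}
  releaseChannel:
    channel: UNSPECIFIED # assumption: clears the channel enrollment
  minMasterVersion: "1.22"
  nodeVersion: "1.22"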
connor-mccarthy commented 1 year ago

/assign @gkcalat

gkcalat commented 1 year ago

Hi @archangelita! Thank you for reporting this.

Kubeflow on Google Cloud has not been tested on GKE versions above 1.22-1.23. In fact, we usually target the default version of the STABLE release channel.

Kubeflow on Google Cloud 1.7 will support GKE 1.24 (and possibly 1.25); the release is scheduled for around April 2023.

You can also try changing the GKE release channel to REGULAR, which has 1.24 as its default.
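
A minimal sketch of that change, assuming it goes through common/cnrm/cluster-patch.yaml as above:

apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerCluster
metadata:
  name: KUBEFLOW-NAME # kpt-set: ${name}
spec:
  location: LOCATION # kpt-set: ${location}
  releaseChannel:
    channel: REGULAR # default GKE version in REGULAR was 1.24 at the time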

oguzhanoyan commented 1 year ago

Hello, I also encountered this problem. I followed the steps below and successfully deployed preemptible nodes.

Here are the files:

common/cluster/upstream/cluster.yaml:

apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerCluster
metadata:
  clusterName: "PROJECT/LOCATION/KUBEFLOW-NAME" # kpt-set: ${gcloud.core.project}/${location}/${name}
  labels:
    mesh_id: "proj-PROJECT_NUMBER" # kpt-set: proj-${gcloud.project.projectNumber}
  name: KUBEFLOW-NAME # kpt-set: ${name}
spec:
  initialNodeCount: 2
  addonsConfig:
    httpLoadBalancing:
      disabled: false
  clusterAutoscaling:
    enabled: true
    autoProvisioningDefaults:
      oauthScopes:
      - https://www.googleapis.com/auth/logging.write
      - https://www.googleapis.com/auth/monitoring
      - https://www.googleapis.com/auth/devstorage.read_only
      serviceAccountRef:
        external: KUBEFLOW-NAME-vm@PROJECT.iam.gserviceaccount.com # kpt-set: ${name}-vm@${gcloud.core.project}.iam.gserviceaccount.com
    resourceLimits:
    - resourceType: cpu
      maximum: 128
    - resourceType: memory
      maximum: 2000
    - resourceType: nvidia-tesla-t4
      maximum: 16
  releaseChannel:
    channel: STABLE
  location: LOCATION # kpt-set: ${location}
  workloadIdentityConfig:
    identityNamespace: PROJECT.svc.id.goog # kpt-set: ${gcloud.core.project}.svc.id.goog
  loggingService: logging.googleapis.com/kubernetes
  monitoringService: monitoring.googleapis.com/kubernetes
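
A note on the design here: clusterAutoscaling with autoProvisioningDefaults lets GKE auto-provision node pools up to the CPU, memory, and nvidia-tesla-t4 limits listed above, while the cluster stays on the STABLE release channel; the preemptible setting itself lives in the explicit node pool further down.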

common/cnrm/cluster-patch.yaml:

apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerCluster
metadata:
  name: KUBEFLOW-NAME # kpt-set: ${name}
  annotations:
    cnrm.cloud.google.com/remove-default-node-pool: "true"
spec:
  location: LOCATION # kpt-set: ${location}
  nodeLocations:
  - LOCATION # kpt-set: ${location}-b
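
The cnrm.cloud.google.com/remove-default-node-pool annotation tells Config Connector to delete the default node pool that GKE creates with the cluster, so only the explicitly defined ContainerNodePool below remains.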

common/cnrm/kustomization.yaml:

namespace: PROJECT # kpt-set: ${gcloud.core.project}
commonLabels:
  kf-name: KUBEFLOW-NAME # kpt-set: ${name}
resources:
# The default Google Cloud resources.
- ../cluster/upstream
- ../ingress/upstream
- ../iam/upstream
- gcp-services.yaml
- nodepool.yaml

patchesStrategicMerge:
- cluster-patch.yaml
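
The wiring matters here: nodepool.yaml is listed under resources as a new object, while cluster-patch.yaml is applied via patchesStrategicMerge on top of the upstream cluster.yaml, so the patch's kind and metadata.name must match the upstream resource.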

common/cnrm/nodepool.yaml:

apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerNodePool
metadata:
  name: KUBEFLOW-NAME-example # kpt-set: ${name}-example
spec:
  location: LOCATION # kpt-set: ${location}
  nodeLocations:
  - LOCATION # kpt-set: ${location}-b
  initialNodeCount: 4
  autoscaling:
    minNodeCount: 2
    maxNodeCount: 8
  nodeConfig:
    machineType: n1-standard-8
    minCpuPlatform: Intel Broadwell
    preemptible: true
    guestAccelerator:
    - type: "nvidia-tesla-t4"
      count: 1
    metadata:
      disable-legacy-endpoints: "true"
    serviceAccountRef:
      external: KUBEFLOW-NAME-vm@PROJECT.iam.gserviceaccount.com # kpt-set: ${name}-vm@${gcloud.core.project}.iam.gserviceaccount.com
    workloadMetadataConfig:
      nodeMetadata: GKE_METADATA_SERVER
    oauthScopes:
      - https://www.googleapis.com/auth/logging.write
      - https://www.googleapis.com/auth/monitoring
      - https://www.googleapis.com/auth/devstorage.read_only
  clusterRef:
    name: KUBEFLOW-NAME # kpt-set: ${name}
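
Since each node in this pool attaches one nvidia-tesla-t4, a workload that should use it has to request the GPU resource; on GKE that request should also get the pod a toleration for the automatic GPU node taint. A minimal illustrative sketch (pod name and image are hypothetical):

apiVersion: v1
kind: Pod
metadata:
  name: t4-example # hypothetical name
spec:
  containers:
  - name: trainer
    image: gcr.io/PROJECT/trainer:latest # hypothetical image
    resources:
      limits:
        nvidia.com/gpu: 1 # schedules onto a node with the attached T4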

With this configuration, your preemptible nodes should appear in the Kubeflow cluster.