Closed ghost closed 1 year ago
/assign @gkcalat
Hi @archangelita! Thank you for reporting this.
Kubeflow on Google Cloud has not been tested on GKE versions above 1.22-1.23. In fact, we usually target the default version in the STABLE release channel.
Kubeflow on Google Cloud 1.7 will add support for GKE 1.24 (and possibly 1.25); the release is scheduled for around April 2023.
You can also try switching the cluster to the REGULAR release channel, which has 1.24 as its default.
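If you go that route, the channel is set in the cluster config itself. A minimal sketch of the change, assuming the cluster is managed through the CNRM ContainerCluster resource as in the files further down this thread:

```yaml
# Sketch: switch the GKE release channel from STABLE to REGULAR in
# common/cluster/upstream/cluster.yaml. REGULAR currently defaults to a
# newer GKE minor version than STABLE.
spec:
  releaseChannel:
    channel: REGULAR
```

Note that changing the release channel on an existing cluster can trigger an automatic upgrade of the control plane and nodes.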
Hello, I ran into the same problem. After following some of the steps above, I was able to deploy preemptible nodes successfully. Here are the files:
common/cluster/upstream/cluster.yaml:
apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerCluster
metadata:
  clusterName: "PROJECT/LOCATION/KUBEFLOW-NAME" # kpt-set: ${gcloud.core.project}/${location}/${name}
  labels:
    mesh_id: "proj-PROJECT_NUMBER" # kpt-set: proj-${gcloud.project.projectNumber}
  name: KUBEFLOW-NAME # kpt-set: ${name}
spec:
  initialNodeCount: 2
  addonsConfig:
    httpLoadBalancing:
      disabled: false
  clusterAutoscaling:
    enabled: true
    autoProvisioningDefaults:
      oauthScopes:
      - https://www.googleapis.com/auth/logging.write
      - https://www.googleapis.com/auth/monitoring
      - https://www.googleapis.com/auth/devstorage.read_only
      serviceAccountRef:
        external: KUBEFLOW-NAME-vm@PROJECT.iam.gserviceaccount.com # kpt-set: ${name}-vm@${gcloud.core.project}.iam.gserviceaccount.com
    resourceLimits:
    - resourceType: cpu
      maximum: 128
    - resourceType: memory
      maximum: 2000
    - resourceType: nvidia-tesla-t4
      maximum: 16
  releaseChannel:
    channel: STABLE
  location: LOCATION # kpt-set: ${location}
  workloadIdentityConfig:
    identityNamespace: PROJECT.svc.id.goog # kpt-set: ${gcloud.core.project}.svc.id.goog
  loggingService: logging.googleapis.com/kubernetes
  monitoringService: monitoring.googleapis.com/kubernetes
common/cnrm/cluster-patch.yaml:
apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerCluster
metadata:
  name: KUBEFLOW-NAME # kpt-set: ${name}
  annotations:
    cnrm.cloud.google.com/remove-default-node-pool: "true"
spec:
  location: LOCATION # kpt-set: ${location}
  nodeLocations:
  - LOCATION # kpt-set: ${location}-b
common/cnrm/kustomization.yaml:
namespace: PROJECT # kpt-set: ${gcloud.core.project}
commonLabels:
  kf-name: KUBEFLOW-NAME # kpt-set: ${name}
resources:
# The default Google Cloud resources.
- ../cluster/upstream
- ../ingress/upstream
- ../iam/upstream
- gcp-services.yaml
- nodepool.yaml
patchesStrategicMerge:
- cluster-patch.yaml
common/cnrm/nodepool.yaml:
apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerNodePool
metadata:
  name: KUBEFLOW-NAME-example # kpt-set: ${name}-example
spec:
  location: LOCATION # kpt-set: ${location}
  nodeLocations:
  - LOCATION # kpt-set: ${location}-b
  initialNodeCount: 4
  autoscaling:
    minNodeCount: 2
    maxNodeCount: 8
  nodeConfig:
    machineType: n1-standard-8
    minCpuPlatform: Intel Broadwell
    preemptible: true
    guestAccelerator:
    - type: "nvidia-tesla-t4"
      count: 1
    metadata:
      disable-legacy-endpoints: "true"
    serviceAccountRef:
      external: KUBEFLOW-NAME-vm@PROJECT.iam.gserviceaccount.com # kpt-set: ${name}-vm@${gcloud.core.project}.iam.gserviceaccount.com
    workloadMetadataConfig:
      nodeMetadata: GKE_METADATA_SERVER
    oauthScopes:
    - https://www.googleapis.com/auth/logging.write
    - https://www.googleapis.com/auth/monitoring
    - https://www.googleapis.com/auth/devstorage.read_only
  clusterRef:
    name: KUBEFLOW-NAME # kpt-set: ${name}
With that configuration, the preemptible nodes should appear in your Kubeflow cluster.
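If you want spot VMs instead of preemptible ones, newer CNRM versions mirror the Terraform provider and expose a spot field in the same place. This variant is an untested sketch, assuming a CNRM version whose ContainerNodePool schema includes nodeConfig.spot:

```yaml
# Sketch: spot-VM variant of the node pool. If your installed CNRM version
# does not expose nodeConfig.spot, keep preemptible: true instead.
nodeConfig:
  machineType: n1-standard-8
  spot: true
```

The two flags are mutually exclusive; set only one of spot or preemptible on a given node pool.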
Hello 👋 This issue calls out two separate problems with the Kubeflow distribution.
Problem Statements
First, I'm deploying the management cluster (1.24.X-Y, which works without issue) and Kubeflow (1.6.1) on GCP on preemptible or spot instances via an automated GitLab pipeline. I can deploy Kubernetes 1.24 for the management cluster and Kubernetes 1.23 for Kubeflow without problems, but when I add preemptible or spot under the spec, my nodes are not created as preemptible or spot instances.
For the nodepool.yaml below, I also tested spot: true and preemptible without taints.
common/cnrm/nodepool.yaml
Secondly, when I add minMasterVersion, the pipeline gets stuck on this step and times out after 60 minutes.
common/cnrm/cluster-patch.yaml
Pipeline output: