GoogleCloudPlatform / kubeflow-distribution

Blueprints for Deploying Kubeflow on Google Cloud Platform and Anthos
Apache License 2.0

Cannot deploy Kubeflow 1.2.0 in a custom VPC #199

Open krish0005 opened 3 years ago

krish0005 commented 3 years ago

See comments already posted here: https://github.com/kubeflow/manifests/issues/1577#issuecomment-777587661

My cluster.yaml is as follows:

apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerCluster
metadata:
  name: aip-kubeflow-mgmt # {"$kpt-set":"name"}
  annotations:
    cnrm.cloud.google.com/remove-default-node-pool: "true"
spec:
  location: europe-west1-b # {"$kpt-set":"location"}
  initialNodeCount: 3
  workloadIdentityConfig:
    identityNamespace: acn-cpaascgp.svc.id.goog # {"$kpt-set":"wi-pool"}
  releaseChannel:
    channel: REGULAR
  networkRef:
    external: https://www.googleapis.com/compute/v1/projects/myproject/global/networks/myproject-dev
  subnetworkRef:
    external: https://www.googleapis.com/compute/v1/projects/myproject/regions/europe-west1-b/myproject/public

What I get when I build:

make apply-cluster

make -f ./upstream/management/Makefile hydrate-cluster
make[1]: Entering directory `/home/myproject-dev/kubeflow/myproject-gcp-dex-mgmt_cluster/management'
The management cluster name "myproject-mgmt" is valid.
# Delete the directory so any resources that have been removed
# from the manifests will be pruned
rm -rf build/cluster
mkdir -p build/cluster
kustomize build ./instance/cluster -o build/cluster
make[1]: Leaving directory `/home/myproject-dev/kubeflow/myproject-gcp-dex-mgmt_cluster/management'
make -f ./upstream/management/Makefile apply-cluster
make[1]: Entering directory `/home/myproject-dev/kubeflow/myproject-gcp-dex-mgmt_cluster/management'
# Create the cluster
anthoscli apply -f build/cluster
I0211 15:34:40.932055   15959 main.go:230] reconcile serviceusage.cnrm.cloud.google.com/Service container.googleapis.com
I0211 15:34:42.070023   15959 main.go:230] reconcile container.cnrm.cloud.google.com/ContainerCluster myproject-mgmt
Unexpected error: error reconciling objects: error reconciling ContainerCluster:mygcpproject/myproject-mgmt: error creating GKE cluster myproject-mgmt: googleapi: Error 400: Project "mygcpproject" has no network named "default".
make[1]: *** [apply-cluster] Error 1
make[1]: Leaving directory `/home/myproject-dev/kubeflow/myproject-gcp-dex-mgmt_cluster/management'
make: *** [apply-cluster] Error 2

PatrickXYS commented 3 years ago

/cc @Bobgy

Who will be taking care of GCP-related issues?

Bobgy commented 3 years ago

Hey @krish0005, sorry for the delayed reply. You might want to raise an issue in https://github.com/GoogleCloudPlatform/k8s-config-connector, which is the open source repo for https://cloud.google.com/config-connector/docs/overview

krish0005 commented 3 years ago

Hi @Bobgy,

Thanks a lot for taking this on and for replying. I contacted Google Support after opening this GitHub issue, and they verified the following (quoting selectively from their reply):

the error is coming from gcloud beta anthos apply and not directly from config connector

I have been able to reproduce this - using gcloud beta anthos apply and your cluster.yaml. Even when the network is already created the cluster uses the default network rather than a custom network. I also tried to create a network with gcloud beta anthos apply and the network resource is not created - it could be that gcloud beta anthos apply has limited support for compute resources such as networks or sub-networks.

I have reported the issue to the product team but please note that this command is in beta therefore I cannot give a date for when a fix is likely to be implemented.

Please note that even if they commit to a fix there is no guarantee on when that would likely be implemented as this is in beta

So the cluster resource is defined in the config connector format but not applied using config connector as config connector is not yet installed in the cluster or running in another cluster in the project. The resources are being deployed using gcloud beta anthos apply - this understands config connector format but does not seem to work in exactly the same way.

The errors you see are returned by the gcloud beta anthos apply command. I am trying to reproduce applying a cluster config using gcloud beta anthos apply to see if this also ignores the custom network and uses the default network.

Would you be able to reproduce this issue on your side and confirm that it is caused by gcloud beta anthos apply? As you can see, I already asked Support to open an internal issue with the Anthos product team (opened on 22 Feb), but as of today there is no update from them.

Bobgy commented 3 years ago

Hi @krish0005, this does have to do with gcloud beta anthos apply. (My mistake, I overlooked that you got the error when deploying the management cluster. So the problem is with gcloud beta anthos apply, not config-connector.)

In this case, you can simply modify the Makefile to replace the step that applies the management cluster with a gcloud command that creates the cluster; just make sure it produces the same spec as your cluster.yaml. (The deployment process is designed so that changing it is easy, which is why we use a Makefile to automate it.)
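
As a rough sketch of what such a gcloud replacement might look like (not the project's official fix): the script below mirrors the fields from the cluster.yaml above (zone, node count, release channel, Workload Identity pool, network). The project, cluster, and network names are the anonymized values from this thread. The subnet name `my-subnet` and the `RUN` switch are hypothetical placeholders, and the Workload Identity pool is assumed to be `<project>.svc.id.goog`. By default the script only prints the command so you can review it first.

```shell
#!/bin/sh
# Sketch: create the management cluster with gcloud instead of `anthoscli apply`.
# All values are placeholders; override them via environment variables.
set -eu

PROJECT="${PROJECT:-mygcpproject}"        # project ID as shown in the error log
NAME="${NAME:-myproject-mgmt}"            # cluster name as shown in the error log
ZONE="${ZONE:-europe-west1-b}"            # spec.location in cluster.yaml
NETWORK="${NETWORK:-myproject-dev}"       # custom VPC from networkRef
SUBNETWORK="${SUBNETWORK:-my-subnet}"     # hypothetical subnet name, replace with yours

# Build the command as positional parameters so quoting stays intact.
set -- gcloud container clusters create "$NAME" \
  --project "$PROJECT" \
  --zone "$ZONE" \
  --num-nodes 3 \
  --release-channel regular \
  --workload-pool "${PROJECT}.svc.id.goog" \
  --network "$NETWORK" \
  --subnetwork "$SUBNETWORK"

# Print the command by default; set RUN=1 to actually execute it.
if [ "${RUN:-0}" = "1" ]; then
  "$@"
else
  echo "$*"
fi
```

You could call a script like this from the Makefile's apply-cluster target in place of the anthoscli line. Note that it does not reproduce the remove-default-node-pool annotation from cluster.yaml, so node pools would need to be handled separately if you rely on that behaviour.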

Then you can continue with the rest of the installation as usual.