GoogleCloudPlatform / kubeflow-distribution

Blueprints for Deploying Kubeflow on Google Cloud Platform and Anthos
Apache License 2.0

Documentation - we need a KF 1.1 on GCP overview page #123

Open Bobgy opened 4 years ago

Bobgy commented 4 years ago

I've been getting quite a few questions from different channels about various topics.

I think the current documentation only explains how to deploy Kubeflow 1.1, but it doesn't touch on the topics below:

These are pretty much all I have in mind right now; there are probably more.

UPDATE 8.24

I edited this and added quick answers below.

Where is kfctl?

kfctl is deprecated for Google Cloud. The decision is specific to Google Cloud; other platforms may continue to use kfctl.

Why did we stop using kfctl?

There were multiple reasons:

Therefore, we are removing the extra layer of abstraction in kfctl and providing a simple Makefile (that is supposed to be easier to understand and customize) which leverages generic tools (kustomize, kpt and Cloud Config Connector) to deploy Kubeflow.

What is Anthos Service Mesh? Can we replace it with istio? How much does it cost?

Anthos Service Mesh is managed Istio on Anthos. It doesn't add extra abstractions: you can still use the CRDs from open source Istio with Anthos Service Mesh, and it comes with additional features, such as observability, built in with Google Cloud. Therefore, you should be able to swap it for Istio 1.4 if you prefer to avoid it (for example, because of extra cost). I don't have an answer on how much it costs yet; it might require an Anthos subscription. I recommend asking Google Cloud sales about it. Contributions are welcome if anyone gets it working with OSS Istio 1.4.

What is cloud config connector/management cluster? Why do we use it?

Cloud config connector is introduced in https://cloud.google.com/config-connector/docs/overview.

Config Connector is a Kubernetes addon that allows you to manage Google Cloud resources through Kubernetes. Config Connector provides a collection of Kubernetes Custom Resource Definitions (CRDs) and controllers. The Config Connector CRDs allow Kubernetes to create and manage Google Cloud resources when you configure and apply Objects to your cluster.

So, basically, Config Connector makes it possible to manage Google Cloud resources using YAML files, expressed as Kubernetes custom resources. The Kubeflow 1.1 default setup installs Config Connector into a lightweight management cluster (which contains only a single node with 4 CPUs and 15GB memory). You can choose to delete the management cluster or scale it down to save costs after Kubeflow is deployed.
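To make this concrete, here is a minimal sketch of what a Google Cloud resource can look like when declared through Config Connector. It is an illustration only, not taken from the Kubeflow manifests: the resource name, namespace and display name are hypothetical. The namespace is set to a project ID only because the troubleshooting commands in this thread query resources with -n $PROJECT.

```yaml
# Hypothetical example: a Google Cloud service account declared as a
# Kubernetes custom resource that Config Connector reconciles.
apiVersion: iam.cnrm.cloud.google.com/v1beta1
kind: IAMServiceAccount
metadata:
  name: example-sa            # hypothetical service account name
  namespace: my-gcp-project   # hypothetical; assumed to match your project ID
spec:
  displayName: Example service account
```

Applying a file like this to the management cluster with kubectl apply -f would ask Config Connector to create the service account in Google Cloud.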

Before KF 1.1, Kubeflow on GCP used Deployment Manager (DM, https://cloud.google.com/deployment-manager) for Google Cloud resources. Config Connector solves some of its problems:

In summary, our vision for switching to Cloud Config Connector is that it empowers a unified workflow using kustomize and kpt for both Google Cloud resources and Kubernetes resources that Kubeflow relies on. The workflow now supports day 2 operations (customize + upgrade at the same time).

How to troubleshoot Cloud Config Connector?

You can use kubectl to query the status of resources; the output includes detailed error messages. For example:

# switch to management cluster context
kubectl config use-context $MANAGEMENT

# list managed clusters
kubectl get containercluster -n $PROJECT
# debug a certain cluster
kubectl describe containercluster --context $MANAGEMENT -n $PROJECT <cluster-name>

# list service accounts
kubectl get iamserviceaccount -n $PROJECT
# debug a certain service account
kubectl describe iamserviceaccount -n $PROJECT <service-account-name>

How to customize Google Cloud resources?

You can use kustomize to add customizations in your ./kubeflow/instance/gcp_config folder. You can find the following content in ./kubeflow/instance/gcp_config/kustomization.yaml:

resources:
- ../../upstream/manifests/gcp/v2/cnrm

It means the kustomization.yaml includes resources defined in files in that relative folder. So you can go to that folder ./kubeflow/upstream/manifests/gcp/v2/cnrm to take a look at what the base template looks like.

For example, you can add patches using patchesStrategicMerge and write partial YAML files that contain only the fields you want to change.

kustomize documentation: https://kustomize.io/
Cloud Config Connector resource spec documentation: https://cloud.google.com/config-connector/docs/how-to/creating-resource-references
You can find all specs in https://cloud.google.com/config-connector/docs/reference/resources
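As a rough sketch of such a customization, the two files below add a patch that changes the autoscaling limits of a node pool. The patch file name and the node pool name are hypothetical; the name (and namespace, if the base sets one) must match the resource actually defined in ./kubeflow/upstream/manifests/gcp/v2/cnrm, so check the base manifests before copying this.

```yaml
# ./kubeflow/instance/gcp_config/kustomization.yaml
resources:
- ../../upstream/manifests/gcp/v2/cnrm
patchesStrategicMerge:
- nodepool-patch.yaml          # hypothetical file name
---
# ./kubeflow/instance/gcp_config/nodepool-patch.yaml (hypothetical)
# Partial resource: only the fields listed here override the base.
apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerNodePool
metadata:
  name: my-kubeflow-cpu-pool   # hypothetical; must match the name in the base manifests
spec:
  autoscaling:
    minNodeCount: 1
    maxNodeCount: 5
```

Running kustomize build ./kubeflow/instance/gcp_config should print the merged manifests, which is a quick way to check that the patch matched the intended resource before applying anything.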

issue-label-bot[bot] commented 4 years ago

Issue-Label Bot is automatically applying the labels:

Label | Probability
area/kfctl | 0.52
kind/question | 0.74
platform/gcp | 0.89


Bobgy commented 4 years ago

/cc @jlewi @8bitmp3 What do you think?

jlewi commented 4 years ago

Overview or FAQ?

8bitmp3 commented 4 years ago

💯👍!

@Bobgy

What is Anthos Service Mesh?

I can share some of my writing (from May 2020) that we can shorten and include in the KF 1.1 on GCP Overview/FAQ proposed by @Bobgy and @jlewi. I put this together earlier this year to figure out how these GCP parts relate to each other:

"Anthos Service Mesh provides operational control and insights over a service mesh—a network of microservices that make up the applications and the interactions between them. It enables users to get a uniform observability into the workloads, so that they can make informed decisions on routing traffic, security and encryption policy enforcement, and other rule configurations."

"Anthos Service Mesh is a network for services that manages interactions across all services. It uses a distribution of Istio—an open-source implementation of the service mesh infrastructure layer.

... "Anthos is a broad product suite that helps bridge the worlds of on-prem and cloud-based infrastructure."

"Anthos helps you move applications to the cloud-native world with improved workflow management and reduced operational complexity. You can move workloads from on-prem to the cloud and manage the infrastructure with a consistent set of policies and tools.

@Bobgy

Can we replace it with istio? How much does it cost? (I couldn't find related documentation)

Some of this should also help, especially the distinction between Istio on GKE and Anthos Service Mesh:

"For each of Google Cloud's fully-managed solutions, such as Google Kubernetes Engine (GKE) and Istio on GKE, Anthos has equivalent platforms for the on-prem, multi-cloud, and hybrid cloud world:

Fully-managed by Google Cloud | For on-prem, multi-cloud, hybrid cloud
Kubernetes Engine (GKE) | Anthos GKE (On-Prem, for AWS/Azure)
Istio on GKE | Anthos Service Mesh (on GKE)

"Anthos GKE and Anthos Service Mesh are built on open-source Kubernetes and Istio."

"These technologies provide solutions for modern infrastructure and application development challenges by:

"- Decoupling of applications for modularity > with Kubernetes and Istio."
"- Providing scalable configuration management > with Istio."

This may be helpful in the Istio docs on KF:

"Starting from v1.5 as of Q1 2020, the Istio components in the Control Plane—Pilot, Galley, and Citadel—were consolidated into istiod.

connorlwilkes commented 4 years ago

I think it's a very valid question why these GCP blueprints appear to be slowly tying in with Anthos. It would make sense to offer Istio as an alternative if possible, as I understand there are costs with the Anthos route and many users may not want to pay for this.

Bobgy commented 4 years ago

@jlewi Do you have context on why we chose Anthos Service Mesh as the built-in option for Kubeflow?

My understanding is that open source Istio 1.4 should function the same if users prefer Istio.

Bobgy commented 4 years ago

For cloud config connector and management cluster, my answer is, that's the Google Cloud opinionated way of declarative cloud resources management. You can use yaml files that can be processed by kustomize to define GCP resources.

Things like providing built-in workload identity bindings as manifests are powered by this new ability.
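For readers unfamiliar with the idea, a workload identity binding expressed as a manifest might look roughly like the sketch below. This is an assumption-laden illustration rather than the manifest Kubeflow ships: the policy name, project ID, Google service account name, Kubernetes namespace and Kubernetes service account name are all hypothetical.

```yaml
# Hypothetical example: allow a Kubernetes service account to act as a
# Google service account through workload identity.
apiVersion: iam.cnrm.cloud.google.com/v1beta1
kind: IAMPolicy
metadata:
  name: example-workload-identity   # hypothetical
spec:
  resourceRef:
    apiVersion: iam.cnrm.cloud.google.com/v1beta1
    kind: IAMServiceAccount
    name: example-sa                # hypothetical Google service account resource
  bindings:
  - role: roles/iam.workloadIdentityUser
    members:
    # Format: serviceAccount:PROJECT_ID.svc.id.goog[K8S_NAMESPACE/K8S_SERVICE_ACCOUNT]
    - serviceAccount:my-gcp-project.svc.id.goog[kubeflow/example-ksa]
```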

EDIT: also, providing YAML files that can be processed by kustomize means we better support day 2 operations. For example, you might want to customize a GCP resource created by Kubeflow. Now you can create a kustomize overlay in your instance folder to adjust that resource. And the next day, you may want to upgrade to KF 1.1.1; that will be as simple as pulling in the upstream manifests for 1.1.1, while your customizations stay in your instance folder.

devoftheweb commented 4 years ago

I too don't understand the requirement for adding in the Anthos Service Mesh as an abstraction layer over istio -- when the user is looking to deploy Kubeflow solely on GCP (not multi-cloud or on-prem).

@8bitmp3

I can share some of my writing ...

Yes, that explains what the Anthos Service Mesh is but it doesn't explain:

  1. Why is it explicitly needed in a default Kubeflow GCP deployment?
  2. The decision process behind transitioning to the Anthos abstraction layer?

Particularly # 2, as that decision can have real cost implications for users.

Some of this should also help...

Again, sure, there are many product offerings which can replace open source solutions. For example, Argo can be replaced by another vendor's workflow engine.

Another concern would be incurred cost by running the Anthos Service Mesh. As I understand it, it's not inexpensive in comparison to some other GCP services.

devoftheweb commented 4 years ago

I agree with @connorlwilkes.

@Bobgy

my answer is, that's the Google Cloud opinionated way of declarative cloud resources management. You can use yaml files that can be processed by kustomize to define GCP resources.

I'd say Kustomize is increasing in popularity across Kubernetes deployments (vs Helm).

I guess some concerns here are:

  1. The decision to include Anthos Service Mesh as the default deployment for Kubeflow on GCP
  2. Should Anthos Service Mesh really be the default deployment option for GCP versus open source istio?

Keep in mind Anthos isn't the cheapest*. Not all users may want to include Anthos in their deployment, and some may be put off by the effort required to reintegrate open source Istio.

Bobgy commented 4 years ago

For clarification, Anthos Service Mesh is literally managed Istio. It's not an abstraction on top. You are not tied to using ASM.

Therefore, you can replace it with Istio. If you get that working, we'd welcome a contribution to provide Istio as an option (or default?).

And this is currently an early phase of the release; many docs are still catching up. The purpose of this issue is to figure out answers to these questions and make them clear in the documentation too.

jlewi commented 4 years ago

On GCP we use ASM because it is the recommended way of running Istio on GCP. If you want to deploy and run OSS Istio instead, you are welcome to do so.

Bobgy commented 4 years ago

Hi all! I've updated the first comment with the quick answers I drafted: https://github.com/kubeflow/gcp-blueprints/issues/123#issue-678249292.

Feel free to let me know if there are further questions.

jlewi commented 4 years ago

I would put OSS Istio into the "bring your own infrastructure" bucket.

53 is tracking installing on existing clusters. I think in this case users would be responsible for setting up a cluster with Istio and then continuing to install KF on top of it.

jtfogarty commented 3 years ago

/priority p1