GoogleCloudPlatform / kubeflow-distribution

Blueprints for Deploying Kubeflow on Google Cloud Platform and Anthos
Apache License 2.0

[KF 1.3] Integrate Knative and KFServing #209

Closed zijianjoy closed 3 years ago

zijianjoy commented 3 years ago

High-level tracking: https://github.com/kubeflow/manifests/issues/1798

To prepare the GCP distribution of Kubeflow 1.3, we need to make sure the KFServing installation works. Our deployment steps do not currently install cluster-local-gateway, but the KFServing GCP/IAP example documents it as a requirement. Following the manifests README, I ran the following command to build cluster-local-gateway:

kustomize build --load-restrictor LoadRestrictionsNone -o $(BUILD_DIR)/cluster-local-gateway ./common/istio/istio-1-9-0/cluster-local-gateway/base

The build result is in https://github.com/kubeflow/gcp-blueprints/commit/f2f05bb5c07012acfaaa31575a3235761bc24e82.

However, the cluster-local-gateway deployment is failing with the following error:

warn Envoy proxy is NOT ready: config not received from Pilot (is Pilot running?): cds updates: 0 successful, 0 rejected; lds updates: 0 successful, 0 rejected
warn sds failed to warm certificate: failed to generate workload certificate: create certificate: rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: x509: certificate signed by unknown authority"
warn ca ca request failed, starting attempt 1 in 96.070398ms
warn ca ca request failed, starting attempt 2 in 187.488321ms
warn ca ca request failed, starting attempt 3 in 429.265926ms
warn ca ca request failed, starting attempt 4 in 878.238921ms

How should I configure the certificate and CA correctly? Since I am using Anthos Service Mesh 1.9.2-1, Pilot does not run as a separate pod but lives inside istiod. The pods are listed below:

$ kubectl get pod -n istio-system
NAME                                     READY   STATUS    RESTARTS   AGE
backend-updater-0                        1/1     Running   0          39h
cluster-local-gateway-6b6cb58745-qq7r2   0/1     Running   0          7h7m
iap-enabler-5864d6c776-5s5lc             1/1     Running   0          39h
istio-ingressgateway-66c8544779-78z7n    1/1     Running   0          43h
istio-ingressgateway-66c8544779-vch9t    1/1     Running   0          43h
istiod-asm-192-1-6674458687-7gh9j        1/1     Running   0          43h
istiod-asm-192-1-6674458687-wkxbx        1/1     Running   0          43h
whoami-app-7fc9f76f57-nr2lx              1/1     Running   0          39h

cc @Bobgy @yanniszark

Bobgy commented 3 years ago

@zijianjoy I previously investigated this problem in https://github.com/kubeflow/gcp-blueprints/issues/176#issuecomment-759235650.

I suggest moving this to the kubeflow/gcp-blueprints repo, because the cluster-local-gateway installation should be part of the istio-operator definition, so it will probably be specific to Anthos Service Mesh.

Bobgy commented 3 years ago

I think the best references for KFServing and Knative installation are their official docs, rather than the manifests README.

https://github.com/kubeflow/kfserving https://knative.dev/docs/install/

From the KFServing doc, it seems that it only supports usage with Istio, so we can ignore Knative installation methods that do not depend on Istio.

For Knative, I found https://knative.dev/docs/install/knative-with-operators/. An operator typically supports upgrades better in the long term, but the Knative operator is still in the Alpha phase, so we should skip it. It is better to follow the installation doc at https://knative.dev/docs/install/install-serving-with-yaml/#install-a-networking-layer.

Bobgy commented 3 years ago

Renamed the title to make this the central tracker for KFServing.

zijianjoy commented 3 years ago

Thank you so much Yuan for all the helpful information!

TL;DR

I was able to install Knative Serving v0.22.0 from the official doc, cluster-local-gateway within ASM, and KFServing using kubeflow/manifests. I was then able to call the predict endpoint of the sklearn iris InferenceService without sidecar injection. The current blocker is calling the predict endpoint with sidecar injection (which means IAP enabled).

zijianjoy commented 3 years ago

Detailed integration steps

1. Clean up existing deployments

If Knative and cluster-local-gateway are already deployed, we need to clean them up first, namely the following targets in the kubeflow/manifests repo:

- knative/upstream/knative-serving-crds/base
- knative/upstream/knative-serving-install/base
- istio/istio-1-9-0/cluster-local-gateway/base

Reason: KFServing requires Knative v0.17.4+, but kubeflow/manifests contains v0.14.3. We also need to use the ASM approach for cluster-local-gateway: https://github.com/kubeflow/gcp-blueprints/issues/176#issuecomment-759235650

2. Istio Customization

ASM has already installed Istio for us on GCP, but I customized it to use the Istio mTLS feature with the following steps:

  1. Apply the asm label to knative-serving namespace: kubectl label namespace knative-serving istio.io/rev=asm-192-1
  2. Set PeerAuthentication to PERMISSIVE on the knative-serving system namespace; see the Knative doc on using the Istio mTLS feature.
  3. Update the config-istio configmap to use a non-default local gateway:
    • kubectl edit configmap config-istio -n knative-serving
    • Find the field local-gateway.knative-serving.knative-local-gateway
    • Replace it with local-gateway.knative-serving.cluster-local-gateway: "cluster-local-gateway.istio-system.svc.cluster.local".
  4. Run the command kubectl edit gateway cluster-local-gateway -n knative-serving
    • Confirm that the label selector is istio: cluster-local-gateway
  5. Set the custom domain https://knative.dev/docs/serving/using-a-custom-domain/
    • Run kubectl edit cm config-domain --namespace knative-serving
    • Replace the whole _example: and its content with {KF_NAME}.endpoints.{GCP_PROJECT_ID}.cloud.goog: ""
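
Step 5 above uses an interactive kubectl edit. A minimal non-interactive sketch is shown here, assuming a JSON merge patch is acceptable for the config-domain ConfigMap. The function names domain_patch and set_custom_domain are hypothetical and do not come from the Knative docs or this repo.

```shell
# Hypothetical helper: print a JSON merge patch for the config-domain ConfigMap.
# In a merge patch, setting "_example" to null removes that key;
# the added key is the custom domain with an empty value.
domain_patch() {
  kf_name="$1"
  project_id="$2"
  printf '{"data":{"_example":null,"%s.endpoints.%s.cloud.goog":""}}\n' \
    "$kf_name" "$project_id"
}

# Hypothetical wrapper: apply the patch (needs kubectl access to the cluster).
set_custom_domain() {
  kubectl patch configmap config-domain \
    --namespace knative-serving \
    --type merge \
    --patch "$(domain_patch "$1" "$2")"
}
```

Usage would look like set_custom_domain "${KF_NAME}" "${GCP_PROJECT_ID}". Running domain_patch on its own first shows what would be applied.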

3. Official installation of Knative Serving

The official documentation for installing Knative Serving is at https://knative.dev/docs/install/install-serving-with-yaml/. We need to follow slightly different installation steps:

  1. kubectl apply -f https://github.com/knative/serving/releases/download/v0.22.0/serving-crds.yaml
  2. kubectl apply -f https://github.com/knative/serving/releases/download/v0.22.0/serving-core.yaml
  3. Istio is already installed within ASM installation.
  4. Install cluster-local-gateway with the command make install-asm-cluster-local-gateway. See the patch: https://github.com/zijianjoy/gcp-blueprints/commit/84d460053eab7580bc658924e53f5d63aafd7db6
  5. Install Knative Istio controller: kubectl label namespace knative-serving istio.io/rev=asm-192-1
  6. Verify the installation: kubectl get pods --namespace knative-serving
  7. I skipped the DNS configuration at https://knative.dev/docs/install/install-serving-with-yaml/#configure-dns
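
Steps 1, 2 and 6 above can be sketched as a small script. This is an illustration only: the function names and the KNATIVE_VERSION variable are hypothetical, and the 300s timeout is an arbitrary choice.

```shell
# Version used in this thread; override by exporting KNATIVE_VERSION.
KNATIVE_VERSION="${KNATIVE_VERSION:-v0.22.0}"

# Print the release URL of a Knative Serving manifest, e.g. serving-crds.yaml.
knative_serving_url() {
  printf 'https://github.com/knative/serving/releases/download/%s/%s\n' \
    "$KNATIVE_VERSION" "$1"
}

# Apply the CRDs first, then the core components, then wait for the pods.
install_knative_serving() {
  kubectl apply -f "$(knative_serving_url serving-crds.yaml)"
  kubectl apply -f "$(knative_serving_url serving-core.yaml)"
  # Block until every pod in the namespace reports Ready, or time out.
  kubectl wait --for=condition=Ready pods --all \
    --namespace knative-serving --timeout=300s
}
```

Calling install_knative_serving needs cluster access. knative_serving_url can be run on its own to check which manifests would be fetched.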

4. kubeflow/manifests installation of KFserving

I previously deployed KFServing (kfserving-controller:v0.5.1) from https://github.com/zijianjoy/gcp-blueprints/blob/match-upstream-kfserving/apps/kfserving/installs/generic/kustomization.yaml. This installation is currently coupled with apps/kubeflow-apps/apps/kustomization.yaml. Deploy KFServing if it is not deployed yet.

5. KFserving customization

Based on https://github.com/kubeflow/kfserving#prerequisites, we are using Knative v0.19.0+, so we need to make the following edit to inference service config:

kubectl edit configmap inferenceservice-config -n kubeflow

Old:

        "localGateway" : "cluster-local-gateway.knative-serving",
        "localGatewayService" : "cluster-local-gateway.istio-system.svc.cluster.local"

New:

        "localGateway" : "knative-local-gateway.knative-serving",
        "localGatewayService" : "knative-local-gateway.istio-system.svc.cluster.local"
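
The Old/New edit above can also be made without an interactive editor. The sketch below assumes that every occurrence of cluster-local-gateway in the ConfigMap should be renamed, which is worth checking against the actual content first. Both function names are hypothetical.

```shell
# Rewrite the old gateway name to the Knative default on standard input.
use_knative_local_gateway() {
  sed 's/cluster-local-gateway/knative-local-gateway/g'
}

# Hypothetical wrapper: dump the ConfigMap, rewrite it, and re-apply it.
# Needs kubectl access to the cluster.
update_inferenceservice_config() {
  kubectl get configmap inferenceservice-config --namespace kubeflow -o yaml \
    | use_knative_local_gateway \
    | kubectl apply -f -
}
```

Piping the output of kubectl get through use_knative_local_gateway alone, without the final apply, gives a preview of the change.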

At this point the installation is complete. To validate that it works, we need to bring up an example InferenceService and call its endpoint.

zijianjoy commented 3 years ago

Follow ASM + IAP example

I ran the example based on https://github.com/kubeflow/kfserving/tree/master/docs/samples/gcp-iap. GCP IAP is already set up.

Create inference service

kubectl apply -f sklearn-iap-no-authz.yaml

Note that I added namespace: ${PROFILE_NAME_CREATED_BY_USER} to sklearn-iap-no-authz.yaml, because we shouldn't deploy an InferenceService in the kubeflow namespace: https://github.com/kubeflow/kfserving/#kfserving-with-kubeflow-installation. ${PROFILE_NAME_CREATED_BY_USER} is the name of the profile I created from the Kubeflow central dashboard.

Create virtual service

Follow "Expose the inference service externally using an additional Istio Virtual Service" to install the virtual service, but use this file with your own customization: https://github.com/zijianjoy/gcp-blueprints/commit/10d2983a1aebd2f4563cf46436f92a143a0cf526.

Replace jamxl with your ${PROFILE_NAME_CREATED_BY_USER} so that the URL prefix maps to the local path of your InferenceService.

kubectl apply -f exp/sample/jamxl-virtual-service.yaml

Test the external predict endpoint

Follow the instructions at https://github.com/kubeflow/kfserving/tree/master/docs/samples/gcp-iap#test-the-external-predict-endpoint-using-iap_requestpy to call the InferenceService endpoint. If you see prediction: [1, 1], the call was successful.

Issue: Enable IAP for inferenceService

After adding the annotation sidecar.istio.io/inject: "true" to the InferenceService sklearn-iap, requests fail with the error message "The server encountered a temporary error and could not complete your request.<p>Please try again in 30 seconds." I am trying to figure out how to call the predict endpoint with authentication and authorization enabled.

Possible resolution: https://github.com/kubeflow/gcp-blueprints/issues/178

Bobgy commented 3 years ago

FYI, here's KFP SDK code to authenticate for IAP: https://github.com/kubeflow/pipelines/blob/85cb99173dead8bd2ca09c8e040b137f59d00ad7/sdk/python/kfp/_auth.py#L61-L84

zijianjoy commented 3 years ago

Thank you Yuan for the reference!

I have validated that KFServing works properly with the correct curl request and AuthorizationPolicy. Furthermore, we can simplify the installation steps from https://github.com/kubeflow/gcp-blueprints/issues/209#issuecomment-818012309:

Detailed integration steps

Istio Customization

No need for steps 2, 3, 4

Official installation of Knative Serving

No need for step 4


For the ASM IAP sample, we need to make the following changes:

ASM + IAP example

The virtual service is not needed; we can remove it.

AuthorizationPolicy

apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: sklearn-iap
  namespace: jamxl
spec:
  action: ALLOW
  rules: 
  - {}
  selector:
    matchLabels:
      app: kfserving-app

InferenceService

  labels:
    app: kfserving-app
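
The policy above hardcodes the jamxl namespace. A minimal sketch for rendering the same policy for another profile namespace follows. The function name render_authorization_policy is hypothetical. An ALLOW policy with a single empty rule matches every request to the selected workloads, so this is only suitable for testing.

```shell
# Print the AuthorizationPolicy from this thread for a given profile namespace.
render_authorization_policy() {
  namespace="$1"
  cat <<EOF
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: sklearn-iap
  namespace: ${namespace}
spec:
  action: ALLOW
  rules:
  - {}
  selector:
    matchLabels:
      app: kfserving-app
EOF
}
```

Usage would look like render_authorization_policy "${PROFILE_NAME_CREATED_BY_USER}" | kubectl apply -f -. The selector only takes effect if the InferenceService carries the app: kfserving-app label shown above.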

We need to make a correct request to the IAP-guarded network. Will follow up later.


Action items

  1. Make all the installations into kustomize structure (Kubeflow 1.3 required)
  2. Update KFServing for IAP+ASM sample

zijianjoy commented 3 years ago

The KFServing WG recommends KFServing 0.5.1 with Knative 0.17.4 because they have been using this configuration in production. However, the upstream kubeflow/manifests has an out-of-date Knative version, 0.14, and we are very close to the release date for Kubeflow 1.3.

Adopting the recommended Knative v0.17.4 has been very challenging, because this version is old and hard to install correctly without actively maintained documentation. I spent a day on it but couldn't make the integration work. Given that we lack proper support for this Knative manifest, and that I was able to get the integration working with Knative 0.22.0, I am leaning towards using Knative 0.22.0 for the Kubeflow 1.3 release.

Bobgy commented 3 years ago

Makes sense to me, thank you for the efforts!

zijianjoy commented 3 years ago

Thank you Yuan for confirming!