zijianjoy closed this issue 3 years ago
@zijianjoy I previously investigated this problem in https://github.com/kubeflow/gcp-blueprints/issues/176#issuecomment-759235650.
I suggest moving this to the kubeflow/gcp-blueprints repo, because the cluster-local-gateway installation should be part of the istio-operator definition, so it will probably be specific to Anthos Service Mesh.
I think the best references for KFServing and Knative installation are their official docs rather than the manifests README.
https://github.com/kubeflow/kfserving https://knative.dev/docs/install/
According to the KFServing doc, it seems to support only usage with Istio, so we can ignore Knative installation methods that do not depend on Istio.
For Knative, I found https://knative.dev/docs/install/knative-with-operators/. An operator typically supports upgrades better in the long term, but the Knative operator is still in the Alpha phase, so we should skip it and instead follow the installation doc at https://knative.dev/docs/install/install-serving-with-yaml/#install-a-networking-layer.
Renamed the title to make this the central tracker for KFServing.
Thank you so much Yuan for all the helpful information!
I am able to install Knative Serving v0.22.0 from the official doc, cluster-local-gateway within ASM, and KFServing using kubeflow/manifests. I am then able to call the predict endpoint of the sklearn iris inferenceService without sidecar injection. The current blocker is calling the predict endpoint with sidecar injection (which means IAP enabled).
If we have an existing deployment of Knative and cluster-local-gateway, we need to clean it up first, namely the following targets in the kubeflow/manifests repo:
- knative/upstream/knative-serving-crds/base
- knative/upstream/knative-serving-install/base
- istio/istio-1-9-0/cluster-local-gateway/base
Reason: KFServing requires Knative v0.17.4+, but kubeflow/manifests contains v0.14.3. We also need to use the ASM approach for cluster-local-gateway: https://github.com/kubeflow/gcp-blueprints/issues/176#issuecomment-759235650
ASM has already installed Istio for us on GCP, but I customized the setup to use the Istio mTLS feature with the following steps:
1. Label the knative-serving namespace: kubectl label namespace knative-serving istio.io/rev=asm-192-1
2. Set PeerAuthentication to PERMISSIVE on the knative-serving system namespace: see the Knative doc on using the Istio mTLS feature.
3. Replace local-gateway.knative-serving.knative-local-gateway with local-gateway.knative-serving.cluster-local-gateway: "cluster-local-gateway.istio-system.svc.cluster.local".
4. Run kubectl edit gateway cluster-local-gateway -n knative-serving and set istio: cluster-local-gateway
5. Run kubectl edit cm config-domain --namespace knative-serving and replace _example: and its content with {KF_NAME}.endpoints.{GCP_PROJECT_ID}.cloud.goog: ""
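The config-domain change can also be written as a merge patch rather than an interactive edit. This is only a minimal sketch and not the procedure used in this thread: the helper name config_domain_patch and the values my-kf and my-project are hypothetical, and it relies on the JSON merge patch rule that setting a key to null removes it.

```python
import json


def config_domain_patch(kf_name: str, gcp_project_id: str) -> str:
    """Build a JSON merge patch for the config-domain ConfigMap.

    Hypothetical helper: it removes the `_example` key (null deletes a key
    in a JSON merge patch) and adds the Cloud Endpoints domain with an
    empty string as its value.
    """
    domain = f"{kf_name}.endpoints.{gcp_project_id}.cloud.goog"
    patch = {"data": {"_example": None, domain: ""}}
    return json.dumps(patch)


if __name__ == "__main__":
    # Placeholder values; substitute your own KF_NAME and GCP_PROJECT_ID.
    print(config_domain_patch("my-kf", "my-project"))
```

The printed string could then be passed to kubectl patch cm config-domain --namespace knative-serving --type merge -p '<patch>'.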
The official documentation for Knative Serving installation is at https://knative.dev/docs/install/install-serving-with-yaml/. We need to follow slightly different installation steps:
1. kubectl apply -f https://github.com/knative/serving/releases/download/v0.22.0/serving-crds.yaml
2. kubectl apply -f https://github.com/knative/serving/releases/download/v0.22.0/serving-core.yaml
3. Install cluster-local-gateway by using the command make install-asm-cluster-local-gateway. See the patch: https://github.com/zijianjoy/gcp-blueprints/commit/84d460053eab7580bc658924e53f5d63aafd7db6
4. kubectl label namespace knative-serving istio.io/rev=asm-192-1
5. kubectl get pods --namespace knative-serving
I have previously deployed KFServing (kfserving-controller:v0.5.1), which is in https://github.com/zijianjoy/gcp-blueprints/blob/match-upstream-kfserving/apps/kfserving/installs/generic/kustomization.yaml. This installation is currently coupled within apps/kubeflow-apps/apps/kustomization.yaml. We need to deploy KFServing if it is not deployed yet.
Based on https://github.com/kubeflow/kfserving#prerequisites, we are using Knative v0.19.0+, so we need to make the following edit to inference service config:
kubectl edit configmap inferenceservice-config -n kubeflow
Old:
"localGateway" : "cluster-local-gateway.knative-serving",
"localGatewayService" : "cluster-local-gateway.istio-system.svc.cluster.local"
New:
"localGateway" : "knative-local-gateway.knative-serving",
"localGatewayService" : "knative-local-gateway.istio-system.svc.cluster.local"
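The Old/New edit above amounts to rewriting two fields in the JSON held by the ConfigMap. A minimal sketch of that rewrite, assuming the fragment is valid JSON once extracted from the ConfigMap; the helper name use_knative_local_gateway is hypothetical and the sketch does not talk to the cluster:

```python
import json

OLD_GATEWAY = "cluster-local-gateway"
NEW_GATEWAY = "knative-local-gateway"


def use_knative_local_gateway(ingress_json: str) -> str:
    """Return the JSON with both local gateway fields switched to the new name.

    Hypothetical helper: other keys in the JSON object are left untouched.
    """
    config = json.loads(ingress_json)
    for key in ("localGateway", "localGatewayService"):
        if key in config:
            config[key] = config[key].replace(OLD_GATEWAY, NEW_GATEWAY)
    return json.dumps(config, indent=4)


if __name__ == "__main__":
    old = """{
        "localGateway" : "cluster-local-gateway.knative-serving",
        "localGatewayService" : "cluster-local-gateway.istio-system.svc.cluster.local"
    }"""
    print(use_knative_local_gateway(old))
```

Only the gateway name is replaced, so the .knative-serving and .istio-system.svc.cluster.local suffixes stay as they were.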
At this point the installation is complete. To validate that it works, we need to bring up an example inferenceService and call its endpoint.
I ran the example based on https://github.com/kubeflow/kfserving/tree/master/docs/samples/gcp-iap. GCP IAP is already set up.
kubectl apply -f sklearn-iap-no-authz.yaml
Note that I added namespace: ${PROFILE_NAME_CREATED_BY_USER} in sklearn-iap-no-authz.yaml, because we shouldn't deploy an inferenceService in the kubeflow namespace: https://github.com/kubeflow/kfserving/#kfserving-with-kubeflow-installation. ${PROFILE_NAME_CREATED_BY_USER} is the profile name I created from the Kubeflow central dashboard.
Follow "Expose the inference service externally using an additional Istio Virtual Service" to install the virtual service, but use this file with your own customization: https://github.com/zijianjoy/gcp-blueprints/commit/10d2983a1aebd2f4563cf46436f92a143a0cf526.
You need to replace jamxl with your ${PROFILE_NAME_CREATED_BY_USER} so that it maps the URL prefix to your inferenceService local path.
kubectl apply -f exp/sample/jamxl-virtual-service.yaml
Follow the instructions at https://github.com/kubeflow/kfserving/tree/master/docs/samples/gcp-iap#test-the-external-predict-endpoint-using-iap_requestpy to call the inferenceService endpoint. If you see prediction: [1, 1] then you are successful.
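For orientation, here is a rough sketch of the shape of such a call, independent of iap_request.py: an IAP-protected endpoint expects an OIDC ID token in an Authorization: Bearer header. The helper name, URL, token, and payload below are hypothetical placeholders, the {"instances": ...} body assumes the KFServing V1 prediction protocol, and obtaining the token is out of scope here.

```python
import json
import urllib.request


def build_predict_request(url: str, id_token: str, instances: list) -> urllib.request.Request:
    """Build (but do not send) a POST request for a predict endpoint behind IAP.

    Hypothetical helper: `url` is the external predict URL and `id_token`
    is an OIDC ID token obtained separately.
    """
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": f"Bearer {id_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder URL, token, and input rows; nothing is sent over the network.
    req = build_predict_request(
        "https://example.invalid/predict-path",
        "PLACEHOLDER_ID_TOKEN",
        [[6.8, 2.8, 4.8, 1.4], [6.0, 3.4, 4.5, 1.6]],
    )
    print(req.get_method(), req.full_url)
```

Sending the request with urllib.request.urlopen(req) is left out on purpose so the sketch can run without a cluster.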
Add the annotation sidecar.istio.io/inject: "true" to the inferenceService sklearn-iap, and the issue appears with the error message: The server encountered a temporary error and could not complete your request.<p>Please try again in 30 seconds. I am trying to figure out a way to call the predict endpoint with authentication and authorization enabled.
Possible resolution: https://github.com/kubeflow/gcp-blueprints/issues/178
FYI, here's KFP SDK code to authenticate for IAP: https://github.com/kubeflow/pipelines/blob/85cb99173dead8bd2ca09c8e040b137f59d00ad7/sdk/python/kfp/_auth.py#L61-L84
Thank you Yuan for the reference!
I have validated that KFServing works properly with the correct curl request and AuthorizationPolicy. Furthermore, we can simplify the installation steps from https://github.com/kubeflow/gcp-blueprints/issues/209#issuecomment-818012309:
No need for steps 2, 3, 4
No need for step 4
For the ASM IAP sample, we need to make the following changes:
The virtual service is not needed, so we can remove it.
AuthorizationPolicy
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: sklearn-iap
  namespace: jamxl
spec:
  action: ALLOW
  rules:
  - {}
  selector:
    matchLabels:
      app: kfserving-app
InferenceService
labels:
  app: kfserving-app
We need to make a correct request to the IAP-guarded network. Will follow up later.
The KFServing WG recommends KFServing 0.5.1 with Knative 0.17.4 because they have been using this configuration in production. However, the upstream kubeflow/manifests has an out-of-date Knative version, 0.14, and we are very close to the release date for Kubeflow 1.3.
It has been very challenging to adopt the recommended Knative v0.17.4, because this version is old and hard to install correctly without actively maintained documentation. I spent a day on it but couldn't make the integration work. Given that we lack proper support for this Knative manifest, and that I am able to get the integration working with Knative 0.22.0, I am leaning towards using Knative 0.22.0 for the Kubeflow 1.3 release.
Makes sense to me, thank you for the efforts!
Thank you Yuan for confirming!
High-level tracking: https://github.com/kubeflow/manifests/issues/1798
To prepare for the GCP distribution of Kubeflow 1.3, we need to make sure the KFServing installation works. Currently we do not install cluster-local-gateway in our deployment steps, but it is needed according to the KFServing GCP/IAP Example documentation. Based on the manifests README, I ran the following command to build cluster-local-gateway:
The build result is in https://github.com/kubeflow/gcp-blueprints/commit/f2f05bb5c07012acfaaa31575a3235761bc24e82.
However, the cluster-local-gateway deployment is failing with the following error:
How should I correctly configure the certificate and CA? Since I am using Anthos Service Mesh 1.9.2-1, pilot is not a separate pod but lives inside istiod. See the pods below:
cc @Bobgy @yanniszark