awslabs / kubeflow-manifests

KubeFlow on AWS
https://awslabs.github.io/kubeflow-manifests/
Apache License 2.0

Logs not showing for Inference Service #145

Closed AlexandreBrown closed 1 year ago

AlexandreBrown commented 2 years ago

Describe the bug
Model web app logs are not showing (screenshot attached). The browser console shows that a request to fetch the logs returns a 500 in a loop (screenshots attached).

Steps To Reproduce

Expected behavior

Environment

rrrkharse commented 2 years ago

Will repro this and see what is going on

rrrkharse commented 2 years ago

Hi @AlexandreBrown, surprisingly I am unable to reproduce the issue so far.

My setup:

I was able to view logs:

(screenshot attached)

However, I have a few questions:

Additional debugging:

There are more logs in the kfserving-models-web-app pod. For me, I was able to view them via:

kubectl logs -n kubeflow kfserving-models-web-app-7884f597cf-cllp5

In the same browser window where I am logged into the central dashboard as the user user1, I opened a new tab and pasted the following URL: https://kubeflow.kfissue145.rkharse.people.aws.dev/models/api/namespaces/user1/inferenceservices/sklearn-iris?logs=true&component=predictor.

Broken down, this URL is: https://<public dns endpoint>/models/api/namespaces/<user profile namespace>/inferenceservices/<model name>?logs=true&component=predictor

Meanwhile I tailed the logs

kubectl logs -n kubeflow kfserving-models-web-app-7884f597cf-cllp5 -f

and got the following response:

2022-03-24 21:33:37,422 | kubeflow.kubeflow.crud_backend.authn | INFO | Handling request for user: User1@rkharse.com
2022-03-24 21:33:37,422 | kubeflow.kubeflow.crud_backend.csrf | INFO | Skipping CSRF check for safe method: GET
2022-03-24 21:33:37,445 | apps.common.routes.get | INFO | ['predictor']
2022-03-24 21:33:37,511 | root | INFO | Component pods: {'predictor': ['sklearn-iris-predictor-default-00001-deployment-58f94dc4946sb55']}
127.0.0.1 - - [24/Mar/2022:21:33:37 +0000] "GET /api/namespaces/user1/inferenceservices/sklearn-iris?logs=true&component=predictor HTTP/1.1" 200 6137 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:91.0) Gecko/20100101 Firefox/91.0"

In your case I believe there will be more details here and a 500 error.

I will continue trying to repro but feel free to let me know if this helped uncover any additional details.
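
For reference, the same API call can also be made outside the browser. Here is a minimal Python sketch, assuming you copy a valid session cookie from the browser; the endpoint, namespace, model name, cookie name and cookie value are all placeholders that depend on your deployment and auth setup:

import requests

# Placeholder values -- substitute your own endpoint, profile namespace,
# model name and session cookie; the cookie name is an assumption and
# depends on how authentication is configured in your deployment.
base = "https://<public dns endpoint>"
url = f"{base}/models/api/namespaces/<user profile namespace>/inferenceservices/<model name>"
cookies = {"authservice_session": "<session cookie copied from the browser>"}

resp = requests.get(
    url,
    params={"logs": "true", "component": "predictor"},
    cookies=cookies,
)
print(resp.status_code)  # 200 when the logs load, 500 when the backend errors out
print(resp.json())       # the JSON body discussed below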

AlexandreBrown commented 2 years ago

@rrrkharse

With regards to the KFServing setup you mentioned in Steps To Reproduce, which setup steps are you referring to?

I am referring to this guide

If you have run the setup from https://github.com/awslabs/kubeflow-manifests/issues/82, did you also apply the authorization policy configurations from https://github.com/awslabs/kubeflow-manifests/issues/82#issuecomment-1068641378?

Yes I have. I also tried re-installing Kubeflow multiple times; it didn't fix it.

Are you able to view the inference pod in the profile's namespace?

Will check this as well as your other questions and get back to you.

I tried visiting https://<public dns endpoint>/models/api/namespaces/<profile namespace>/inferenceservices/<model name>?logs=true&component=predictor (with my values substituted)

and the output was:

{
  "log": "An error occured in the backend.", 
  "status": 500, 
  "success": false, 
  "user": "my-email"
}

Here are the logs from the network tab in Chrome (screenshot attached).

Looks like only the logs request is failing; the model is recognized as healthy.

Also, I am using Kubeflow without port-forward and I've set up the Grafana metrics, which work; it's just the Logs tab that does not work for some odd reason.

AlexandreBrown commented 2 years ago

@rrrkharse @goswamig @surajkota @ryansteakley

Hello, could we get some help on this issue?

Command :

kubectl get inferenceservice -n dev

Output :

NAME                  URL                                                   READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                           AGE
sklearn-iris          http://sklearn-iris.dev.ai-rd.mdinfinity.com/          True           100                              sklearn-iris-predictor-default-00001          3m35s  

Command :

kubectl get pods --all-namespaces

Output :

NAMESPACE            NAME                                                              READY   STATUS    RESTARTS   AGE
dev                  sklearn-iris-predictor-default-00001-deployment-5558697cb-t8q7b   3/3     Running   0          3m32s

(All other pods running)

Command :

kubectl logs -n dev sklearn-iris-predictor-default-00001-deployment-5558697cb-t8q7b -c kfserving-container

Output :

[I 220404 23:46:49 storage:50] Copying contents of /mnt/models to local
/usr/local/lib/python3.7/site-packages/sklearn/linear_model/least_angle.py:35: DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`. To silence this warning, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  eps=np.finfo(np.float).eps,
/usr/local/lib/python3.7/site-packages/sklearn/linear_model/least_angle.py:597: DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`. To silence this warning, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  eps=np.finfo(np.float).eps, copy_X=True, fit_path=True,
/usr/local/lib/python3.7/site-packages/sklearn/linear_model/least_angle.py:836: DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`. To silence this warning, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  eps=np.finfo(np.float).eps, copy_X=True, fit_path=True,
/usr/local/lib/python3.7/site-packages/sklearn/linear_model/least_angle.py:862: DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`. To silence this warning, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  eps=np.finfo(np.float).eps, positive=False):
/usr/local/lib/python3.7/site-packages/sklearn/linear_model/least_angle.py:1097: DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`. To silence this warning, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  max_n_alphas=1000, n_jobs=None, eps=np.finfo(np.float).eps,
/usr/local/lib/python3.7/site-packages/sklearn/linear_model/least_angle.py:1344: DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`. To silence this warning, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  max_n_alphas=1000, n_jobs=None, eps=np.finfo(np.float).eps,
/usr/local/lib/python3.7/site-packages/sklearn/linear_model/least_angle.py:1480: DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`. To silence this warning, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  eps=np.finfo(np.float).eps, copy_X=True, positive=False):
/usr/local/lib/python3.7/site-packages/sklearn/linear_model/randomized_l1.py:152: DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`. To silence this warning, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  precompute=False, eps=np.finfo(np.float).eps,
/usr/local/lib/python3.7/site-packages/sklearn/linear_model/randomized_l1.py:320: DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`. To silence this warning, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  eps=np.finfo(np.float).eps, random_state=None,
/usr/local/lib/python3.7/site-packages/sklearn/linear_model/randomized_l1.py:580: DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`. To silence this warning, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  eps=4 * np.finfo(np.float).eps, n_jobs=None,
[I 220404 23:46:49 kfserver:151] Registering model: sklearn-iris
[I 220404 23:46:49 kfserver:121] Setting asyncio max_workers as 5
[I 220404 23:46:49 kfserver:128] Listening on port 8080
[I 220404 23:46:49 kfserver:130] Will fork 1 workers

Command :

kubectl logs -n kubeflow kfserving-models-web-app-7884f597cf-bgpk4 --tail=30

Output :

Using deprecated annotation `kubectl.kubernetes.io/default-logs-container` in pod/kfserving-models-web-app-7884f597cf-bgpk4. Please use `kubectl.kubernetes.io/default-container` instead
2022-04-05 00:20:26,660 | kubeflow.kubeflow.crud_backend.authn | INFO | Handling request for user: myemail@mycompany.com
2022-04-05 00:20:26,660 | kubeflow.kubeflow.crud_backend.csrf | INFO | Skipping CSRF check for safe method: GET
127.0.0.1 - - [05/Apr/2022:00:20:26 +0000] "GET /api/namespaces/dev/configurations/sklearn-iris-predictor-default HTTP/1.1" 200 5691 "https://our_kubeflow_url.com/models/details/dev/sklearn-iris" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36"
2022-04-05 00:20:26,731 | kubeflow.kubeflow.crud_backend.authn | INFO | Handling request for user: myemail@mycompany.com
2022-04-05 00:20:26,731 | kubeflow.kubeflow.crud_backend.csrf | INFO | Skipping CSRF check for safe method: GET
127.0.0.1 - - [05/Apr/2022:00:20:26 +0000] "GET /api/namespaces/dev/knativeServices/sklearn-iris-predictor-default HTTP/1.1" 200 6253 "https://our_kubeflow_url.com/models/details/dev/sklearn-iris" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36"
2022-04-05 00:20:26,816 | kubeflow.kubeflow.crud_backend.authn | INFO | Handling request for user: myemail@mycompany.com
2022-04-05 00:20:26,816 | kubeflow.kubeflow.crud_backend.csrf | INFO | Skipping CSRF check for safe method: GET
127.0.0.1 - - [05/Apr/2022:00:20:26 +0000] "GET /api/namespaces/dev/routes/sklearn-iris-predictor-default HTTP/1.1" 200 4226 "https://our_kubeflow_url.com/models/details/dev/sklearn-iris" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36"
2022-04-05 00:20:27,530 | kubeflow.kubeflow.crud_backend.authn | INFO | Handling request for user: myemail@mycompany.com
2022-04-05 00:20:27,530 | kubeflow.kubeflow.crud_backend.csrf | INFO | Skipping CSRF check for safe method: GET
2022-04-05 00:20:27,542 | apps.common.routes.get | INFO | ['predictor']
2022-04-05 00:20:27,557 | kubeflow.kubeflow.crud_backend.errors.handlers | ERROR | Caught and unhandled Exception!
2022-04-05 00:20:27,557 | kubeflow.kubeflow.crud_backend.errors.handlers | ERROR | argument of type 'NoneType' is not iterable
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1513, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1499, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
  File "/src/apps/common/routes/get.py", line 27, in get_inference_service
    "serviceLogs", get_inference_service_logs(inference_service)
  File "/src/apps/common/routes/get.py", line 40, in get_inference_service_logs
    component_pods_dict = utils.get_inference_service_pods(svc, components)
  File "/src/apps/common/utils.py", line 42, in get_inference_service_pods
    if KNATIVE_REVISION_LABEL not in pod.metadata.labels:
TypeError: argument of type 'NoneType' is not iterable
127.0.0.1 - - [05/Apr/2022:00:20:27 +0000] "GET /api/namespaces/dev/inferenceservices/sklearn-iris?logs=true&component=predictor HTTP/1.1" 500 133 "https://our_kubeflow_url.com/models/details/dev/sklearn-iris" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36"
2022-04-05 00:20:30,615 | kubeflow.kubeflow.crud_backend.authn | INFO | Handling request for user: myemail@mycompany.com
2022-04-05 00:20:30,615 | kubeflow.kubeflow.crud_backend.csrf | INFO | Skipping CSRF check for safe method: GET
127.0.0.1 - - [05/Apr/2022:00:20:30 +0000] "GET /api/namespaces/dev/inferenceservices HTTP/1.1" 200 4740 "https://our_kubeflow_url.com/models/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36"

This last output seems to be where things went wrong.

if KNATIVE_REVISION_LABEL not in pod.metadata.labels:
TypeError: argument of type 'NoneType' is not iterable
AlexandreBrown commented 2 years ago

Perhaps this might be an issue with the RoleBinding?

Here is how we created the namespace and added users to it:

  1. Create a profile/namespace

     apiVersion: kubeflow.org/v1beta1
     kind: Profile
     metadata:
       name: dev

  2. Create default-editor service account

     apiVersion: rbac.authorization.k8s.io/v1
     kind: ClusterRoleBinding
     metadata:
       name: default-editor-access-binding-dev
     subjects:
     - kind: ServiceAccount
       name: default-editor
       namespace: dev
     roleRef:
       kind: ClusterRole
       name: cluster-admin
       apiGroup: rbac.authorization.k8s.io

  3. Apply the RoleBinding for user (taken from kubeflow doc)

     apiVersion: rbac.authorization.k8s.io/v1
     kind: RoleBinding
     metadata:
       annotations:
         role: edit
         user: userid@email.com   # replace with the email of the user from your Active Directory, case sensitive
       name: user-userid-email-com-clusterrole-edit
       # Ex: if the user email is lalith.vaka@kp.org the name should be user-lalith-vaka-kp-org-clusterrole-edit
       # Note: if the user email is Lalith.Vaka@kp.org from your Active Directory, the name should be user-lalith-vaka-kp-org-clusterrole-edit
       namespace: profileName # replace with the namespace/profile name that you are adding contributors to
     roleRef:
       apiGroup: rbac.authorization.k8s.io
       kind: ClusterRole
       name: kubeflow-edit
     subjects:
     - apiGroup: rbac.authorization.k8s.io
       kind: User
       name: userid@email.com   # replace with the email of the user from your Active Directory, case sensitive

  4. Apply Auth policy (again, taken from kubeflow doc)

     apiVersion: security.istio.io/v1beta1
     kind: AuthorizationPolicy
     metadata:
       annotations:
         role: edit
         user: userid@email.com # replace with the email of the user from your Active Directory, case sensitive
       name: user-userid-email-com-clusterrole-edit
       namespace: profileName # replace with the namespace/profile name that you are adding contributors to
     spec:
       action: ALLOW
       rules:
       - when:
         - key: request.headers[kubeflow-userid] # for GCP, use x-goog-authenticated-user-email instead of kubeflow-userid for authentication purpose
           values:
           - accounts.google.com:userid@email.com   # replace with the email of the user from your Active Directory, case sensitive

The values for steps 3 and 4 are set according to the user's email, profile, etc. Everything besides the model server logs works (notebook creation, pipelines, PVCs, etc.).

goswamig commented 2 years ago

Hey, the error in kfserving-models-web-app-7884f597cf-bgpk4 is not related to RBAC but something else.

Here is the code which does that check https://github.com/kserve/models-web-app/blob/b842420d45f0bcf596a219efc116e8c66a727683/backend/apps/common/utils.py#L47
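
To illustrate why that check can blow up, here is a minimal sketch (hypothetical names, not the web app's actual code; the label key is an assumption): a pod whose metadata carries no labels at all has metadata.labels set to None, and the membership check then raises exactly the TypeError from the traceback above. A defensive variant simply skips such pods:

from types import SimpleNamespace

KNATIVE_REVISION_LABEL = "serving.knative.dev/revision"  # assumed label key

def find_revision_pods(pods):
    """Return pods carrying the Knative revision label, skipping label-less pods."""
    matching = []
    for pod in pods:
        labels = pod.metadata.labels
        # A pod with no labels has metadata.labels == None; the original check
        # `KNATIVE_REVISION_LABEL not in pod.metadata.labels` then raises
        # "TypeError: argument of type 'NoneType' is not iterable".
        if labels and KNATIVE_REVISION_LABEL in labels:
            matching.append(pod)
    return matching

# Reproducing the failure mode with a stub pod that has no labels:
stub_pod = SimpleNamespace(metadata=SimpleNamespace(labels=None))
print(find_revision_pods([stub_pod]))  # [] instead of a TypeError

So the 500 likely means that some pod being iterated over has no labels set at all.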

Can you describe the pod and paste the output?

AlexandreBrown commented 2 years ago

@goswamig Command :

kubectl describe pod kfserving-models-web-app-7884f597cf-bgpk4 -n kubeflow

Output :

Name:         kfserving-models-web-app-7884f597cf-bgpk4
Namespace:    kubeflow
Priority:     0
Node:         ip-xxx-xxx-xx-xxx.us-east-2.compute.internal/xxx.xxx.xx.xxx
Start Time:   Tue, 22 Mar 2022 19:25:52 +0000
Labels:       app=kfserving
              app.kubernetes.io/component=kfserving-models-web-app
              app.kubernetes.io/name=kfserving
              istio.io/rev=default
              kustomize.component=kfserving-models-web-app
              pod-template-hash=7884f597cf
              security.istio.io/tlsMode=istio
              service.istio.io/canonical-name=kfserving
              service.istio.io/canonical-revision=latest
Annotations:  kubectl.kubernetes.io/default-logs-container: kfserving-models-web-app
              kubernetes.io/psp: eks.privileged
              prometheus.io/path: /stats/prometheus
              prometheus.io/port: 15020
              prometheus.io/scrape: true
              sidecar.istio.io/inject: true
              sidecar.istio.io/status:
                {"initContainers":["istio-init"],"containers":["istio-proxy"],"volumes":["istio-envoy","istio-data","istio-podinfo","istio-token","istiod-...
Status:       Running
IP:           xxx.xxx.xx.xxx
IPs:
  IP:           xxx.xxx.xx.xxx
Controlled By:  ReplicaSet/kfserving-models-web-app-7884f597cf
Init Containers:
  istio-init:
    Container ID:  docker://ac728ab75e1594e0f1a21eea389aa2a5b66b76d985f54c789fc8795672e95680
    Image:         docker.io/istio/proxyv2:1.9.6
    Image ID:      docker-pullable://istio/proxyv2@sha256:87a9db561d2ef628deea7a4cbd0adf008a2f64355a2796e3b840d445b7e9cd3e
    Port:          <none>
    Host Port:     <none>
    Args:
      istio-iptables
      -p
      15001
      -z
      15006
      -u
      1337
      -m
      REDIRECT
      -i
      *
      -x

      -b
      *
      -d
      15090,15021,15020
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 22 Mar 2022 19:25:55 +0000
      Finished:     Tue, 22 Mar 2022 19:25:55 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  1Gi
    Requests:
      cpu:        10m
      memory:     40Mi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xl7cg (ro)
Containers:
  kfserving-models-web-app:
    Container ID:   docker://e6b80d44ba0342e862c21a4aed4aa2903505cc534b716b9e200e72432e835283
    Image:          kfserving/models-web-app:v0.6.1
    Image ID:       docker-pullable://kfserving/models-web-app@sha256:f322d6ffdfa3b5a6fbbe3910c4cf4b5ad85871f20dbd2ed854d65f90e1fd8aba
    Port:           5000/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Tue, 22 Mar 2022 19:25:56 +0000
    Ready:          True
    Restart Count:  0
    Environment Variables from:
      kfserving-models-web-app-config-mtgm8bbd98  ConfigMap  Optional: false
    Environment:                                  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xl7cg (ro)
  istio-proxy:
    Container ID:  docker://839a6c4dec08563abb79edcd8c7a5da89d3404292c8c7fdff2c0922b79540948
    Image:         docker.io/istio/proxyv2:1.9.6
    Image ID:      docker-pullable://istio/proxyv2@sha256:87a9db561d2ef628deea7a4cbd0adf008a2f64355a2796e3b840d445b7e9cd3e
    Port:          15090/TCP
    Host Port:     0/TCP
    Args:
      proxy
      sidecar
      --domain
      $(POD_NAMESPACE).svc.cluster.local
      --serviceCluster
      kfserving.$(POD_NAMESPACE)
      --proxyLogLevel=warning
      --proxyComponentLogLevel=misc:error
      --log_output_level=default:info
      --concurrency
      2
    State:          Running
      Started:      Tue, 22 Mar 2022 19:25:57 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  1Gi
    Requests:
      cpu:      10m
      memory:   40Mi
    Readiness:  http-get http://:15021/healthz/ready delay=1s timeout=3s period=2s #success=1 #failure=30
    Environment:
      JWT_POLICY:                    third-party-jwt
      PILOT_CERT_PROVIDER:           istiod
      CA_ADDR:                       istiod.istio-system.svc:15012
      POD_NAME:                      kfserving-models-web-app-7884f597cf-bgpk4 (v1:metadata.name)
      POD_NAMESPACE:                 kubeflow (v1:metadata.namespace)
      INSTANCE_IP:                    (v1:status.podIP)
      SERVICE_ACCOUNT:                (v1:spec.serviceAccountName)
      HOST_IP:                        (v1:status.hostIP)
      CANONICAL_SERVICE:              (v1:metadata.labels['service.istio.io/canonical-name'])
      CANONICAL_REVISION:             (v1:metadata.labels['service.istio.io/canonical-revision'])
      PROXY_CONFIG:                  {"tracing":{}}

      ISTIO_META_POD_PORTS:          [
                                         {"containerPort":5000,"protocol":"TCP"}
                                     ]
      ISTIO_META_APP_CONTAINERS:     kfserving-models-web-app
      ISTIO_META_CLUSTER_ID:         Kubernetes
      ISTIO_META_INTERCEPTION_MODE:  REDIRECT
      ISTIO_METAJSON_ANNOTATIONS:    {"kubernetes.io/psp":"eks.privileged","sidecar.istio.io/inject":"true"}

      ISTIO_META_WORKLOAD_NAME:      kfserving-models-web-app
      ISTIO_META_OWNER:              kubernetes://apis/apps/v1/namespaces/kubeflow/deployments/kfserving-models-web-app
      ISTIO_META_MESH_ID:            cluster.local
      TRUST_DOMAIN:                  cluster.local
    Mounts:
      /etc/istio/pod from istio-podinfo (rw)
      /etc/istio/proxy from istio-envoy (rw)
      /var/lib/istio/data from istio-data (rw)
      /var/run/secrets/istio from istiod-ca-cert (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xl7cg (ro)
      /var/run/secrets/tokens from istio-token (rw)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  istio-envoy:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  istio-data:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  istio-podinfo:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.labels -> labels
      metadata.annotations -> annotations
      limits.cpu -> cpu-limit
      requests.cpu -> cpu-request
  istio-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  43200
  istiod-ca-cert:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      istio-ca-root-cert
    Optional:  false
  kube-api-access-xl7cg:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>

I also tried deleting the pod so it gets recreated, but the result is the same: the models web app pod is running, yet no logs are shown.

AlexandreBrown commented 2 years ago

Here is the InferenceService definition and the description of its inference pod

apiVersion: "serving.kubeflow.org/v1beta1"
kind: "InferenceService"
metadata:
  name: "torchscript-cifar"
spec:
  predictor:
    triton:
      storageUri: "gs://kfserving-examples/models/torchscript"
      runtimeVersion: 21.08-py3
      env:
      - name: OMP_NUM_THREADS
        value: "1"

describe :

Name:         torchscript-cifar-predictor-default-00001-deployment-579fdpzzbj
Namespace:    dev
Priority:     0
Node:         ip-xxx-xxx-xx-xxx.us-xxxx-x.compute.internal/xxx.xxx.xx.xxx
Start Time:   Wed, 06 Apr 2022 12:47:26 +0000
Labels:       app=torchscript-cifar-predictor-default-00001
              component=predictor
              istio.io/rev=default
              pod-template-hash=579fdc9658
              security.istio.io/tlsMode=istio
              service.istio.io/canonical-name=torchscript-cifar-predictor-default
              service.istio.io/canonical-revision=torchscript-cifar-predictor-default-00001
              serving.knative.dev/configuration=torchscript-cifar-predictor-default
              serving.knative.dev/configurationGeneration=1
              serving.knative.dev/configurationUID=bc0ebece-d60b-4e50-a830-1bc893088a4a
              serving.knative.dev/revision=torchscript-cifar-predictor-default-00001
              serving.knative.dev/revisionUID=e3bb9a41-5ab1-42fe-9955-5cbbf6d9b609
              serving.knative.dev/service=torchscript-cifar-predictor-default
              serving.knative.dev/serviceUID=25b5d802-3a0e-4a6f-8b64-aa70bf0d89a3
              serving.kubeflow.org/inferenceservice=torchscript-cifar
Annotations:  autoscaling.knative.dev/class: kpa.autoscaling.knative.dev
              autoscaling.knative.dev/minScale: 1
              internal.serving.kubeflow.org/storage-initializer-sourceuri: gs://kfserving-examples/models/torchscript
              kubernetes.io/psp: eks.privileged
              prometheus.io/path: /stats/prometheus
              prometheus.io/port: 15020
              prometheus.io/scrape: true
              serving.knative.dev/creator: system:serviceaccount:kubeflow:default
              sidecar.istio.io/status:
                {"initContainers":["istio-init"],"containers":["istio-proxy"],"volumes":["istio-envoy","istio-data","istio-podinfo","istio-token","istiod-...
Status:       Running
IP:           xxx.xxx.xx.xxx
IPs:
  IP:           xxx.xxx.xx.xxx
Controlled By:  ReplicaSet/torchscript-cifar-predictor-default-00001-deployment-579fdc9658
Init Containers:
  storage-initializer:
    Container ID:  docker://d55da9dde917af69abcaadaa28464351257bb51b0b7c3f133e7913c12b294eb4
    Image:         kfserving/storage-initializer:v0.6.1
    Image ID:      docker-pullable://kfserving/storage-initializer@sha256:81b654574aec1dbce37d6e5b8d5d58b5d820fdf17e35b04cd52b1c805595fc37
    Port:          <none>
    Host Port:     <none>
    Args:
      gs://kfserving-examples/models/torchscript
      /mnt/models
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 06 Apr 2022 12:47:27 +0000
      Finished:     Wed, 06 Apr 2022 12:47:28 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  1Gi
    Requests:
      cpu:        100m
      memory:     100Mi
    Environment:  <none>
    Mounts:
      /mnt/models from kfserving-provision-location (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-j5h5h (ro)
  istio-init:
    Container ID:  docker://5ec712c6062b13d3517d600395dbce6e178fa9fe6775b0a4014159be96596f7d
    Image:         docker.io/istio/proxyv2:1.9.6
    Image ID:      docker-pullable://istio/proxyv2@sha256:87a9db561d2ef628deea7a4cbd0adf008a2f64355a2796e3b840d445b7e9cd3e
    Port:          <none>
    Host Port:     <none>
    Args:
      istio-iptables
      -p
      15001
      -z
      15006
      -u
      1337
      -m
      REDIRECT
      -i
      *
      -x

      -b
      *
      -d
      15090,15021,15020
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 06 Apr 2022 12:47:29 +0000
      Finished:     Wed, 06 Apr 2022 12:47:29 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  1Gi
    Requests:
      cpu:        10m
      memory:     40Mi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-j5h5h (ro)
Containers:
  kfserving-container:
    Container ID:  docker://94d0a87961741aae936b52f45c709a86f7c99dd78359b761009b9bf8d5b1eaed
    Image:         nvcr.io/nvidia/tritonserver@sha256:6a1ef84e93327fef53ac11379fa4e53082c9e228de5db24e0abdfc5aff5676cf
    Image ID:      docker-pullable://nvcr.io/nvidia/tritonserver@sha256:6a1ef84e93327fef53ac11379fa4e53082c9e228de5db24e0abdfc5aff5676cf
    Port:          8080/TCP
    Host Port:     0/TCP
    Args:
      tritonserver
      --model-store=/mnt/models
      --grpc-port=9000
      --http-port=8080
      --allow-grpc=true
      --allow-http=true
    State:          Running
      Started:      Wed, 06 Apr 2022 12:50:37 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  2Gi
    Requests:
      cpu:     1
      memory:  2Gi
    Environment:
      OMP_NUM_THREADS:  1
      PORT:             8080
      K_REVISION:       torchscript-cifar-predictor-default-00001
      K_CONFIGURATION:  torchscript-cifar-predictor-default
      K_SERVICE:        torchscript-cifar-predictor-default
    Mounts:
      /mnt/models from kfserving-provision-location (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-j5h5h (ro)
  queue-proxy:
    Container ID:   docker://088fcd1e9e419ea6878f4140bca29e094a038f02b2e77a96636453f022476ea2
    Image:          gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:0b8e031170354950f3395876961452af1c62f7ab5161c9e71867392c11881962
    Image ID:       docker-pullable://gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:0b8e031170354950f3395876961452af1c62f7ab5161c9e71867392c11881962
    Ports:          8022/TCP, 9090/TCP, 9091/TCP, 8012/TCP
    Host Ports:     0/TCP, 0/TCP, 0/TCP, 0/TCP
    State:          Running
      Started:      Wed, 06 Apr 2022 12:50:37 +0000
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:      25m
    Readiness:  http-get http://:15020/app-health/queue-proxy/readyz delay=0s timeout=1s period=1s #success=1 #failure=3
    Startup:    exec [/ko-app/queue -probe-timeout 10m0s] delay=0s timeout=600s period=1s #success=1 #failure=1
    Environment:
      SERVING_NAMESPACE:                      dev
      SERVING_SERVICE:                        torchscript-cifar-predictor-default
      SERVING_CONFIGURATION:                  torchscript-cifar-predictor-default
      SERVING_REVISION:                       torchscript-cifar-predictor-default-00001
      QUEUE_SERVING_PORT:                     8012
      CONTAINER_CONCURRENCY:                  0
      REVISION_TIMEOUT_SECONDS:               300
      SERVING_POD:                            torchscript-cifar-predictor-default-00001-deployment-579fdpzzbj (v1:metadata.name)
      SERVING_POD_IP:                          (v1:status.podIP)
      SERVING_LOGGING_CONFIG:                 
      SERVING_LOGGING_LEVEL:                  
      SERVING_REQUEST_LOG_TEMPLATE:           {"httpRequest": {"requestMethod": "{{.Request.Method}}", "requestUrl": "{{js .Request.RequestURI}}", "requestSize": "{{.Request.ContentLength}}", "status": {{.Response.Code}}, "responseSize": "{{.Response.Size}}", "userAgent": "{{js .Request.UserAgent}}", "remoteIp": "{{js .Request.RemoteAddr}}", "serverIp": "{{.Revision.PodIP}}", "referer": "{{js .Request.Referer}}", "latency": "{{.Response.Latency}}s", "protocol": "{{.Request.Proto}}"}, "traceId": "{{index .Request.Header "X-B3-Traceid"}}"}
      SERVING_ENABLE_REQUEST_LOG:             false
      SERVING_REQUEST_METRICS_BACKEND:        prometheus
      TRACING_CONFIG_BACKEND:                 none
      TRACING_CONFIG_ZIPKIN_ENDPOINT:         
      TRACING_CONFIG_STACKDRIVER_PROJECT_ID:  
      TRACING_CONFIG_DEBUG:                   false
      TRACING_CONFIG_SAMPLE_RATE:             0.1
      USER_PORT:                              8080
      SYSTEM_NAMESPACE:                       knative-serving
      METRICS_DOMAIN:                         knative.dev/internal/serving
      SERVING_READINESS_PROBE:                {"tcpSocket":{"port":8080,"host":"127.0.0.1"},"successThreshold":1}
      ENABLE_PROFILING:                       false
      SERVING_ENABLE_PROBE_REQUEST_LOG:       false
      METRICS_COLLECTOR_ADDRESS:              
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-j5h5h (ro)
  istio-proxy:
    Container ID:  docker://671ba8badeeb542c45bebb70d2f6904a8c3c568279bb5d36cddd25fa1803b6b0
    Image:         docker.io/istio/proxyv2:1.9.6
    Image ID:      docker-pullable://istio/proxyv2@sha256:87a9db561d2ef628deea7a4cbd0adf008a2f64355a2796e3b840d445b7e9cd3e
    Port:          15090/TCP
    Host Port:     0/TCP
    Args:
      proxy
      sidecar
      --domain
      $(POD_NAMESPACE).svc.cluster.local
      --serviceCluster
      torchscript-cifar-predictor-default-00001.$(POD_NAMESPACE)
      --proxyLogLevel=warning
      --proxyComponentLogLevel=misc:error
      --log_output_level=default:info
      --concurrency
      2
    State:          Running
      Started:      Wed, 06 Apr 2022 12:50:38 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  1Gi
    Requests:
      cpu:      10m
      memory:   40Mi
    Readiness:  http-get http://:15021/healthz/ready delay=1s timeout=3s period=2s #success=1 #failure=30
    Environment:
      JWT_POLICY:                    third-party-jwt
      PILOT_CERT_PROVIDER:           istiod
      CA_ADDR:                       istiod.istio-system.svc:15012
      POD_NAME:                      torchscript-cifar-predictor-default-00001-deployment-579fdpzzbj (v1:metadata.name)
      POD_NAMESPACE:                 dev (v1:metadata.namespace)
      INSTANCE_IP:                    (v1:status.podIP)
      SERVICE_ACCOUNT:                (v1:spec.serviceAccountName)
      HOST_IP:                        (v1:status.hostIP)
      CANONICAL_SERVICE:              (v1:metadata.labels['service.istio.io/canonical-name'])
      CANONICAL_REVISION:             (v1:metadata.labels['service.istio.io/canonical-revision'])
      PROXY_CONFIG:                  {"tracing":{}}

      ISTIO_META_POD_PORTS:          [
                                         {"name":"user-port","containerPort":8080,"protocol":"TCP"}
                                         ,{"name":"http-queueadm","containerPort":8022,"protocol":"TCP"}
                                         ,{"name":"http-autometric","containerPort":9090,"protocol":"TCP"}
                                         ,{"name":"http-usermetric","containerPort":9091,"protocol":"TCP"}
                                         ,{"name":"queue-port","containerPort":8012,"protocol":"TCP"}
                                     ]
      ISTIO_META_APP_CONTAINERS:     kfserving-container,queue-proxy
      ISTIO_META_CLUSTER_ID:         Kubernetes
      ISTIO_META_INTERCEPTION_MODE:  REDIRECT
      ISTIO_METAJSON_ANNOTATIONS:    {"autoscaling.knative.dev/class":"kpa.autoscaling.knative.dev","autoscaling.knative.dev/minScale":"1","internal.serving.kubeflow.org/storage-initializer-sourceuri":"gs://kfserving-examples/models/torchscript","kubernetes.io/psp":"eks.privileged","serving.knative.dev/creator":"system:serviceaccount:kubeflow:default"}

      ISTIO_META_WORKLOAD_NAME:      torchscript-cifar-predictor-default-00001-deployment
      ISTIO_META_OWNER:              kubernetes://apis/apps/v1/namespaces/dev/deployments/torchscript-cifar-predictor-default-00001-deployment
      ISTIO_META_MESH_ID:            cluster.local
      TRUST_DOMAIN:                  cluster.local
      ISTIO_KUBE_APP_PROBERS:        {"/app-health/queue-proxy/readyz":{"httpGet":{"path":"/","port":8012,"scheme":"HTTP","httpHeaders":[{"name":"K-Network-Probe","value":"queue"}]},"timeoutSeconds":1}}
    Mounts:
      /etc/istio/pod from istio-podinfo (rw)
      /etc/istio/proxy from istio-envoy (rw)
      /var/lib/istio/data from istio-data (rw)
      /var/run/secrets/istio from istiod-ca-cert (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-j5h5h (ro)
      /var/run/secrets/tokens from istio-token (rw)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  istio-envoy:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  istio-data:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  istio-podinfo:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.labels -> labels
      metadata.annotations -> annotations
      limits.cpu -> cpu-limit
      requests.cpu -> cpu-request
  istio-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  43200
  istiod-ca-cert:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      istio-ca-root-cert
    Optional:  false
  kube-api-access-j5h5h:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
  kfserving-provision-location:
    Type:        EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:      
    SizeLimit:   <unset>
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age    From               Message
  ----    ------     ----   ----               -------
  Normal  Scheduled  4m20s  default-scheduler  Successfully assigned dev/torchscript-cifar-predictor-default-00001-deployment-579fdpzzbj to ip-xxx.xxx.xx.xxx.us-xxxx-x.compute.internal
  Normal  Pulled     4m19s  kubelet            Container image "kfserving/storage-initializer:v0.6.1" already present on machine
  Normal  Created    4m19s  kubelet            Created container storage-initializer
  Normal  Started    4m19s  kubelet            Started container storage-initializer
  Normal  Pulling    4m17s  kubelet            Pulling image "docker.io/istio/proxyv2:1.9.6"
  Normal  Pulled     4m17s  kubelet            Successfully pulled image "docker.io/istio/proxyv2:1.9.6" in 223.710995ms
  Normal  Created    4m17s  kubelet            Created container istio-init
  Normal  Started    4m17s  kubelet            Started container istio-init
  Normal  Pulling    4m16s  kubelet            Pulling image "nvcr.io/nvidia/tritonserver@sha256:6a1ef84e93327fef53ac11379fa4e53082c9e228de5db24e0abdfc5aff5676cf"
  Normal  Pulled     85s    kubelet            Successfully pulled image "nvcr.io/nvidia/tritonserver@sha256:6a1ef84e93327fef53ac11379fa4e53082c9e228de5db24e0abdfc5aff5676cf" in 2m51.7544345s
  Normal  Created    69s    kubelet            Created container kfserving-container
  Normal  Started    69s    kubelet            Started container kfserving-container
  Normal  Pulled     69s    kubelet            Container image "gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:0b8e031170354950f339562" already present on machine
  Normal  Created    69s    kubelet            Created container queue-proxy
  Normal  Started    69s    kubelet            Started container queue-proxy
  Normal  Pulling    69s    kubelet            Pulling image "docker.io/istio/proxyv2:1.9.6"
  Normal  Pulled     68s    kubelet            Successfully pulled image "docker.io/istio/proxyv2:1.9.6" in 232.847416ms
  Normal  Created    68s    kubelet            Created container istio-proxy
  Normal  Started    68s    kubelet            Started container istio-proxy
goswamig commented 2 years ago

@rrrkharse Can you share the output from describing the web app pod? And also, which KFServing version is deployed in 1.4.1?

rrrkharse commented 2 years ago

Pod description

$kbl describe pod -n kubeflow kfserving-models-web-app-7884f597cf-cllp5

Name:         kfserving-models-web-app-7884f597cf-cllp5
Namespace:    kubeflow
Priority:     0
Node:         ip-192-168-29-17.us-west-2.compute.internal/192.168.29.17
Start Time:   Wed, 23 Mar 2022 14:47:12 -0700
Labels:       app=kfserving
              app.kubernetes.io/component=kfserving-models-web-app
              app.kubernetes.io/name=kfserving
              istio.io/rev=default
              kustomize.component=kfserving-models-web-app
              pod-template-hash=7884f597cf
              security.istio.io/tlsMode=istio
              service.istio.io/canonical-name=kfserving
              service.istio.io/canonical-revision=latest
Annotations:  kubectl.kubernetes.io/default-logs-container: kfserving-models-web-app
              kubernetes.io/psp: eks.privileged
              prometheus.io/path: /stats/prometheus
              prometheus.io/port: 15020
              prometheus.io/scrape: true
              sidecar.istio.io/inject: true
              sidecar.istio.io/status:
                {"initContainers":["istio-init"],"containers":["istio-proxy"],"volumes":["istio-envoy","istio-data","istio-podinfo","istio-token","istiod-...
Status:       Running
IP:           192.168.27.173
IPs:
  IP:           192.168.27.173
Controlled By:  ReplicaSet/kfserving-models-web-app-7884f597cf
Init Containers:
  istio-init:
    Container ID:  docker://487be96f9cc0d464ca525119b8d086dfa9f372c6228330431ce0477aaaebd522
    Image:         docker.io/istio/proxyv2:1.9.6
    Image ID:      docker-pullable://istio/proxyv2@sha256:87a9db561d2ef628deea7a4cbd0adf008a2f64355a2796e3b840d445b7e9cd3e
    Port:          <none>
    Host Port:     <none>
    Args:
      istio-iptables
      -p
      15001
      -z
      15006
      -u
      1337
      -m
      REDIRECT
      -i
      *
      -x

      -b
      *
      -d
      15090,15021,15020
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sat, 26 Mar 2022 19:03:54 -0700
      Finished:     Sat, 26 Mar 2022 19:03:54 -0700
    Ready:          True
    Restart Count:  1
    Limits:
      cpu:     2
      memory:  1Gi
    Requests:
      cpu:        10m
      memory:     40Mi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-lsqmk (ro)
Containers:
  kfserving-models-web-app:
    Container ID:   docker://f797d131d45fd0a0ac48337695c2ccefbb1458bd2c24aaaa8fe0bb20383e8e8c
    Image:          kfserving/models-web-app:v0.6.1
    Image ID:       docker-pullable://kfserving/models-web-app@sha256:f322d6ffdfa3b5a6fbbe3910c4cf4b5ad85871f20dbd2ed854d65f90e1fd8aba
    Port:           5000/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Sat, 26 Mar 2022 19:03:56 -0700
    Last State:     Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Wed, 23 Mar 2022 14:47:26 -0700
      Finished:     Sat, 26 Mar 2022 19:01:54 -0700
    Ready:          True
    Restart Count:  1
    Environment Variables from:
      kfserving-models-web-app-config-mtgm8bbd98  ConfigMap  Optional: false
    Environment:                                  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-lsqmk (ro)
  istio-proxy:
    Container ID:  docker://bd1285c601c575b262a649855d605aa38faac634e4e1dd5bbe9f66bb8d700f2f
    Image:         docker.io/istio/proxyv2:1.9.6
    Image ID:      docker-pullable://istio/proxyv2@sha256:87a9db561d2ef628deea7a4cbd0adf008a2f64355a2796e3b840d445b7e9cd3e
    Port:          15090/TCP
    Host Port:     0/TCP
    Args:
      proxy
      sidecar
      --domain
      $(POD_NAMESPACE).svc.cluster.local
      --serviceCluster
      kfserving.$(POD_NAMESPACE)
      --proxyLogLevel=warning
      --proxyComponentLogLevel=misc:error
      --log_output_level=default:info
      --concurrency
      2
    State:          Running
      Started:      Sat, 26 Mar 2022 19:03:57 -0700
    Last State:     Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Wed, 23 Mar 2022 14:47:27 -0700
      Finished:     Sat, 26 Mar 2022 19:01:54 -0700
    Ready:          True
    Restart Count:  1
    Limits:
      cpu:     2
      memory:  1Gi
    Requests:
      cpu:      10m
      memory:   40Mi
    Readiness:  http-get http://:15021/healthz/ready delay=1s timeout=3s period=2s #success=1 #failure=30
    Environment:
      JWT_POLICY:                    third-party-jwt
      PILOT_CERT_PROVIDER:           istiod
      CA_ADDR:                       istiod.istio-system.svc:15012
      POD_NAME:                      kfserving-models-web-app-7884f597cf-cllp5 (v1:metadata.name)
      POD_NAMESPACE:                 kubeflow (v1:metadata.namespace)
      INSTANCE_IP:                    (v1:status.podIP)
      SERVICE_ACCOUNT:                (v1:spec.serviceAccountName)
      HOST_IP:                        (v1:status.hostIP)
      CANONICAL_SERVICE:              (v1:metadata.labels['service.istio.io/canonical-name'])
      CANONICAL_REVISION:             (v1:metadata.labels['service.istio.io/canonical-revision'])
      PROXY_CONFIG:                  {"tracing":{}}

      ISTIO_META_POD_PORTS:          [
                                         {"containerPort":5000,"protocol":"TCP"}
                                     ]
      ISTIO_META_APP_CONTAINERS:     kfserving-models-web-app
      ISTIO_META_CLUSTER_ID:         Kubernetes
      ISTIO_META_INTERCEPTION_MODE:  REDIRECT
      ISTIO_METAJSON_ANNOTATIONS:    {"kubernetes.io/psp":"eks.privileged","sidecar.istio.io/inject":"true"}

      ISTIO_META_WORKLOAD_NAME:      kfserving-models-web-app
      ISTIO_META_OWNER:              kubernetes://apis/apps/v1/namespaces/kubeflow/deployments/kfserving-models-web-app
      ISTIO_META_MESH_ID:            cluster.local
      TRUST_DOMAIN:                  cluster.local
    Mounts:
      /etc/istio/pod from istio-podinfo (rw)
      /etc/istio/proxy from istio-envoy (rw)
      /var/lib/istio/data from istio-data (rw)
      /var/run/secrets/istio from istiod-ca-cert (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-lsqmk (ro)
      /var/run/secrets/tokens from istio-token (rw)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  istio-envoy:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  istio-data:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  istio-podinfo:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.labels -> labels
      metadata.annotations -> annotations
      limits.cpu -> cpu-limit
      requests.cpu -> cpu-request
  istio-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  43200
  istiod-ca-cert:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      istio-ca-root-cert
    Optional:  false
  kube-api-access-lsqmk:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>
rrrkharse commented 2 years ago

Kserve version from this commit: https://github.com/kubeflow/manifests/tree/ca48b5eb6b77e74c957bf4522e47ae82ed78cc4a/apps/kfserving/upstream/overlays/kubeflow

AlexandreBrown commented 2 years ago

Any update @goswamig @rrrkharse ?

goswamig commented 2 years ago

Hey @AlexandreBrown, what @rrrkharse has running has a different UI than what you're running.

If you look at @rrrkharse's screenshot, he does not have a Metrics tab. Are you running the same version as @rrrkharse with the same instructions?

AlexandreBrown commented 2 years ago

@goswamig I am running the same setup. The Metrics tab is available once you also set up Grafana. In our previous installation without Cognito (1.4.1 manifest), the Grafana setup didn't interfere with the Logs tab, therefore I doubt this is the issue. https://www.kubeflow.org/docs/external-add-ons/kserve/webapp/#metrics

  1. Download the monitoring-core.yaml and monitoring-metrics-prometheus.yaml files from the Knative 0.18 release (https://github.com/knative/serving/releases/tag/v0.18.0)

  2. kubectl apply -f monitoring-core.yaml

  3. kubectl apply -f monitoring-metrics-prometheus.yaml

  4. kubectl apply -f grafana-virtual-service.yaml

  5. kubectl apply -f grafana-config-map.yaml

  6. kubectl apply -f models-web-app-aut-policy.yaml

  7. kubectl rollout restart deployment/grafana -n knative-monitoring

AlexandreBrown commented 2 years ago

@goswamig @rrrkharse I am using this commit from the aws main branch.

AWS_MANIFESTS_BUILD=038015b47104dcff5b0c596893512e4e75f5a7bd

That's maybe the only difference I could see.

Could you try to reproduce on this aws main branch commit? This would help a lot in troubleshooting the source of the issue.

goswamig commented 2 years ago

@AlexandreBrown, thanks for providing more details, we will try to reproduce it. Sorry for closing the issue inadvertently.

goswamig commented 2 years ago

Is there a way you can delete and re-deploy the web-app deployment? I tried today and couldn't reproduce it.

AlexandreBrown commented 2 years ago

I deleted the deployment and re-ran the Kubeflow install so it recreates it; still the same issue :(

goswamig commented 2 years ago

Do you still have the same error in the web-app deployment pod? Can you send the full log from the web-app deployment again?

I believe you can see the logs in the console using kubectl logs ..., but they fail to load in the web app, right? So I suspect the web-app pod is bailing out in your case on a pod with missing labels; at least, that was the error in the previous log you posted.

AlexandreBrown commented 2 years ago

@goswamig Yes, still the same error from the kfserving models web app.
Here are the logs (last 30 lines): logs.txt

I believe you can see the logs in the console using kubectl logs ..., but they fail to load in the web app, right?

Exactly, I can see the InferenceService logs when using kubectl but not via the model web app UI.

(screenshots attached)

EDIT: I'm not getting any log updates when making a request. What I mean is, the logs I see are from when the model server was first spawned, but when sending requests to the model server, kubectl logs ... -c kfserving-container doesn't show anything new.

Basically, what I'm trying to say is that the kubectl logs of the kfserving-container are not updated, so maybe it's not working via the CLI after all (meaning the issue is perhaps not just with the model web app UI).

kubectl logs ... -c istio-proxy shows logs about requests (timestamps, IDs and status codes), but that is not the type of logs reflected by the model web app anyway.

goswamig commented 2 years ago

@AlexandreBrown but previously you mentioned here that you can see the logs in the terminal by running the command

kubectl logs -n dev sklearn-iris-predictor-default-00001-deployment-5558697cb-t8q7b -c kfserving-container

Basically this is the log you'll see in web-app.

did anything change since then?

AlexandreBrown commented 2 years ago

@goswamig Nothing has changed. I previously mentioned that I was seeing the logs, but the result was like the photos from my last message. So the model server logs are indeed accessible, but they only show what is printed when the model server successfully comes alive. I had not tried to check the logs after sending requests; I tried today and realized that the logs I thought were normal from kubectl actually show no activity past initialization.

So yes, kubectl logs shows something, but it does not change past the initialization logs.

goswamig commented 2 years ago

So I tried again today, and the logs we see are as follows; this is the same as what you should see in the pod.

 /usr/local/lib/python3.7/site-packages/sklearn/linear_model/randomized_l1.py:580: DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`. To silence this warning, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here. 
 Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations 
   eps=4 * np.finfo(np.float).eps, n_jobs=None, 
 [I 220424 02:03:46 kfserver:151] Registering model: sklearn-iris 
 [I 220424 02:03:46 kfserver:121] Setting asyncio max_workers as 5 
 [I 220424 02:03:46 kfserver:128] Listening on port 8080 
 [I 220424 02:03:46 kfserver:130] Will fork 1 workers 
 [I 220429 21:15:35 web:2243] 200 POST /v1/models/sklearn-iris:predict (127.0.0.1) 1.38ms 
 [I 220429 21:15:50 web:2243] 200 POST /v1/models/sklearn-iris:predict (127.0.0.1) 0.84ms 
 [I 220429 21:16:00 web:2243] 200 POST /v1/models/sklearn-iris:predict (127.0.0.1) 0.76ms 
 [I 220429 21:17:50 web:2243] 200 POST /v1/models/sklearn-iris:predict (127.0.0.1) 0.74ms 
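
For reference, the predict calls that produce those last log lines look roughly like this minimal sketch, assuming the KFServing V1 REST protocol; the endpoint, Host header and input values are placeholders, and depending on your setup an auth cookie or token may also be required:

import requests

# Placeholder values -- substitute your own endpoint, domain and model input.
url = "https://<public dns endpoint>/v1/models/sklearn-iris:predict"
headers = {"Host": "sklearn-iris.<profile namespace>.<your domain>"}  # Knative routes on the Host header
payload = {"instances": [[6.8, 2.8, 4.8, 1.4]]}  # one iris sample, V1 protocol

resp = requests.post(url, json=payload, headers=headers, timeout=30)
print(resp.status_code)  # 200, matching the "200 POST /v1/models/sklearn-iris:predict" lines above
print(resp.json())       # {"predictions": [...]}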
AlexandreBrown commented 2 years ago

@goswamig My logs are different from yours.

Mine stop at the following:


I0428 02:43:54.056516 1 grpc_server.cc:3983] Started GRPCInferenceService at 0.0.0.0:9000
I0428 02:43:54.056751 1 http_server.cc:2717] Started HTTPService at 0.0.0.0:8080
I0428 02:43:54.097922 1 http_server.cc:2736] Started Metrics Service at 0.0.0.0:8002

Then nothing new is logged even after new requests.

So perhaps this suggests an issue that is deeper than just the Kubeflow UI.

goswamig commented 2 years ago

Yeah, so are you able to get the response back?

AlexandreBrown commented 2 years ago

@goswamig Yes the inference works, we can query the model server and get a response back (HTTP 200) but the logs are not updated. When we run kubectl logs, nothing changes from the logs in my last message, even after thousands of requests.

surajkota commented 2 years ago

Hi Alex, does this issue still exist?

Do you face this issue with 1.5 as well? I have been using 1.5 for the past few months, and the logs show up on the dashboard.

AlexandreBrown commented 2 years ago

Hi @surajkota , the issue remains for us with our current KF 1.4.1 install.

I have not tried inference with KF 1.5, will report back after trying it.

AlexandreBrown commented 1 year ago

Just tested Kubeflow 1.6.1 using the AWS repo on a test cluster (rds-s3-cognito) and I can see the logs.

(screenshot attached)

Closing this issue as a "cannot reproduce".

surajkota commented 1 year ago

Thanks

AlexandreBrown commented 1 year ago

Update: I was able to reproduce this on Kubeflow 1.7.0.
After some tests, it looks like the cause is autoscaling to 0.
When setting minReplicas to 0, the inference service will scale down and therefore KServe won't be able to fetch any logs.
Pretty sure this is normal and expected: the Logs tab shows live logs, so if the inference service is no longer live, there are no logs to show.
Now the reason why I had this issue is clear to me.
Setting minReplicas to 1 shows the logs (but in our case we prefer to scale to and from 0, so we leave it at 0).
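
For example, a minimal sketch (assuming the serving.kserve.io/v1beta1 API used by Kubeflow 1.7; the name, namespace and storageUri are placeholders) of pinning the predictor to at least one replica so the live logs stay available:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
  namespace: dev
spec:
  predictor:
    minReplicas: 1   # keep at least one pod running so the Logs tab has something to read
    sklearn:
      storageUri: "gs://<your-bucket>/<path-to-model>"   # placeholder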
Hopefully this can help someone
Cheers