Closed. NohaIhab closed this issue 1 month ago.
Bug Description

While working on https://github.com/canonical/bundle-kubeflow/issues/1077, I came across this issue with the kserve agent ROCK. It is a specific case where the InferenceService creates the agent container and tries to pass arguments to it. This is the same issue we were facing in https://github.com/canonical/katib-rocks/issues/49. The InferenceService Pod description is:
kubectl describe po llama3-8b-instruct-1xgpu-predictor-00001-deployment-c75859sbgjk Name: llama3-8b-instruct-1xgpu-predictor-00001-deployment-c75859sbgjk Namespace: admin Priority: 0 Service Account: default Node: ip-172-31-7-11/172.31.7.11 Start Time: Wed, 16 Oct 2024 07:47:26 +0000 Labels: app=llama3-8b-instruct-1xgpu-predictor-00001 component=predictor pod-template-hash=c75859f7f service.istio.io/canonical-name=llama3-8b-instruct-1xgpu-predictor service.istio.io/canonical-revision=llama3-8b-instruct-1xgpu-predictor-00001 serving.knative.dev/configuration=llama3-8b-instruct-1xgpu-predictor serving.knative.dev/configurationGeneration=1 serving.knative.dev/configurationUID=1b060239-b43b-4cdf-aeb4-0209df82f26e serving.knative.dev/revision=llama3-8b-instruct-1xgpu-predictor-00001 serving.knative.dev/revisionUID=fb07d59d-9ded-48b5-b691-ac774b1b0cfe serving.knative.dev/service=llama3-8b-instruct-1xgpu-predictor serving.knative.dev/serviceUID=e426f02f-53d4-4cef-ae21-6ec406751c98 serving.kserve.io/inferenceservice=llama3-8b-instruct-1xgpu Annotations: autoscaling.knative.dev/class: kpa.autoscaling.knative.dev autoscaling.knative.dev/min-scale: 1 autoscaling.knative.dev/target: 10 cni.projectcalico.org/containerID: 0250b076c9c02e5c09d0b6c52bc8418cacd1565db3a2fa1d0472c94468dcbe81 cni.projectcalico.org/podIP: 10.1.32.188/32 cni.projectcalico.org/podIPs: 10.1.32.188/32 internal.serving.kserve.io/agent: true internal.serving.kserve.io/configMountPath: /mnt/configs internal.serving.kserve.io/configVolumeName: modelconfig-llama3-8b-instruct-1xgpu-0 internal.serving.kserve.io/modelDir: /mnt/models prometheus.io/path: /metrics prometheus.io/port: 9088 prometheus.kserve.io/path: /metrics prometheus.kserve.io/port: 8000 serving.knative.dev/creator: system:serviceaccount:kubeflow:kserve-controller serving.kserve.io/enable-metric-aggregation: true serving.kserve.io/enable-prometheus-scraping: true sidecar.istio.io/inject: false Status: Running IP: 10.1.32.188 IPs: IP: 10.1.32.188 Controlled By: ReplicaSet/llama3-8b-instruct-1xgpu-predictor-00001-deployment-c75859f7f Containers: kserve-container: Container ID: containerd://27b50f3895866f88f327ffec54335426355c800abc0064e6d3fe7e031fc5f71e Image: nvcr.io/nim/meta/llama3-8b-instruct:1.0.0 Image ID: nvcr.io/nim/meta/llama3-8b-instruct@sha256:7fe6071923b547edd9fba87c891a362ea0b4a88794b8a422d63127e54caa6ef7 Port: 8000/TCP Host Port: 0/TCP State: Running Started: Wed, 16 Oct 2024 07:47:26 +0000 Ready: True Restart Count: 0 Limits: cpu: 1 memory: 16Gi nvidia.com/gpu: 1 Requests: cpu: 1 memory: 16Gi nvidia.com/gpu: 1 Environment: NIM_CACHE_PATH: /tmp NGC_API_KEY: <set to the key 'NGC_API_KEY' in secret 'ngc-nim-secret'> Optional: false PORT: 8000 K_REVISION: llama3-8b-instruct-1xgpu-predictor-00001 K_CONFIGURATION: llama3-8b-instruct-1xgpu-predictor K_SERVICE: llama3-8b-instruct-1xgpu-predictor Mounts: /dev/shm from dshm (rw) /mnt/models from model-dir (rw) /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-f9995 (ro) queue-proxy: Container ID: containerd://7d020e01976ddd4eaf77b385c08cb7f2a87b2ad8b838e7796c216dfd66f36eba Image: gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:89e6f90141f1b63405883fbb4de0d3b6d80f8b77e530904c4d29bdcd1dc5a167 Image ID: gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:89e6f90141f1b63405883fbb4de0d3b6d80f8b77e530904c4d29bdcd1dc5a167 Ports: 8022/TCP, 9090/TCP, 9091/TCP, 8012/TCP, 8112/TCP, 9088/TCP Host Ports: 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP SeccompProfile: RuntimeDefault State: Running 
Started: Wed, 16 Oct 2024 07:47:27 +0000 Ready: False Restart Count: 0 Requests: cpu: 25m Readiness: http-get http://:8012/ delay=0s timeout=1s period=10s #success=1 #failure=3 Environment: SERVING_NAMESPACE: admin SERVING_SERVICE: llama3-8b-instruct-1xgpu-predictor SERVING_CONFIGURATION: llama3-8b-instruct-1xgpu-predictor SERVING_REVISION: llama3-8b-instruct-1xgpu-predictor-00001 QUEUE_SERVING_PORT: 8012 QUEUE_SERVING_TLS_PORT: 8112 CONTAINER_CONCURRENCY: 0 REVISION_TIMEOUT_SECONDS: 300 REVISION_RESPONSE_START_TIMEOUT_SECONDS: 0 REVISION_IDLE_TIMEOUT_SECONDS: 0 SERVING_POD: llama3-8b-instruct-1xgpu-predictor-00001-deployment-c75859sbgjk (v1:metadata.name) SERVING_POD_IP: (v1:status.podIP) SERVING_LOGGING_CONFIG: SERVING_LOGGING_LEVEL: SERVING_REQUEST_LOG_TEMPLATE: {"httpRequest": {"requestMethod": "{{.Request.Method}}", "requestUrl": "{{js .Request.RequestURI}}", "requestSize": "{{.Request.ContentLength}}", "status": {{.Response.Code}}, "responseSize": "{{.Response.Size}}", "userAgent": "{{js .Request.UserAgent}}", "remoteIp": "{{js .Request.RemoteAddr}}", "serverIp": "{{.Revision.PodIP}}", "referer": "{{js .Request.Referer}}", "latency": "{{.Response.Latency}}s", "protocol": "{{.Request.Proto}}"}, "traceId": "{{index .Request.Header "X-B3-Traceid"}}"} SERVING_ENABLE_REQUEST_LOG: false SERVING_REQUEST_METRICS_BACKEND: prometheus SERVING_REQUEST_METRICS_REPORTING_PERIOD_SECONDS: 5 TRACING_CONFIG_BACKEND: none TRACING_CONFIG_ZIPKIN_ENDPOINT: TRACING_CONFIG_DEBUG: false TRACING_CONFIG_SAMPLE_RATE: 0.1 USER_PORT: 9081 SYSTEM_NAMESPACE: knative-serving METRICS_DOMAIN: knative.dev/internal/serving SERVING_READINESS_PROBE: {"tcpSocket":{"port":8000,"host":"127.0.0.1"},"successThreshold":1} ENABLE_PROFILING: false SERVING_ENABLE_PROBE_REQUEST_LOG: false METRICS_COLLECTOR_ADDRESS: HOST_IP: (v1:status.hostIP) ENABLE_HTTP2_AUTO_DETECTION: false ROOT_CA: KSERVE_CONTAINER_PROMETHEUS_METRICS_PORT: 8000 KSERVE_CONTAINER_PROMETHEUS_METRICS_PATH: /metrics AGGREGATE_PROMETHEUS_METRICS_PORT: 9088 KSERVE_CONTAINER_PROMETHEUS_METRICS_PORT: 8000 KSERVE_CONTAINER_PROMETHEUS_METRICS_PATH: /metrics AGGREGATE_PROMETHEUS_METRICS_PORT: 9088 Mounts: /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-f9995 (ro) agent: Container ID: containerd://c58427baedded1130cec91d74e203c88ef919f62819d665ee1b61171158eb947 Image: charmedkubeflow/kserve-agent:0.13.0-17792da Image ID: docker.io/charmedkubeflow/kserve-agent@sha256:00825a7816ffffcbb1b262d4f47004182f788357ea0c5af14d1ae1d4a26620d1 Port: 9081/TCP Host Port: 0/TCP Args: --enable-puller --config-dir /mnt/configs --model-dir /mnt/models --component-port 8000 State: Waiting Reason: CrashLoopBackOff Last State: Terminated Reason: Error Exit Code: 1 Started: Wed, 16 Oct 2024 07:47:49 +0000 Finished: Wed, 16 Oct 2024 07:47:49 +0000 Ready: False Restart Count: 2 Limits: cpu: 1 memory: 1Gi Requests: cpu: 100m memory: 100Mi Readiness: http-get http://:9081/ delay=0s timeout=1s period=10s #success=1 #failure=3 Environment: SERVING_NAMESPACE: admin SERVING_SERVICE: llama3-8b-instruct-1xgpu-predictor SERVING_CONFIGURATION: llama3-8b-instruct-1xgpu-predictor SERVING_REVISION: llama3-8b-instruct-1xgpu-predictor-00001 QUEUE_SERVING_PORT: 8012 QUEUE_SERVING_TLS_PORT: 8112 CONTAINER_CONCURRENCY: 0 REVISION_TIMEOUT_SECONDS: 300 REVISION_RESPONSE_START_TIMEOUT_SECONDS: 0 REVISION_IDLE_TIMEOUT_SECONDS: 0 SERVING_POD: llama3-8b-instruct-1xgpu-predictor-00001-deployment-c75859sbgjk (v1:metadata.name) SERVING_POD_IP: (v1:status.podIP) SERVING_LOGGING_CONFIG: SERVING_LOGGING_LEVEL: 
SERVING_REQUEST_LOG_TEMPLATE: {"httpRequest": {"requestMethod": "{{.Request.Method}}", "requestUrl": "{{js .Request.RequestURI}}", "requestSize": "{{.Request.ContentLength}}", "status": {{.Response.Code}}, "responseSize": "{{.Response.Size}}", "userAgent": "{{js .Request.UserAgent}}", "remoteIp": "{{js .Request.RemoteAddr}}", "serverIp": "{{.Revision.PodIP}}", "referer": "{{js .Request.Referer}}", "latency": "{{.Response.Latency}}s", "protocol": "{{.Request.Proto}}"}, "traceId": "{{index .Request.Header "X-B3-Traceid"}}"} SERVING_ENABLE_REQUEST_LOG: false SERVING_REQUEST_METRICS_BACKEND: prometheus SERVING_REQUEST_METRICS_REPORTING_PERIOD_SECONDS: 5 TRACING_CONFIG_BACKEND: none TRACING_CONFIG_ZIPKIN_ENDPOINT: TRACING_CONFIG_DEBUG: false TRACING_CONFIG_SAMPLE_RATE: 0.1 USER_PORT: 8000 SYSTEM_NAMESPACE: knative-serving METRICS_DOMAIN: knative.dev/internal/serving SERVING_READINESS_PROBE: {"tcpSocket":{"port":8000,"host":"127.0.0.1"},"successThreshold":1} ENABLE_PROFILING: false SERVING_ENABLE_PROBE_REQUEST_LOG: false METRICS_COLLECTOR_ADDRESS: HOST_IP: (v1:status.hostIP) ENABLE_HTTP2_AUTO_DETECTION: false ROOT_CA: Mounts: /mnt/configs from model-config (rw) /mnt/models from model-dir (rw) /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-f9995 (ro) Conditions: Type Status PodReadyToStartContainers True Initialized True Ready False ContainersReady False PodScheduled True Volumes: dshm: Type: EmptyDir (a temporary directory that shares a pod's lifetime) Medium: Memory SizeLimit: 16Gi kube-api-access-f9995: Type: Projected (a volume that contains injected data from multiple sources) TokenExpirationSeconds: 3607 ConfigMapName: kube-root-ca.crt ConfigMapOptional: <nil> DownwardAPI: true model-dir: Type: EmptyDir (a temporary directory that shares a pod's lifetime) Medium: SizeLimit: <unset> model-config: Type: ConfigMap (a volume populated by a ConfigMap) Name: modelconfig-llama3-8b-instruct-1xgpu-0 Optional: false QoS Class: Burstable Node-Selectors: <none> Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s node.kubernetes.io/unreachable:NoExecute op=Exists for 300s Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal Scheduled 40s default-scheduler Successfully assigned admin/llama3-8b-instruct-1xgpu-predictor-00001-deployment-c75859sbgjk to ip-172-31-7-11 Normal Pulled 40s kubelet Container image "nvcr.io/nim/meta/llama3-8b-instruct:1.0.0" already present on machine Normal Created 40s kubelet Created container kserve-container Normal Started 40s kubelet Started container kserve-container Normal Pulled 40s kubelet Container image "gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:89e6f90141f1b63405883fbb4de0d3b6d80f8b77e530904c4d29bdcd1dc5a167" already present on machine Normal Created 40s kubelet Created container queue-proxy Normal Started 39s kubelet Started container queue-proxy Warning BackOff 30s (x3 over 38s) kubelet Back-off restarting failed container agent in pod llama3-8b-instruct-1xgpu-predictor-00001-deployment-c75859sbgjk_admin(2d8c6673-baef-4eae-baac-3874d54357d2) Warning Unhealthy 29s kubelet Readiness probe failed: HTTP probe failed with statuscode: 503 Normal Pulled 18s (x3 over 39s) kubelet Container image "charmedkubeflow/kserve-agent:0.13.0-17792da" already present on machine Normal Created 17s (x3 over 39s) kubelet Created container agent Normal Started 17s (x3 over 39s) kubelet Started container agent Warning Unhealthy 16s (x6 over 38s) kubelet Readiness probe failed: Get "http://10.1.32.188:8012/": 
context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Notice in the agent container section that the args are:
Args: --enable-puller --config-dir /mnt/configs --model-dir /mnt/models --component-port 8000
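These args can also be read straight off the live Pod spec, which is a quick way to confirm exactly what the KServe controller injected (the pod name and namespace are the ones from the description above), for example:

kubectl get pod -n admin llama3-8b-instruct-1xgpu-predictor-00001-deployment-c75859sbgjk \
  -o jsonpath='{.spec.containers[?(@.name=="agent")].args}'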
The agent container is in CrashLoopBackOff status with the error:
error: unknown flag `enable-puller'
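For reference, the message above comes from the crashed agent container's logs, and the same rejection should be reproducible outside the cluster by handing the identical args to the ROCK directly. This is a hedged sketch, assuming Docker is available and that the image's entrypoint behaves the same locally as it does in the Pod:

# Pull the error from the last crashed run of the agent container:
kubectl logs -n admin llama3-8b-instruct-1xgpu-predictor-00001-deployment-c75859sbgjk -c agent --previous

# Hand the same args to the ROCK locally; this is expected to fail with the same
# "unknown flag" error, which would point at the image's entrypoint (not the agent
# binary) doing the flag parsing, the same pattern as canonical/katib-rocks#49:
docker run --rm charmedkubeflow/kserve-agent:0.13.0-17792da \
  --enable-puller --config-dir /mnt/configs --model-dir /mnt/models --component-port 8000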
Environment

Microk8s 1.29/stable, Juju 3.4/stable
Additional Context

No response
Thank you for reporting us your feedback!
The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-6449.
This message was autogenerated