canonical / microk8s

MicroK8s is a small, fast, single-package Kubernetes for datacenters and the edge.
https://microk8s.io
Apache License 2.0

Problem running a container with a mounted emptyDir with memory medium #4405

Open leandregagnonlewis opened 7 months ago

leandregagnonlewis commented 7 months ago

Summary

I am running MicroK8s on a single Ubuntu VM with 32 Gi of RAM, so memory is not an issue on the machine side. I am trying to deploy a single replica of the NVIDIA Triton Inference Server, which serves ML models. I am migrating from EKS to an on-prem solution, and I am using exactly the same deployment config I used on EKS.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: release-name-triton-inference-server
  namespace: default
  labels:
    app: triton-inference-server
    chart: triton-inference-server-1.0.0
    release: release-name
    heritage: Helm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: triton-inference-server
      release: release-name
  template:
    metadata:
      labels:
        app: triton-inference-server
        release: release-name

    spec:
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
          sizeLimit: "2Gi"

      containers:
        - name: triton-inference-server
          image: "nvcr.io/nvidia/tritonserver:23.05-py3"
          imagePullPolicy: IfNotPresent

          resources:
            limits:
              nvidia.com/gpu: 0

          args: ["tritonserver", "--model-store=s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
                 "--model-control-mode=explicit",
                 "--load-model=vgg16_preprocessing",
                 "--load-model=bls_clust_v1",
                 "--log-verbose=2"]

          env:
          - name: AWS_DEFAULT_REGION
            valueFrom:
              secretKeyRef:
                name: triton-aws-credentials
                key: AWS_DEFAULT_REGION
          - name: AWS_ACCESS_KEY_ID
            valueFrom:
              secretKeyRef:
                name: triton-aws-credentials
                key: AWS_ACCESS_KEY_ID
          - name: AWS_SECRET_ACCESS_KEY
            valueFrom:
              secretKeyRef:
                name: triton-aws-credentials
                key: AWS_SECRET_ACCESS_KEY

          ports:
            - containerPort: 8000
              name: http
            - containerPort: 8001
              name: grpc
            - containerPort: 8002
              name: metrics
          livenessProbe:
            httpGet:
              path: /v2/health/live
              port: http
          readinessProbe:
            initialDelaySeconds: 5
            periodSeconds: 5
            httpGet:
              path: /v2/health/ready
              port: http

          volumeMounts:
            - mountPath: /dev/shm
              name: dshm

      securityContext:
        runAsUser: 1000
        fsGroup: 1000

Now the pod starts as usual. I can see in the logs that the files on S3 are properly downloaded, so the problem is not with the credentials. But after a few seconds the pod crashes without any indication.
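When a pod dies silently like this, the termination reason is often recorded by Kubernetes even when the container logs show nothing. A few standard commands (assuming the `app=triton-inference-server` label from the manifest above and the `default` namespace) that may surface it:

```shell
# Find the pod created by the Deployment (label taken from the manifest above)
microk8s kubectl get pods -n default -l app=triton-inference-server

# Inspect the last termination state: an OOMKilled reason or a non-zero
# exit code here narrows things down considerably
microk8s kubectl describe pod -n default -l app=triton-inference-server

# Cluster events around the crash (probe failures, evictions, etc.)
microk8s kubectl get events -n default --sort-by=.lastTimestamp

# Logs from the previous (crashed) container instance, if any were captured
microk8s kubectl logs -n default -l app=triton-inference-server --previous
```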

What Should Happen Instead?

The server should become healthy and wait for inference requests to serve. I have tried deploying the server using Docker on the same VM and it worked flawlessly, so I suspect the problem is with MicroK8s.

Here is my compose.yml:

services:
  triton:
    image: "nvcr.io/nvidia/tritonserver:23.05-py3"
    ports:
      - 8000:8000
      - 8001:8001
      - 8002:8002
    environment:
      AWS_DEFAULT_REGION: xxxxxxxxxxx
      AWS_ACCESS_KEY_ID: xxxxxxxxxxxxxxxxxxxx
      AWS_SECRET_ACCESS_KEY: xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
    command: ["tritonserver",
                 "--model-store=s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
                 "--model-control-mode=explicit",
                 "--load-model=vgg16_preprocessing",
                 "--load-model=bls_clust_v1",
                 "--log-verbose=2"]
    shm_size: 2g
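The `shm_size: 2g` line here is what the memory-backed `emptyDir` in the Kubernetes manifest is meant to replicate. A quick sanity check (a sketch; the `triton` service name is from the compose file above, and `exec deploy/...` resolves to one pod of the Deployment) is to confirm the two environments actually present the same `/dev/shm`:

```shell
# Inside the Docker container: /dev/shm should be a ~2G tmpfs
docker compose exec triton df -h /dev/shm

# Inside the MicroK8s pod: the medium: Memory emptyDir should appear as a
# tmpfs as well, capped at the 2Gi sizeLimit from the manifest
microk8s kubectl exec deploy/release-name-triton-inference-server -- df -h /dev/shm
```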

Here are the logs; I have indicated the crash point when running on MicroK8s:

triton-docker-triton-1  | 
triton-docker-triton-1  | =============================
triton-docker-triton-1  | == Triton Inference Server ==
triton-docker-triton-1  | =============================
triton-docker-triton-1  | 
triton-docker-triton-1  | 
triton-docker-triton-1  | NVIDIA Release 23.05 (build 61161506)
triton-docker-triton-1  | Triton Server Version 2.34.0
triton-docker-triton-1  | 
triton-docker-triton-1  | 
triton-docker-triton-1  | Copyright (c) 2018-2023, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
triton-docker-triton-1  | 
triton-docker-triton-1  | 
triton-docker-triton-1  | Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
triton-docker-triton-1  | 
triton-docker-triton-1  | This container image and its contents are governed by the NVIDIA Deep Learning Container License.
triton-docker-triton-1  | By pulling and using the container, you accept the terms and conditions of this license:
triton-docker-triton-1  | https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
triton-docker-triton-1  | 
triton-docker-triton-1  | 
triton-docker-triton-1  | WARNING: The NVIDIA Driver was not detected.  GPU functionality will not be available.
triton-docker-triton-1  |    Use the NVIDIA Container Toolkit to start this container with GPU support; see
triton-docker-triton-1  |    https://docs.nvidia.com/datacenter/cloud-native/ .
triton-docker-triton-1  | 
triton-docker-triton-1  | 
triton-docker-triton-1  | 
triton-docker-triton-1  | I0209 16:52:15.556036 1 cache_manager.cc:478] Create CacheManager with cache_dir: '/opt/tritonserver/caches'
triton-docker-triton-1  | W0209 16:52:15.556408 1 pinned_memory_manager.cc:236] Unable to allocate pinned system memory, pinned memory pool will not be available: CUDA driver version is insufficient for CUDA runtime version
triton-docker-triton-1  | I0209 16:52:15.556441 1 cuda_memory_manager.cc:115] CUDA memory pool disabled
triton-docker-triton-1  | I0209 16:52:15.556458 1 filesystem.cc:2304] TRITON_CLOUD_CREDENTIAL_PATH environment variable is not set, reading from environment variables
triton-docker-triton-1  | I0209 16:52:15.556479 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository
triton-docker-triton-1  | I0209 16:52:16.020480 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/vgg16_preprocessing
triton-docker-triton-1  | I0209 16:52:16.115664 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/bls_clust_v1
triton-docker-triton-1  | I0209 16:52:16.238490 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/bls_clust_v1
triton-docker-triton-1  | I0209 16:52:16.336685 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/bls_clust_v1
triton-docker-triton-1  | I0209 16:52:16.436562 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/bls_clust_v1
triton-docker-triton-1  | I0209 16:52:16.508158 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/bls_clust_v1/1
triton-docker-triton-1  | I0209 16:52:16.603639 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/bls_clust_v1/1
triton-docker-triton-1  | I0209 16:52:16.695294 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/bls_clust_v1/1
triton-docker-triton-1  | I0209 16:52:16.759803 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/bls_clust_v1/1/model.py
triton-docker-triton-1  | I0209 16:52:16.850447 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/bls_clust_v1/1/model.py
triton-docker-triton-1  | I0209 16:52:16.973553 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/bls_clust_v1/bls-clustering-env.tar.gz
triton-docker-triton-1  | I0209 16:52:17.062773 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/bls_clust_v1/bls-clustering-env.tar.gz
triton-docker-triton-1  | I0209 16:52:17.190986 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/bls_clust_v1/config.pbtxt
triton-docker-triton-1  | I0209 16:52:17.283851 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/bls_clust_v1/config.pbtxt
triton-docker-triton-1  | I0209 16:52:17.404671 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/bls_clust_v1/environment.yml
triton-docker-triton-1  | I0209 16:52:17.503212 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/bls_clust_v1/environment.yml
triton-docker-triton-1  | I0209 16:52:17.642365 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/bls_clust_v1/kmeans_multi_cluster_crop_6_v1.npy
triton-docker-triton-1  | I0209 16:52:17.732900 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/bls_clust_v1/kmeans_multi_cluster_crop_6_v1.npy
triton-docker-triton-1  | I0209 16:52:17.857739 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/bls_clust_v1/kmeans_multi_cluster_positions_crop.npy
triton-docker-triton-1  | I0209 16:52:17.950181 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/bls_clust_v1/kmeans_multi_cluster_positions_crop.npy
triton-docker-triton-1  | I0209 16:52:18.074596 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/bls_clust_v1/test.py
triton-docker-triton-1  | I0209 16:52:18.164126 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/bls_clust_v1/test.py
triton-docker-triton-1  | I0209 16:52:18.316061 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/bls_clust_v1/config.pbtxt
triton-docker-triton-1  | I0209 16:52:18.460108 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/bls_clust_v1/config.pbtxt
triton-docker-triton-1  | I0209 16:52:18.633279 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/bls_clust_v1
triton-docker-triton-1  | I0209 16:52:19.150042 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/bls_clust_v1/1
triton-docker-triton-1  | I0209 16:52:19.373893 1 model_config_utils.cc:647] Server side auto-completed config: name: "bls_clust_v1"
triton-docker-triton-1  | input {
triton-docker-triton-1  |   name: "INPUT"
triton-docker-triton-1  |   data_type: TYPE_UINT8
triton-docker-triton-1  |   dims: -1
triton-docker-triton-1  | }
triton-docker-triton-1  | output {
triton-docker-triton-1  |   name: "OUTPUT"
triton-docker-triton-1  |   data_type: TYPE_FP32
triton-docker-triton-1  |   dims: 1
triton-docker-triton-1  | }
triton-docker-triton-1  | output {
triton-docker-triton-1  |   name: "dists"
triton-docker-triton-1  |   data_type: TYPE_FP32
triton-docker-triton-1  |   dims: 5
triton-docker-triton-1  | }
triton-docker-triton-1  | instance_group {
triton-docker-triton-1  |   count: 2
triton-docker-triton-1  |   kind: KIND_CPU
triton-docker-triton-1  | }
triton-docker-triton-1  | default_model_filename: "model.py"
triton-docker-triton-1  | sequence_batching {
triton-docker-triton-1  |   max_sequence_idle_microseconds: 60000000
triton-docker-triton-1  |   direct {
triton-docker-triton-1  |   }
triton-docker-triton-1  | }
triton-docker-triton-1  | parameters {
triton-docker-triton-1  |   key: "EXECUTION_ENV_PATH"
triton-docker-triton-1  |   value {
triton-docker-triton-1  |     string_value: "$$TRITON_MODEL_DIRECTORY/bls-clustering-env.tar.gz"
triton-docker-triton-1  |   }
triton-docker-triton-1  | }
triton-docker-triton-1  | backend: "python"
triton-docker-triton-1  | 
triton-docker-triton-1  | I0209 16:52:19.373993 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/vgg16_preprocessing
triton-docker-triton-1  | 
triton-docker-triton-1  | I0209 16:52:19.474971 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/vgg16_preprocessing
triton-docker-triton-1  | I0209 16:52:19.577110 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/vgg16_preprocessing
triton-docker-triton-1  | I0209 16:52:19.678896 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/vgg16_preprocessing/1
triton-docker-triton-1  | I0209 16:52:19.806650 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/vgg16_preprocessing/1
triton-docker-triton-1  | I0209 16:52:19.914267 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/vgg16_preprocessing/1
triton-docker-triton-1  | I0209 16:52:19.989027 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/vgg16_preprocessing/1/model.savedmodel
triton-docker-triton-1  | I0209 16:52:20.089560 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/vgg16_preprocessing/1/model.savedmodel
triton-docker-triton-1  | I0209 16:52:20.193213 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/vgg16_preprocessing/1/model.savedmodel
triton-docker-triton-1  | I0209 16:52:20.262524 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/vgg16_preprocessing/1/model.savedmodel/fingerprint.pb
triton-docker-triton-1  | I0209 16:52:20.368017 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/vgg16_preprocessing/1/model.savedmodel/fingerprint.pb
triton-docker-triton-1  | I0209 16:52:20.492014 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/vgg16_preprocessing/1/model.savedmodel/keras_metadata.pb
triton-docker-triton-1  | I0209 16:52:20.582314 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/vgg16_preprocessing/1/model.savedmodel/keras_metadata.pb
triton-docker-triton-1  | I0209 16:52:20.706826 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/vgg16_preprocessing/1/model.savedmodel/saved_model.pb
triton-docker-triton-1  | I0209 16:52:20.797780 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/vgg16_preprocessing/1/model.savedmodel/saved_model.pb
triton-docker-triton-1  | I0209 16:52:20.922336 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/vgg16_preprocessing/1/model.savedmodel/variables
triton-docker-triton-1  | I0209 16:52:21.021022 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/vgg16_preprocessing/1/model.savedmodel/variables
triton-docker-triton-1  | I0209 16:52:21.117080 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/vgg16_preprocessing/1/model.savedmodel/variables
triton-docker-triton-1  | I0209 16:52:21.182442 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/vgg16_preprocessing/1/model.savedmodel/variables/variables.data-00000-of-00001
triton-docker-triton-1  | I0209 16:52:21.277828 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/vgg16_preprocessing/1/model.savedmodel/variables/variables.data-00000-of-00001
triton-docker-triton-1  | I0209 16:52:21.403064 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/vgg16_preprocessing/1/model.savedmodel/variables/variables.index
triton-docker-triton-1  | I0209 16:52:21.493935 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/vgg16_preprocessing/1/model.savedmodel/variables/variables.index
triton-docker-triton-1  | I0209 16:52:21.619999 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/vgg16_preprocessing/config.pbtxt
triton-docker-triton-1  | I0209 16:52:21.710953 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/vgg16_preprocessing/config.pbtxt
triton-docker-triton-1  | I0209 16:52:21.835252 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/vgg16_preprocessing/config.pbtxt
triton-docker-triton-1  | I0209 16:52:21.958197 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/vgg16_preprocessing/config.pbtxt
triton-docker-triton-1  | I0209 16:52:22.134437 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/vgg16_preprocessing
triton-docker-triton-1  | I0209 16:52:22.370151 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/vgg16_preprocessing/1
triton-docker-triton-1  | I0209 16:52:22.442005 1 model_config_utils.cc:647] Server side auto-completed config: name: "vgg16_preprocessing"
triton-docker-triton-1  | platform: "tensorflow_savedmodel"
triton-docker-triton-1  | input {
triton-docker-triton-1  |   name: "input_1"
triton-docker-triton-1  |   data_type: TYPE_FP32
triton-docker-triton-1  |   dims: -1
triton-docker-triton-1  |   dims: 224
triton-docker-triton-1  |   dims: 224
triton-docker-triton-1  |   dims: 3
triton-docker-triton-1  | }
triton-docker-triton-1  | output {
triton-docker-triton-1  |   name: "model"
triton-docker-triton-1  |   data_type: TYPE_FP32
triton-docker-triton-1  |   dims: -1
triton-docker-triton-1  |   dims: 4096
triton-docker-triton-1  | }
triton-docker-triton-1  | instance_group {
triton-docker-triton-1  |   count: 1
triton-docker-triton-1  |   kind: KIND_CPU
triton-docker-triton-1  | }
triton-docker-triton-1  | default_model_filename: "model.savedmodel"
triton-docker-triton-1  | backend: "tensorflow"
triton-docker-triton-1  | 
triton-docker-triton-1  | 
triton-docker-triton-1  | I0209 16:52:22.442129 1 model_lifecycle.cc:431] AsyncLoad() 'vgg16_preprocessing'
triton-docker-triton-1  | I0209 16:52:22.442156 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/vgg16_preprocessing
triton-docker-triton-1  | I0209 16:52:22.650199 1 model_lifecycle.cc:462] loading: vgg16_preprocessing:1
triton-docker-triton-1  | I0209 16:52:22.650234 1 model_lifecycle.cc:431] AsyncLoad() 'bls_clust_v1'
triton-docker-triton-1  | I0209 16:52:22.650243 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/bls_clust_v1
triton-docker-triton-1  | I0209 16:52:22.650260 1 model_lifecycle.cc:536] CreateModel() 'vgg16_preprocessing' version 1
triton-docker-triton-1  | I0209 16:52:22.650289 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/vgg16_preprocessing
triton-docker-triton-1  | I0209 16:52:23.380199 1 model_lifecycle.cc:462] loading: bls_clust_v1:1
triton-docker-triton-1  | I0209 16:52:23.380264 1 model_lifecycle.cc:536] CreateModel() 'bls_clust_v1' version 1
triton-docker-triton-1  | I0209 16:52:23.380349 1 filesystem.cc:2382] Using credential    for path  s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository/bls_clust_v1
triton-docker-triton-1  | I0209 16:52:42.122374 1 backend_model.cc:362] Adding default backend config setting: default-max-batch-size,4
triton-docker-triton-1  | I0209 16:52:42.122462 1 shared_library.cc:112] OpenLibraryHandle: /opt/tritonserver/backends/python/libtriton_python.so
triton-docker-triton-1  | I0209 16:52:42.125631 1 python_be.cc:1858] 'python' TRITONBACKEND API version: 1.12
triton-docker-triton-1  | I0209 16:52:42.125665 1 python_be.cc:1880] backend configuration:
triton-docker-triton-1  | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}}
triton-docker-triton-1  | I0209 16:52:42.125688 1 python_be.cc:2010] Shared memory configuration is shm-default-byte-size=67108864,shm-growth-byte-size=67108864,stub-timeout-seconds=30
triton-docker-triton-1  | I0209 16:52:42.125822 1 python_be.cc:2256] TRITONBACKEND_GetBackendAttribute: setting attributes
triton-docker-triton-1  | I0209 16:52:42.125886 1 python_be.cc:2058] TRITONBACKEND_ModelInitialize: bls_clust_v1 (version 1)
triton-docker-triton-1  | I0209 16:52:42.126810 1 model_config_utils.cc:1839] ModelConfig 64-bit fields:
triton-docker-triton-1  | I0209 16:52:42.126833 1 model_config_utils.cc:1841]   ModelConfig::dynamic_batching::default_queue_policy::default_timeout_microseconds
triton-docker-triton-1  | I0209 16:52:42.126839 1 model_config_utils.cc:1841]   ModelConfig::dynamic_batching::max_queue_delay_microseconds
triton-docker-triton-1  | I0209 16:52:42.126844 1 model_config_utils.cc:1841]   ModelConfig::dynamic_batching::priority_queue_policy::value::default_timeout_microseconds
triton-docker-triton-1  | I0209 16:52:42.126849 1 model_config_utils.cc:1841]   ModelConfig::ensemble_scheduling::step::model_version
triton-docker-triton-1  | I0209 16:52:42.126854 1 model_config_utils.cc:1841]   ModelConfig::input::dims
triton-docker-triton-1  | I0209 16:52:42.126859 1 model_config_utils.cc:1841]   ModelConfig::input::reshape::shape
triton-docker-triton-1  | I0209 16:52:42.126864 1 model_config_utils.cc:1841]   ModelConfig::instance_group::secondary_devices::device_id
triton-docker-triton-1  | I0209 16:52:42.126869 1 model_config_utils.cc:1841]   ModelConfig::model_warmup::inputs::value::dims
triton-docker-triton-1  | I0209 16:52:42.126874 1 model_config_utils.cc:1841]   ModelConfig::optimization::cuda::graph_spec::graph_lower_bound::input::value::dim
triton-docker-triton-1  | I0209 16:52:42.126879 1 model_config_utils.cc:1841]   ModelConfig::optimization::cuda::graph_spec::input::value::dim
triton-docker-triton-1  | I0209 16:52:42.126884 1 model_config_utils.cc:1841]   ModelConfig::output::dims
triton-docker-triton-1  | I0209 16:52:42.126889 1 model_config_utils.cc:1841]   ModelConfig::output::reshape::shape
triton-docker-triton-1  | I0209 16:52:42.126894 1 model_config_utils.cc:1841]   ModelConfig::sequence_batching::direct::max_queue_delay_microseconds
triton-docker-triton-1  | I0209 16:52:42.126899 1 model_config_utils.cc:1841]   ModelConfig::sequence_batching::max_sequence_idle_microseconds
triton-docker-triton-1  | I0209 16:52:42.126904 1 model_config_utils.cc:1841]   ModelConfig::sequence_batching::oldest::max_queue_delay_microseconds
triton-docker-triton-1  | I0209 16:52:42.126909 1 model_config_utils.cc:1841]   ModelConfig::sequence_batching::state::dims
triton-docker-triton-1  | I0209 16:52:42.126914 1 model_config_utils.cc:1841]   ModelConfig::sequence_batching::state::initial_state::dims
triton-docker-triton-1  | I0209 16:52:42.126920 1 model_config_utils.cc:1841]   ModelConfig::version_policy::specific::versions
triton-docker-triton-1  | I0209 16:52:42.127053 1 python_be.cc:1749] Using Python execution env /tmp/folderHvXdd6/bls-clustering-env.tar.gz
triton-docker-triton-1  | I0209 16:52:42.127122 1 pb_env.cc:271] Extracting Python execution env /tmp/folderHvXdd6/bls-clustering-env.tar.gz
triton-docker-triton-1  | I0209 16:52:44.976506 1 stub_launcher.cc:257] Starting Python backend stub: source /tmp/python_env_nVUDL5/0/bin/activate && exec env LD_LIBRARY_PATH=/tmp/python_env_nVUDL5/0/lib:$LD_LIBRARY_PATH /opt/tritonserver/backends/python/triton_python_backend_stub /tmp/folderHvXdd6/1/model.py triton_python_backend_shm_region_1 67108864 67108864 1 /opt/tritonserver/backends/python 336 bls_clust_v1
triton-docker-triton-1  | I0209 16:52:46.963162 1 python_be.cc:1838] model configuration:
triton-docker-triton-1  | {
triton-docker-triton-1  |     "name": "bls_clust_v1",
triton-docker-triton-1  |     "platform": "",
triton-docker-triton-1  |     "backend": "python",
triton-docker-triton-1  |     "version_policy": {
triton-docker-triton-1  |         "latest": {
triton-docker-triton-1  |             "num_versions": 1
triton-docker-triton-1  |         }
triton-docker-triton-1  |     },
triton-docker-triton-1  |     "max_batch_size": 0,
triton-docker-triton-1  |     "input": [
triton-docker-triton-1  |         {
triton-docker-triton-1  |             "name": "INPUT",
triton-docker-triton-1  |             "data_type": "TYPE_UINT8",
triton-docker-triton-1  |             "format": "FORMAT_NONE",
triton-docker-triton-1  |             "dims": [
triton-docker-triton-1  |                 -1
triton-docker-triton-1  |             ],
triton-docker-triton-1  |             "is_shape_tensor": false,
triton-docker-triton-1  |             "allow_ragged_batch": false,
triton-docker-triton-1  |             "optional": false
triton-docker-triton-1  |         }
triton-docker-triton-1  |     ],
triton-docker-triton-1  |     "output": [
triton-docker-triton-1  |         {
triton-docker-triton-1  |             "name": "OUTPUT",
triton-docker-triton-1  |             "data_type": "TYPE_FP32",
triton-docker-triton-1  |             "dims": [
triton-docker-triton-1  |                 1
triton-docker-triton-1  |             ],
triton-docker-triton-1  |             "label_filename": "",
triton-docker-triton-1  |             "is_shape_tensor": false
triton-docker-triton-1  |         },
triton-docker-triton-1  |         {
triton-docker-triton-1  |             "name": "dists",
triton-docker-triton-1  |             "data_type": "TYPE_FP32",
triton-docker-triton-1  |             "dims": [
triton-docker-triton-1  |                 5
triton-docker-triton-1  |             ],
triton-docker-triton-1  |             "label_filename": "",
triton-docker-triton-1  |             "is_shape_tensor": false
triton-docker-triton-1  |         }
triton-docker-triton-1  |     ],
triton-docker-triton-1  |     "batch_input": [],
triton-docker-triton-1  |     "batch_output": [],
triton-docker-triton-1  |     "optimization": {
triton-docker-triton-1  |         "priority": "PRIORITY_DEFAULT",
triton-docker-triton-1  |         "input_pinned_memory": {
triton-docker-triton-1  |             "enable": true
triton-docker-triton-1  |         },
triton-docker-triton-1  |         "output_pinned_memory": {
triton-docker-triton-1  |             "enable": true
triton-docker-triton-1  |         },
triton-docker-triton-1  |         "gather_kernel_buffer_threshold": 0,
triton-docker-triton-1  |         "eager_batching": false
triton-docker-triton-1  |     },
triton-docker-triton-1  |     "sequence_batching": {
triton-docker-triton-1  |         "direct": {
triton-docker-triton-1  |             "max_queue_delay_microseconds": 0,
triton-docker-triton-1  |             "minimum_slot_utilization": 0
triton-docker-triton-1  |         },
triton-docker-triton-1  |         "max_sequence_idle_microseconds": 60000000,
triton-docker-triton-1  |         "control_input": [],
triton-docker-triton-1  |         "state": []
triton-docker-triton-1  |     },
triton-docker-triton-1  |     "instance_group": [
triton-docker-triton-1  |         {
triton-docker-triton-1  |             "name": "bls_clust_v1_0",
triton-docker-triton-1  |             "kind": "KIND_CPU",
triton-docker-triton-1  |             "count": 2,
triton-docker-triton-1  |             "gpus": [],
triton-docker-triton-1  |             "secondary_devices": [],
triton-docker-triton-1  |             "profile": [],
triton-docker-triton-1  |             "passive": false,
triton-docker-triton-1  |             "host_policy": ""
triton-docker-triton-1  |         }
triton-docker-triton-1  |     ],
triton-docker-triton-1  |     "default_model_filename": "model.py",
triton-docker-triton-1  |     "cc_model_filenames": {},
triton-docker-triton-1  |     "metric_tags": {},
triton-docker-triton-1  |     "parameters": {
triton-docker-triton-1  |         "EXECUTION_ENV_PATH": {
triton-docker-triton-1  |             "string_value": "$$TRITON_MODEL_DIRECTORY/bls-clustering-env.tar.gz"
triton-docker-triton-1  |         }
triton-docker-triton-1  |     },
triton-docker-triton-1  |     "model_warmup": []
triton-docker-triton-1  | }
triton-docker-triton-1  | I0209 16:52:46.963399 1 python_be.cc:2102] TRITONBACKEND_ModelInstanceInitialize: bls_clust_v1_0_0 (CPU device 0)
triton-docker-triton-1  | I0209 16:52:46.963443 1 backend_model_instance.cc:68] Creating instance bls_clust_v1_0_0 on CPU using artifact 'model.py'
triton-docker-triton-1  | I0209 16:52:46.988791 1 stub_launcher.cc:257] Starting Python backend stub: source /tmp/python_env_nVUDL5/0/bin/activate && exec env LD_LIBRARY_PATH=/tmp/python_env_nVUDL5/0/lib:$LD_LIBRARY_PATH /opt/tritonserver/backends/python/triton_python_backend_stub /tmp/folderHvXdd6/1/model.py triton_python_backend_shm_region_2 67108864 67108864 1 /opt/tritonserver/backends/python 336 bls_clust_v1_0_0
triton-docker-triton-1  | I0209 16:52:47.729206 1 python_be.cc:2123] TRITONBACKEND_ModelInstanceInitialize: instance initialization successful bls_clust_v1_0_0 (device 0)
triton-docker-triton-1  | I0209 16:52:47.729447 1 backend_model_instance.cc:800] Starting backend thread for bls_clust_v1_0_0 at nice 0 on device 0...
triton-docker-triton-1  | I0209 16:52:47.729603 1 python_be.cc:2102] TRITONBACKEND_ModelInstanceInitialize: bls_clust_v1_0_1 (CPU device 0)
triton-docker-triton-1  | I0209 16:52:47.729642 1 backend_model_instance.cc:68] Creating instance bls_clust_v1_0_1 on CPU using artifact 'model.py'
triton-docker-triton-1  | I0209 16:52:47.750876 1 stub_launcher.cc:257] Starting Python backend stub: source /tmp/python_env_nVUDL5/0/bin/activate && exec env LD_LIBRARY_PATH=/tmp/python_env_nVUDL5/0/lib:$LD_LIBRARY_PATH /opt/tritonserver/backends/python/triton_python_backend_stub /tmp/folderHvXdd6/1/model.py triton_python_backend_shm_region_3 67108864 67108864 1 /opt/tritonserver/backends/python 336 bls_clust_v1_0_1
triton-docker-triton-1  | I0209 16:52:48.559743 1 python_be.cc:2123] TRITONBACKEND_ModelInstanceInitialize: instance initialization successful bls_clust_v1_0_1 (device 0)
triton-docker-triton-1  | I0209 16:52:48.559917 1 backend_model_instance.cc:800] Starting backend thread for bls_clust_v1_0_1 at nice 0 on device 0...
triton-docker-triton-1  | I0209 16:52:48.560177 1 sequence_batch_scheduler.cc:1276] Starting Direct sequence-batch scheduler thread 0 at nice 0...
triton-docker-triton-1  | I0209 16:52:48.560236 1 sequence_batch_scheduler.cc:1276] Starting Direct sequence-batch scheduler thread 1 at nice 0...
triton-docker-triton-1  | I0209 16:52:48.560313 1 model_lifecycle.cc:672] OnLoadComplete() 'bls_clust_v1' version 1
triton-docker-triton-1  | I0209 16:52:48.560401 1 model_lifecycle.cc:710] OnLoadFinal() 'bls_clust_v1' for all version(s)
triton-docker-triton-1  | I0209 16:52:48.560404 1 sequence_batch_scheduler.cc:819] Starting sequence-batch reaper thread at nice 10...
triton-docker-triton-1  | I0209 16:52:48.560439 1 model_lifecycle.cc:815] successfully loaded 'bls_clust_v1'

<-- Here is where it crashes in microk8s -->
triton-docker-triton-1  | I0209 16:52:48.560440 1 sequence_batch_scheduler.cc:968] Reaper: sleeping for 60000000us...
triton-docker-triton-1  | I0209 16:53:48.560629 1 sequence_batch_scheduler.cc:968] Reaper: sleeping for 60000000us...
triton-docker-triton-1  | I0209 16:53:53.294257 1 backend_model.cc:362] Adding default backend config setting: default-max-batch-size,4
triton-docker-triton-1  | I0209 16:53:53.294316 1 shared_library.cc:112] OpenLibraryHandle: /opt/tritonserver/backends/tensorflow/libtriton_tensorflow.so
triton-docker-triton-1  | I0209 16:53:53.739514 1 tensorflow.cc:2577] TRITONBACKEND_Initialize: tensorflow
triton-docker-triton-1  | I0209 16:53:53.739571 1 tensorflow.cc:2587] Triton TRITONBACKEND API version: 1.12
triton-docker-triton-1  | I0209 16:53:53.739578 1 tensorflow.cc:2593] 'tensorflow' TRITONBACKEND API version: 1.12
triton-docker-triton-1  | I0209 16:53:53.739585 1 tensorflow.cc:2617] backend configuration:
triton-docker-triton-1  | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}}
triton-docker-triton-1  | I0209 16:53:53.739650 1 tensorflow.cc:2683] TRITONBACKEND_ModelInitialize: vgg16_preprocessing (version 1)
triton-docker-triton-1  | 2024-02-09 16:53:53.740456: I tensorflow/cc/saved_model/reader.cc:45] Reading SavedModel from: /tmp/foldervclxaf/1/model.savedmodel
triton-docker-triton-1  | 2024-02-09 16:53:53.746015: I tensorflow/cc/saved_model/reader.cc:89] Reading meta graph with tags { serve }
triton-docker-triton-1  | 2024-02-09 16:53:53.746094: I tensorflow/cc/saved_model/reader.cc:130] Reading SavedModel debug info (if present) from: /tmp/foldervclxaf/1/model.savedmodel
triton-docker-triton-1  | 2024-02-09 16:53:53.746307: I tensorflow/core/platform/cpu_feature_guard.cc:183] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
triton-docker-triton-1  | To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX, in other operations, rebuild TensorFlow with the appropriate compiler flags.
triton-docker-triton-1  | 2024-02-09 16:53:53.797125: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:353] MLIR V1 optimization pass is not enabled
triton-docker-triton-1  | 2024-02-09 16:53:53.799989: I tensorflow/cc/saved_model/loader.cc:233] Restoring SavedModel bundle.
triton-docker-triton-1  | 2024-02-09 16:53:54.210414: I tensorflow/cc/saved_model/loader.cc:217] Running initialization op on SavedModel bundle at path: /tmp/foldervclxaf/1/model.savedmodel
triton-docker-triton-1  | 2024-02-09 16:53:54.249730: I tensorflow/cc/saved_model/loader.cc:334] SavedModel load for tags { serve }; Status: success: OK. Took 509286 microseconds.
triton-docker-triton-1  | I0209 16:53:54.287628 1 tensorflow.cc:1833] model configuration:
triton-docker-triton-1  | {
triton-docker-triton-1  |     "name": "vgg16_preprocessing",
triton-docker-triton-1  |     "platform": "tensorflow_savedmodel",
triton-docker-triton-1  |     "backend": "tensorflow",
triton-docker-triton-1  |     "version_policy": {
triton-docker-triton-1  |         "latest": {
triton-docker-triton-1  |             "num_versions": 1
triton-docker-triton-1  |         }
triton-docker-triton-1  |     },
triton-docker-triton-1  |     "max_batch_size": 0,
triton-docker-triton-1  |     "input": [
triton-docker-triton-1  |         {
triton-docker-triton-1  |             "name": "input_1",
triton-docker-triton-1  |             "data_type": "TYPE_FP32",
triton-docker-triton-1  |             "format": "FORMAT_NONE",
triton-docker-triton-1  |             "dims": [
triton-docker-triton-1  |                 -1,
triton-docker-triton-1  |                 224,
triton-docker-triton-1  |                 224,
triton-docker-triton-1  |                 3
triton-docker-triton-1  |             ],
triton-docker-triton-1  |             "is_shape_tensor": false,
triton-docker-triton-1  |             "allow_ragged_batch": false,
triton-docker-triton-1  |             "optional": false
triton-docker-triton-1  |         }
triton-docker-triton-1  |     ],
triton-docker-triton-1  |     "output": [
triton-docker-triton-1  |         {
triton-docker-triton-1  |             "name": "model",
triton-docker-triton-1  |             "data_type": "TYPE_FP32",
triton-docker-triton-1  |             "dims": [
triton-docker-triton-1  |                 -1,
triton-docker-triton-1  |                 4096
triton-docker-triton-1  |             ],
triton-docker-triton-1  |             "label_filename": "",
triton-docker-triton-1  |             "is_shape_tensor": false
triton-docker-triton-1  |         }
triton-docker-triton-1  |     ],
triton-docker-triton-1  |     "batch_input": [],
triton-docker-triton-1  |     "batch_output": [],
triton-docker-triton-1  |     "optimization": {
triton-docker-triton-1  |         "priority": "PRIORITY_DEFAULT",
triton-docker-triton-1  |         "input_pinned_memory": {
triton-docker-triton-1  |             "enable": true
triton-docker-triton-1  |         },
triton-docker-triton-1  |         "output_pinned_memory": {
triton-docker-triton-1  |             "enable": true
triton-docker-triton-1  |         },
triton-docker-triton-1  |         "gather_kernel_buffer_threshold": 0,
triton-docker-triton-1  |         "eager_batching": false
triton-docker-triton-1  |     },
triton-docker-triton-1  |     "instance_group": [
triton-docker-triton-1  |         {
triton-docker-triton-1  |             "name": "vgg16_preprocessing_0",
triton-docker-triton-1  |             "kind": "KIND_CPU",
triton-docker-triton-1  |             "count": 1,
triton-docker-triton-1  |             "gpus": [],
triton-docker-triton-1  |             "secondary_devices": [],
triton-docker-triton-1  |             "profile": [],
triton-docker-triton-1  |             "passive": false,
triton-docker-triton-1  |             "host_policy": ""
triton-docker-triton-1  |         }
triton-docker-triton-1  |     ],
triton-docker-triton-1  |     "default_model_filename": "model.savedmodel",
triton-docker-triton-1  |     "cc_model_filenames": {},
triton-docker-triton-1  |     "metric_tags": {},
triton-docker-triton-1  |     "parameters": {},
triton-docker-triton-1  |     "model_warmup": []
triton-docker-triton-1  | }
triton-docker-triton-1  | I0209 16:53:54.287858 1 tensorflow.cc:2732] TRITONBACKEND_ModelInstanceInitialize: vgg16_preprocessing_0 (CPU device 0)
triton-docker-triton-1  | I0209 16:53:54.287878 1 backend_model_instance.cc:68] Creating instance vgg16_preprocessing_0 on CPU using artifact 'model.savedmodel'
triton-docker-triton-1  | 2024-02-09 16:53:54.288045: I tensorflow/cc/saved_model/reader.cc:45] Reading SavedModel from: /tmp/foldervclxaf/1/model.savedmodel
triton-docker-triton-1  | 2024-02-09 16:53:54.291866: I tensorflow/cc/saved_model/reader.cc:89] Reading meta graph with tags { serve }
triton-docker-triton-1  | 2024-02-09 16:53:54.291925: I tensorflow/cc/saved_model/reader.cc:130] Reading SavedModel debug info (if present) from: /tmp/foldervclxaf/1/model.savedmodel
triton-docker-triton-1  | 2024-02-09 16:53:54.302371: I tensorflow/cc/saved_model/loader.cc:233] Restoring SavedModel bundle.
triton-docker-triton-1  | 2024-02-09 16:53:54.716398: I tensorflow/cc/saved_model/loader.cc:217] Running initialization op on SavedModel bundle at path: /tmp/foldervclxaf/1/model.savedmodel
triton-docker-triton-1  | 2024-02-09 16:53:54.754026: I tensorflow/cc/saved_model/loader.cc:334] SavedModel load for tags { serve }; Status: success: OK. Took 465985 microseconds.
triton-docker-triton-1  | I0209 16:53:54.754252 1 backend_model_instance.cc:800] Starting backend thread for vgg16_preprocessing_0 at nice 0 on device 0...
triton-docker-triton-1  | I0209 16:53:54.754479 1 model_lifecycle.cc:672] OnLoadComplete() 'vgg16_preprocessing' version 1
triton-docker-triton-1  | I0209 16:53:54.754514 1 model_lifecycle.cc:710] OnLoadFinal() 'vgg16_preprocessing' for all version(s)
triton-docker-triton-1  | I0209 16:53:54.754525 1 model_lifecycle.cc:815] successfully loaded 'vgg16_preprocessing'
triton-docker-triton-1  | I0209 16:53:54.754644 1 model_lifecycle.cc:286] VersionStates() 'vgg16_preprocessing'
triton-docker-triton-1  | I0209 16:53:54.754710 1 model_lifecycle.cc:286] VersionStates() 'bls_clust_v1'
triton-docker-triton-1  | I0209 16:53:54.754991 1 model_lifecycle.cc:286] VersionStates() 'bls_clust_v1'
triton-docker-triton-1  | I0209 16:53:54.755005 1 model_lifecycle.cc:286] VersionStates() 'vgg16_preprocessing'
triton-docker-triton-1  | I0209 16:53:54.755045 1 server.cc:582] 
triton-docker-triton-1  | +------------------+------+
triton-docker-triton-1  | | Repository Agent | Path |
triton-docker-triton-1  | +------------------+------+
triton-docker-triton-1  | +------------------+------+
triton-docker-triton-1  | 
triton-docker-triton-1  | 
triton-docker-triton-1  | I0209 16:53:54.755227 1 server.cc:609] 
triton-docker-triton-1  | +------------+---------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-docker-triton-1  | | Backend    | Path                                                          | Config                                                                                                                                                        |
triton-docker-triton-1  | +------------+---------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-docker-triton-1  | | python     | /opt/tritonserver/backends/python/libtriton_python.so         | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
triton-docker-triton-1  | | tensorflow | /opt/tritonserver/backends/tensorflow/libtriton_tensorflow.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
triton-docker-triton-1  | +------------+---------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-docker-triton-1  | 
triton-docker-triton-1  | I0209 16:53:54.755241 1 model_lifecycle.cc:265] ModelStates()
triton-docker-triton-1  | I0209 16:53:54.755288 1 server.cc:652] 
triton-docker-triton-1  | +---------------------+---------+--------+
triton-docker-triton-1  | | Model               | Version | Status |
triton-docker-triton-1  | +---------------------+---------+--------+
triton-docker-triton-1  | | bls_clust_v1        | 1       | READY  |
triton-docker-triton-1  | | vgg16_preprocessing | 1       | READY  |
triton-docker-triton-1  | +---------------------+---------+--------+
triton-docker-triton-1  | 
triton-docker-triton-1  | 
triton-docker-triton-1  | I0209 16:53:54.755520 1 metrics.cc:701] Collecting CPU metrics
triton-docker-triton-1  | I0209 16:53:54.755753 1 tritonserver.cc:2385] 
triton-docker-triton-1  | +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-docker-triton-1  | | Option                           | Value                                                                                                                                                                                                           |
triton-docker-triton-1  | +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-docker-triton-1  | | server_id                        | triton                                                                                                                                                                                                          |
triton-docker-triton-1  | | server_version                   | 2.34.0                                                                                                                                                                                                          |
triton-docker-triton-1  | | server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
triton-docker-triton-1  | | model_repository_path[0]         | s3://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/model_repository                                                                                                                                                 |
triton-docker-triton-1  | | model_control_mode               | MODE_EXPLICIT                                                                                                                                                                                                   |
triton-docker-triton-1  | | startup_models_0                 | bls_clust_v1                                                                                                                                                                                                    |
triton-docker-triton-1  | | startup_models_1                 | vgg16_preprocessing                                                                                                                                                                                             |
triton-docker-triton-1  | | strict_model_config              | 0                                                                                                                                                                                                               |
triton-docker-triton-1  | | rate_limit                       | OFF                                                                                                                                                                                                             |
triton-docker-triton-1  | | pinned_memory_pool_byte_size     | 268435456                                                                                                                                                                                                       |
triton-docker-triton-1  | | min_supported_compute_capability | 6.0                                                                                                                                                                                                             |
triton-docker-triton-1  | | strict_readiness                 | 1                                                                                                                                                                                                               |
triton-docker-triton-1  | | exit_timeout                     | 30                                                                                                                                                                                                              |
triton-docker-triton-1  | | cache_enabled                    | 0                                                                                                                                                                                                               |
triton-docker-triton-1  | +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-docker-triton-1  | 
triton-docker-triton-1  | 
triton-docker-triton-1  | I0209 16:53:54.756417 1 grpc_server.cc:2344] 
triton-docker-triton-1  | +----------------------------------------------+---------+
triton-docker-triton-1  | | GRPC KeepAlive Option                        | Value   |
triton-docker-triton-1  | +----------------------------------------------+---------+
triton-docker-triton-1  | | keepalive_time_ms                            | 7200000 |
triton-docker-triton-1  | | keepalive_timeout_ms                         | 20000   |
triton-docker-triton-1  | | keepalive_permit_without_calls               | 0       |
triton-docker-triton-1  | | http2_max_pings_without_data                 | 2       |
triton-docker-triton-1  | | http2_min_recv_ping_interval_without_data_ms | 300000  |
triton-docker-triton-1  | | http2_max_ping_strikes                       | 2       |
triton-docker-triton-1  | +----------------------------------------------+---------+
triton-docker-triton-1  | 
triton-docker-triton-1  | 
triton-docker-triton-1  | I0209 16:53:54.757158 1 grpc_server.cc:128] Ready for RPC 'Check', 0
triton-docker-triton-1  | I0209 16:53:54.757199 1 grpc_server.cc:128] Ready for RPC 'ServerLive', 0
triton-docker-triton-1  | I0209 16:53:54.757209 1 grpc_server.cc:128] Ready for RPC 'ServerReady', 0
triton-docker-triton-1  | I0209 16:53:54.757222 1 grpc_server.cc:128] Ready for RPC 'ModelReady', 0
triton-docker-triton-1  | I0209 16:53:54.757231 1 grpc_server.cc:128] Ready for RPC 'ServerMetadata', 0
triton-docker-triton-1  | I0209 16:53:54.757240 1 grpc_server.cc:128] Ready for RPC 'ModelMetadata', 0
triton-docker-triton-1  | I0209 16:53:54.757259 1 grpc_server.cc:128] Ready for RPC 'ModelConfig', 0
triton-docker-triton-1  | I0209 16:53:54.757272 1 grpc_server.cc:128] Ready for RPC 'SystemSharedMemoryStatus', 0
triton-docker-triton-1  | I0209 16:53:54.757283 1 grpc_server.cc:128] Ready for RPC 'SystemSharedMemoryRegister', 0
triton-docker-triton-1  | I0209 16:53:54.757297 1 grpc_server.cc:128] Ready for RPC 'SystemSharedMemoryUnregister', 0
triton-docker-triton-1  | I0209 16:53:54.757309 1 grpc_server.cc:128] Ready for RPC 'CudaSharedMemoryStatus', 0
triton-docker-triton-1  | I0209 16:53:54.757317 1 grpc_server.cc:128] Ready for RPC 'CudaSharedMemoryRegister', 0
triton-docker-triton-1  | I0209 16:53:54.757327 1 grpc_server.cc:128] Ready for RPC 'CudaSharedMemoryUnregister', 0
triton-docker-triton-1  | I0209 16:53:54.757341 1 grpc_server.cc:128] Ready for RPC 'RepositoryIndex', 0
triton-docker-triton-1  | I0209 16:53:54.757351 1 grpc_server.cc:128] Ready for RPC 'RepositoryModelLoad', 0
triton-docker-triton-1  | I0209 16:53:54.757361 1 grpc_server.cc:128] Ready for RPC 'RepositoryModelUnload', 0
triton-docker-triton-1  | I0209 16:53:54.757373 1 grpc_server.cc:128] Ready for RPC 'ModelStatistics', 0
triton-docker-triton-1  | I0209 16:53:54.757386 1 grpc_server.cc:128] Ready for RPC 'Trace', 0
triton-docker-triton-1  | I0209 16:53:54.757399 1 grpc_server.cc:128] Ready for RPC 'Logging', 0
triton-docker-triton-1  | I0209 16:53:54.757425 1 grpc_server.cc:377] Thread started for CommonHandler
triton-docker-triton-1  | I0209 16:53:54.757566 1 infer_handler.cc:629] New request handler for ModelInferHandler, 0
triton-docker-triton-1  | I0209 16:53:54.757633 1 infer_handler.h:1025] Thread started for ModelInferHandler
triton-docker-triton-1  | I0209 16:53:54.757761 1 infer_handler.cc:629] New request handler for ModelInferHandler, 0
triton-docker-triton-1  | I0209 16:53:54.757816 1 infer_handler.h:1025] Thread started for ModelInferHandler
triton-docker-triton-1  | I0209 16:53:54.757974 1 stream_infer_handler.cc:122] New request handler for ModelStreamInferHandler, 0
triton-docker-triton-1  | I0209 16:53:54.758016 1 infer_handler.h:1025] Thread started for ModelStreamInferHandler
triton-docker-triton-1  | I0209 16:53:54.758023 1 grpc_server.cc:2450] Started GRPCInferenceService at 0.0.0.0:8001
triton-docker-triton-1  | I0209 16:53:54.758321 1 http_server.cc:3555] Started HTTPService at 0.0.0.0:8000
triton-docker-triton-1  | I0209 16:53:54.799641 1 http_server.cc:185] Started Metrics Service at 0.0.0.0:8002 

Reproduction Steps

  1. Deploy Triton with the same manifest I used (shown above). For the models to serve, try a simple VGG16 pretrained on ImageNet, deployed with the TensorFlow backend as SavedModel files. This should cause the pod to crash.
  2. I have tried deploying both models independently; each one makes the pod crash, although not at exactly the same place.
  3. When deploying without serving any models, the pod does not crash.

Introspection Report

inspection-report-20240209_114600.tar.gz

Can you suggest a fix?

Not sure, but I think this might be related to high memory usage. Triton needs access to shared memory at /dev/shm, so even on EKS I had to use the emptyDir strategy to mount memory at that path; without it I got the same kind of crash. It is as if this strategy is not working in microk8s.
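For reference, this is the pattern in question — a minimal sketch, not the full deployment; the pod and container names here are illustrative, and the `volumeMounts` entry is what actually replaces the default 64Mi /dev/shm inside the container:

```yaml
# Sketch: back /dev/shm with a RAM-backed (tmpfs) emptyDir.
# Pod/container names are illustrative, not from the original manifest.
apiVersion: v1
kind: Pod
metadata:
  name: shm-demo
spec:
  containers:
    - name: app
      image: nvcr.io/nvidia/tritonserver:23.05-py3
      volumeMounts:
        - name: dshm
          mountPath: /dev/shm   # overrides the container runtime's default 64Mi shm
  volumes:
    - name: dshm
      emptyDir:
        medium: Memory          # tmpfs; usage counts against the container's memory limit
        sizeLimit: "2Gi"
```

On a working cluster, `df -h /dev/shm` inside the container should then report the 2Gi tmpfs rather than the 64Mi default.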

Are you interested in contributing with a fix?

Sure

robertaistleitner commented 5 months ago

I have the same issue when I try to increase the size of /dev/shm using an emptyDir volume with medium: Memory. This is necessary to increase the shared memory that Postgres uses. For now I have removed medium: Memory for use with microk8s, and only use it on the production Kubernetes cluster.

I had a really hard time finding out that this was the issue, because my pod just failed without any proper error message.
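For anyone hitting the same wall on microk8s, the workaround amounts to dropping the `medium` field so the emptyDir falls back to node-local storage — a sketch of the volume fragment only, not a fix for the underlying bug:

```yaml
# Workaround sketch: omit `medium: Memory` so the emptyDir is disk-backed.
# Shared-memory-hungry workloads may run slower, but the pod should start.
volumes:
  - name: dshm
    emptyDir:
      sizeLimit: "2Gi"   # no `medium` field -> node default storage, not tmpfs
```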