Failed to load 'transformer_onnx_model'

Hi, I am trying to build a dense embedding inference using this repository. When I tried using the model on the example (sentence-transformers/msmarco-distilbert-cos-v5) it worked fine. However, using a different model, even the model from sentence-transformers itself, like sentence-transformers/all-MiniLM-L6-v2, won't run the inference. Here are the full logs:

=============================
== Triton Inference Server ==
=============================

NVIDIA Release 22.07 (build 41737377)
Triton Server Version 2.24.0

Copyright (c) 2018-2022, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

WARNING: The NVIDIA Driver was not detected.  GPU functionality will not be available.
   Use the NVIDIA Container Toolkit to start this container with GPU support; see
   https://docs.nvidia.com/datacenter/cloud-native/ .

Collecting transformers
  Downloading transformers-4.28.1-py3-none-any.whl (7.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.0/7.0 MB 9.2 MB/s eta 0:00:00
Collecting tqdm>=4.27
  Downloading tqdm-4.65.0-py3-none-any.whl (77 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 77.1/77.1 kB 5.8 MB/s eta 0:00:00
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.8/7.8 MB 9.4 MB/s eta 0:00:00
Collecting regex!=2019.12.17
  Downloading regex-2023.5.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (771 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 771.9/771.9 kB 8.9 MB/s eta 0:00:00
Collecting packaging>=20.0
  Downloading packaging-23.1-py3-none-any.whl (48 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 48.9/48.9 kB 3.7 MB/s eta 0:00:00
Requirement already satisfied: requests in /usr/lib/python3/dist-packages (from transformers) (2.22.0)
Collecting filelock
  Downloading filelock-3.12.0-py3-none-any.whl (10 kB)
Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.8/dist-packages (from transformers) (1.23.1)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 224.5/224.5 kB 9.1 MB/s eta 0:00:00
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (701 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 701.2/701.2 kB 9.0 MB/s eta 0:00:00
Collecting typing-extensions>=3.7.4.3
  Downloading typing_extensions-4.5.0-py3-none-any.whl (27 kB)
Collecting fsspec
  Downloading fsspec-2023.4.0-py3-none-any.whl (153 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 154.0/154.0 kB 7.4 MB/s eta 0:00:00
Installing collected packages: tokenizers, typing-extensions, tqdm, regex, pyyaml, packaging, fsspec, filelock, huggingface-hub, transformers
Successfully installed filelock-3.12.0 fsspec-2023.4.0 huggingface-hub-0.14.1 packaging-23.1 pyyaml-6.0 regex-2023.5.4 tokenizers-0.13.3 tqdm-4.65.0 transformers-4.28.1 typing-extensions-4.5.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
--- Logging error ---
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/pip/_internal/utils/logging.py", line 177, in emit
    self.console.print(renderable, overflow="ignore", crop=False, style=style)
  File "/usr/local/lib/python3.8/dist-packages/pip/_vendor/rich/console.py", line 1673, in print
    extend(render(renderable, render_options))
  File "/usr/local/lib/python3.8/dist-packages/pip/_vendor/rich/console.py", line 1305, in render
    for render_output in iter_render:
  File "/usr/local/lib/python3.8/dist-packages/pip/_internal/utils/logging.py", line 134, in __rich_console__
    for line in lines:
  File "/usr/local/lib/python3.8/dist-packages/pip/_vendor/rich/segment.py", line 249, in split_lines
    for segment in segments:
  File "/usr/local/lib/python3.8/dist-packages/pip/_vendor/rich/console.py", line 1283, in render
    renderable = rich_cast(renderable)
  File "/usr/local/lib/python3.8/dist-packages/pip/_vendor/rich/protocol.py", line 36, in rich_cast
    renderable = cast_method()
  File "/usr/local/lib/python3.8/dist-packages/pip/_internal/self_outdated_check.py", line 130, in __rich__
    pip_cmd = get_best_invocation_for_this_pip()
  File "/usr/local/lib/python3.8/dist-packages/pip/_internal/utils/entrypoints.py", line 58, in get_best_invocation_for_this_pip
    if found_executable and os.path.samefile(
  File "/usr/lib/python3.8/genericpath.py", line 101, in samefile
    s2 = os.stat(f2)
FileNotFoundError: [Errno 2] No such file or directory: '/usr/bin/pip3.8'
Call stack:
  File "/usr/local/bin/pip", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/pip/_internal/cli/main.py", line 70, in main
    return command.main(cmd_args)
  File "/usr/local/lib/python3.8/dist-packages/pip/_internal/cli/base_command.py", line 101, in main
    return self._main(args)
  File "/usr/local/lib/python3.8/dist-packages/pip/_internal/cli/base_command.py", line 223, in _main
    self.handle_pip_version_check(options)
  File "/usr/local/lib/python3.8/dist-packages/pip/_internal/cli/req_command.py", line 190, in handle_pip_version_check
    pip_self_version_check(session, options)
  File "/usr/local/lib/python3.8/dist-packages/pip/_internal/self_outdated_check.py", line 236, in pip_self_version_check
    logger.warning("[present-rich] %s", upgrade_prompt)
  File "/usr/lib/python3.8/logging/__init__.py", line 1458, in warning
    self._log(WARNING, msg, args, **kwargs)
  File "/usr/lib/python3.8/logging/__init__.py", line 1589, in _log
    self.handle(record)
  File "/usr/lib/python3.8/logging/__init__.py", line 1599, in handle
    self.callHandlers(record)
  File "/usr/lib/python3.8/logging/__init__.py", line 1661, in callHandlers
    hdlr.handle(record)
  File "/usr/lib/python3.8/logging/__init__.py", line 954, in handle
    self.emit(record)
  File "/usr/local/lib/python3.8/dist-packages/pip/_internal/utils/logging.py", line 179, in emit
    self.handleError(record)
Message: '[present-rich] %s'
Arguments: (UpgradePrompt(old='22.2.1', new='23.1.2'),)
W0502 22:17:28.840321 1 pinned_memory_manager.cc:236] Unable to allocate pinned system memory, pinned memory pool will not be available: CUDA driver version is insufficient for CUDA runtime version
I0502 22:17:28.840805 1 cuda_memory_manager.cc:115] CUDA memory pool disabled
I0502 22:17:29.264492 1 model_repository_manager.cc:1206] loading: transformer_onnx_tokenize:1
I0502 22:17:29.276594 1 model_repository_manager.cc:1206] loading: transformer_onnx_model:1
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
I0502 22:17:32.401196 1 onnxruntime.cc:2458] TRITONBACKEND_Initialize: onnxruntime
I0502 22:17:32.401275 1 onnxruntime.cc:2468] Triton TRITONBACKEND API version: 1.10
I0502 22:17:32.401310 1 onnxruntime.cc:2474] 'onnxruntime' TRITONBACKEND API version: 1.10
I0502 22:17:32.401328 1 onnxruntime.cc:2504] backend configuration:
{"cmdline":{"auto-complete-config":"true","min-compute-capability":"6.000000","backend-directory":"/opt/tritonserver/backends","default-max-batch-size":"4"}}
I0502 22:17:32.427419 1 onnxruntime.cc:2560] TRITONBACKEND_ModelInitialize: transformer_onnx_model (version 1)
I0502 22:17:32.428538 1 onnxruntime.cc:666] skipping model configuration auto-complete for 'transformer_onnx_model': inputs and outputs already specified
I0502 22:17:32.429336 1 onnxruntime.cc:2603] TRITONBACKEND_ModelInstanceInitialize: transformer_onnx_model_0 (CPU device 0)
I0502 22:17:32.769880 1 onnxruntime.cc:2637] TRITONBACKEND_ModelInstanceFinalize: delete instance state
I0502 22:17:32.769920 1 python_be.cc:1767] TRITONBACKEND_ModelInstanceInitialize: transformer_onnx_tokenize_0 (CPU device 0)
I0502 22:17:32.770159 1 onnxruntime.cc:2583] TRITONBACKEND_ModelFinalize: delete model state
E0502 22:17:32.770347 1 model_repository_manager.cc:1355] failed to load 'transformer_onnx_model' version 1: Invalid argument: model 'transformer_onnx_model', tensor 'output': the model expects 2 dimensions (shape [-1,384]) but the model configuration specifies 2 dimensions (shape [-1,-1])
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
I0502 22:17:34.640107 1 model_repository_manager.cc:1352] successfully loaded 'transformer_onnx_tokenize' version 1
E0502 22:17:34.640456 1 model_repository_manager.cc:1559] Invalid argument: ensemble 'transformer_onnx_inference' depends on 'transformer_onnx_model' which has no loaded version
I0502 22:17:34.640824 1 server.cc:559]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0502 22:17:34.640909 1 server.cc:586]
+-------------+----------------------------------+----------------------------------+
| Backend     | Path                             | Config                           |
+-------------+----------------------------------+----------------------------------+
| python      | /opt/tritonserver/backends/pytho | {"cmdline":{"auto-complete-confi |
|             | n/libtriton_python.so            | g":"true","min-compute-capabilit |
|             |                                  | y":"6.000000","backend-directory |
|             |                                  | ":"/opt/tritonserver/backends"," |
|             |                                  | default-max-batch-size":"4"}}    |
|             |                                  |                                  |
| onnxruntime | /opt/tritonserver/backends/onnxr | {"cmdline":{"auto-complete-confi |
|             | untime/libtriton_onnxruntime.so  | g":"true","min-compute-capabilit |
|             |                                  | y":"6.000000","backend-directory |
|             |                                  | ":"/opt/tritonserver/backends"," |
|             |                                  | default-max-batch-size":"4"}}    |
|             |                                  |                                  |
+-------------+----------------------------------+----------------------------------+

I0502 22:17:34.641023 1 server.cc:629]
+---------------------------+---------+---------------------------------------------+
| Model                     | Version | Status                                      |
+---------------------------+---------+---------------------------------------------+
| transformer_onnx_model    | 1       | UNAVAILABLE: Invalid argument: model 'trans |
|                           |         | former_onnx_model', tensor 'output': the mo |
|                           |         | del expects 2 dimensions (shape [-1,384]) b |
|                           |         | ut the model configuration specifies 2 dime |
|                           |         | nsions (shape [-1,-1])                      |
| transformer_onnx_tokenize | 1       | READY                                       |
+---------------------------+---------+---------------------------------------------+

I0502 22:17:34.641665 1 tritonserver.cc:2176]
+----------------------------------+------------------------------------------------+
| Option                           | Value                                          |
+----------------------------------+------------------------------------------------+
| server_id                        | triton                                         |
| server_version                   | 2.24.0                                         |
| server_extensions                | classification sequence model_repository model |
|                                  | _repository(unload_dependents) schedule_policy |
|                                  |  model_configuration system_shared_memory cuda |
|                                  | _shared_memory binary_tensor_data statistics t |
|                                  | race                                           |
| model_repository_path[0]         | /models                                        |
| model_control_mode               | MODE_NONE                                      |
| strict_model_config              | 0                                              |
| rate_limit                       | OFF                                            |
| pinned_memory_pool_byte_size     | 268435456                                      |
| response_cache_byte_size         | 0                                              |
| min_supported_compute_capability | 6.0                                            |
| strict_readiness                 | 1                                              |
| exit_timeout                     | 30                                             |
+----------------------------------+------------------------------------------------+

I0502 22:17:34.641753 1 server.cc:260] Waiting for in-flight requests to complete.
I0502 22:17:34.641779 1 server.cc:276] Timeout 30: Found 0 model versions that have in-flight inferences
I0502 22:17:34.641788 1 model_repository_manager.cc:1230] unloading: transformer_onnx_tokenize:1
I0502 22:17:34.641964 1 server.cc:291] All models are stopped, unloading models
I0502 22:17:34.641999 1 server.cc:298] Timeout 30: Found 1 live models and 0 in-flight non-inference requests
I0502 22:17:35.642214 1 server.cc:298] Timeout 29: Found 1 live models and 0 in-flight non-inference requests
I0502 22:17:35.867722 1 model_repository_manager.cc:1335] successfully unloaded 'transformer_onnx_tokenize' version 1
I0502 22:17:36.642586 1 server.cc:298] Timeout 28: Found 0 live models and 0 in-flight non-inference requests
error: creating server: Internal - failed to load all models

I am using a CPU-only device to test it, although I have tried it on a GPU device which also doesn't work. Would you happen to have any idea on how to fix it? here's the command I use:

docker run -it --rm -v $PWD":/project" ghcr.io/els-rd/transformer-deploy:latest bash -c "cd /project && convert_model -m 'sentence-transformers/all-MiniLM-L6-v2' --backend onnx --task embedding --seq-len 16 128 128"

docker run -it --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 -v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.07-py3 bash -c "pip install transformers && tritonserver --model-repository=/models"

ELS-RD / transformer-deploy

Failed to load 'transformer_onnx_model' #172