ELS-RD / transformer-deploy

Efficient, scalable and enterprise-grade CPU/GPU inference server for 🤗 Hugging Face transformer models 🚀
https://els-rd.github.io/transformer-deploy/
Apache License 2.0

Bug fix for embedding task #152

Closed · fursovia closed this 1 year ago

fursovia commented 1 year ago

I followed the instructions in the README:

docker run -it --rm --gpus all \
  -v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.5.1 \
  bash -c "cd /project && \
    convert_model -m \"sentence-transformers/msmarco-distilbert-cos-v5\" \
    --backend onnx \
    --task embedding \
    --seq-len 16 128 128"

docker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 256m \
  -v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.07-py3 \
  bash -c "pip install transformers && tritonserver --model-repository=/models"

This fails with:

I1101 17:05:46.144764 1 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f9006000000' with size 268435456
I1101 17:05:46.145324 1 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I1101 17:05:46.152409 1 model_repository_manager.cc:1206] loading: transformer_onnx_tokenize:1
I1101 17:05:46.152455 1 model_repository_manager.cc:1206] loading: transformer_onnx_model:1
I1101 17:05:46.156017 1 onnxruntime.cc:2458] TRITONBACKEND_Initialize: onnxruntime
I1101 17:05:46.156072 1 onnxruntime.cc:2468] Triton TRITONBACKEND API version: 1.10
I1101 17:05:46.156099 1 onnxruntime.cc:2474] 'onnxruntime' TRITONBACKEND API version: 1.10
I1101 17:05:46.156122 1 onnxruntime.cc:2504] backend configuration:
{"cmdline":{"auto-complete-config":"true","min-compute-capability":"6.000000","backend-directory":"/opt/tritonserver/backends","default-max-batch-size":"4"}}
I1101 17:05:46.172462 1 onnxruntime.cc:2560] TRITONBACKEND_ModelInitialize: transformer_onnx_model (version 1)
I1101 17:05:46.173322 1 onnxruntime.cc:666] skipping model configuration auto-complete for 'transformer_onnx_model': inputs and outputs already specified
I1101 17:05:46.177306 1 onnxruntime.cc:2603] TRITONBACKEND_ModelInstanceInitialize: transformer_onnx_model_0 (GPU device 0)
I1101 17:05:47.870828 1 onnxruntime.cc:2637] TRITONBACKEND_ModelInstanceFinalize: delete instance state
I1101 17:05:47.870886 1 onnxruntime.cc:2583] TRITONBACKEND_ModelFinalize: delete model state
E1101 17:05:47.870905 1 model_repository_manager.cc:1355] failed to load 'transformer_onnx_model' version 1: Invalid argument: model 'transformer_onnx_model', tensor 'output': the model expects 2 dimensions (shape [-1,-1]) but the model configuration specifies 2 dimensions (shape [-1,768])
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
I1101 17:05:49.552957 1 python_be.cc:1767] TRITONBACKEND_ModelInstanceInitialize: transformer_onnx_tokenize_0 (GPU device 0)
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
I1101 17:05:50.105093 1 model_repository_manager.cc:1352] successfully loaded 'transformer_onnx_tokenize' version 1
E1101 17:05:50.105182 1 model_repository_manager.cc:1559] Invalid argument: ensemble 'transformer_onnx_inference' depends on 'transformer_onnx_model' which has no loaded version
I1101 17:05:50.105232 1 server.cc:559]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I1101 17:05:50.105281 1 server.cc:586]
+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------+
| Backend     | Path                                                            | Config                                                                 |
+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------+
| onnxruntime | /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so | {"cmdline":{"auto-complete-config":"true","min-compute-capability":"6. |
|             |                                                                 | 000000","backend-directory":"/opt/tritonserver/backends","default-max- |
|             |                                                                 | batch-size":"4"}}                                                      |
| python      | /opt/tritonserver/backends/python/libtriton_python.so           | {"cmdline":{"auto-complete-config":"true","min-compute-capability":"6. |
|             |                                                                 | 000000","backend-directory":"/opt/tritonserver/backends","default-max- |
|             |                                                                 | batch-size":"4"}}                                                      |
+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------+

I1101 17:05:50.105335 1 server.cc:629]
+---------------------------+---------+------------------------------------------------------------------------------------------------------------------+
| Model                     | Version | Status                                                                                                           |
+---------------------------+---------+------------------------------------------------------------------------------------------------------------------+
| transformer_onnx_model    | 1       | UNAVAILABLE: Invalid argument: model 'transformer_onnx_model', tensor 'output': the model expects 2 dimensions ( |
|                           |         | shape [-1,-1]) but the model configuration specifies 2 dimensions (shape [-1,768])                               |
| transformer_onnx_tokenize | 1       | READY                                                                                                            |
+---------------------------+---------+------------------------------------------------------------------------------------------------------------------+

I1101 17:05:50.153105 1 metrics.cc:650] Collecting metrics for GPU 0: NVIDIA GeForce RTX 3090
I1101 17:05:50.153454 1 tritonserver.cc:2176]
+----------------------------------+----------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                               |
| server_version                   | 2.24.0                                                                                                               |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration sys |
|                                  | tem_shared_memory cuda_shared_memory binary_tensor_data statistics trace                                             |
| model_repository_path[0]         | /models                                                                                                              |
| model_control_mode               | MODE_NONE                                                                                                            |
| strict_model_config              | 0                                                                                                                    |
| rate_limit                       | OFF                                                                                                                  |
| pinned_memory_pool_byte_size     | 268435456                                                                                                            |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                             |
| response_cache_byte_size         | 0                                                                                                                    |
| min_supported_compute_capability | 6.0                                                                                                                  |
| strict_readiness                 | 1                                                                                                                    |
| exit_timeout                     | 30                                                                                                                   |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------+

I1101 17:05:50.153486 1 server.cc:260] Waiting for in-flight requests to complete.
I1101 17:05:50.153492 1 server.cc:276] Timeout 30: Found 0 model versions that have in-flight inferences
I1101 17:05:50.153498 1 model_repository_manager.cc:1230] unloading: transformer_onnx_tokenize:1
I1101 17:05:50.153544 1 server.cc:291] All models are stopped, unloading models
I1101 17:05:50.153550 1 server.cc:298] Timeout 30: Found 1 live models and 0 in-flight non-inference requests
I1101 17:05:51.153647 1 server.cc:298] Timeout 29: Found 1 live models and 0 in-flight non-inference requests
I1101 17:05:51.241912 1 model_repository_manager.cc:1335] successfully unloaded 'transformer_onnx_tokenize' version 1
I1101 17:05:52.153849 1 server.cc:298] Timeout 28: Found 0 live models and 0 in-flight non-inference requests
error: creating server: Internal - failed to load all models

Setting the output size to -1 in the generated Triton model configuration solves the problem: the ONNX graph declares the output tensor as fully dynamic (shape [-1, -1]), while the generated config.pbtxt pins the second dimension to 768, so Triton refuses to load the model.
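A minimal sketch of the fix, assuming the standard Triton config.pbtxt layout under triton_models/transformer_onnx_model/ (the data_type shown is illustrative; keep whatever convert_model generated):

# triton_models/transformer_onnx_model/config.pbtxt (relevant fragment only)
output {
    name: "output"
    data_type: TYPE_FP32   # illustrative; do not change the generated value
    dims: [ -1, -1 ]       # was [ -1, 768 ]; -1 marks the dimension as dynamic
}

With dims: [ -1, -1 ] the configuration matches the fully dynamic output shape reported by the ONNX graph, transformer_onnx_model loads, and the transformer_onnx_inference ensemble that depends on it can load as well.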

fursovia commented 1 year ago

@ayoub-louati What do you think? :)