ELS-RD / transformer-deploy

Efficient, scalable and enterprise-grade CPU/GPU inference server for 🤗 Hugging Face transformer models 🚀
https://els-rd.github.io/transformer-deploy/
Apache License 2.0

Token type ids bug #154

Closed · fursovia closed this issue 1 year ago

fursovia commented 1 year ago

Some models don't use token_type_ids in the forward pass. For example, DeBERTa has type_vocab_size=0 as its default value.

What happens is the model ignores token_type_ids (https://github.com/huggingface/transformers/blob/bac2d29a802803a7f2db8e8597a2ec81730afcc9/src/transformers/models/deberta/modeling_deberta.py#L810)

However, the tokenizer doesn't know about this, and token_type_ids is still listed in tokenizer.model_input_names.
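A quick way to see the mismatch is to compare what the config and the tokenizer report for this checkpoint. This is a minimal illustration only; the exact printed values depend on the installed transformers version:

from transformers import AutoConfig, AutoTokenizer

model_name = "microsoft/deberta-base-mnli"
config = AutoConfig.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# type_vocab_size is 0 for this checkpoint, so the model never embeds token types...
print(config.type_vocab_size)
# ...yet the tokenizer still advertises token_type_ids as a model input...
print(tokenizer.model_input_names)
# ...and keeps emitting it in its encodings.
encoded = tokenizer("a short sentence", return_tensors="np")
print(list(encoded.keys()))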

This mismatch leads to the failure below. To reproduce, convert the model and then start the Triton server:

docker run -it --rm --gpus all \
  -v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.5.3 \
  bash -c "cd /project && \
    convert_model -m \"microsoft/deberta-base-mnli\" \
    --backend onnx \
    --seq-len 16 128 128"

docker run -itd --rm --gpus '"device=3"' -p8000:8000 --shm-size 256m \
  -v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.07-py3 \
  bash -c "pip install transformers && tritonserver --model-repository=/models"

The Triton Inference Server then fails with:

I1123 13:49:09.821427 1 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7fbf36000000' with size 268435456
I1123 13:49:09.821983 1 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I1123 13:49:09.828017 1 model_repository_manager.cc:1206] loading: transformer_onnx_tokenize:1
I1123 13:49:09.828058 1 model_repository_manager.cc:1206] loading: transformer_onnx_model:1
I1123 13:49:09.830743 1 onnxruntime.cc:2458] TRITONBACKEND_Initialize: onnxruntime
I1123 13:49:09.830786 1 onnxruntime.cc:2468] Triton TRITONBACKEND API version: 1.10
I1123 13:49:09.830804 1 onnxruntime.cc:2474] 'onnxruntime' TRITONBACKEND API version: 1.10
I1123 13:49:09.830814 1 onnxruntime.cc:2504] backend configuration:
{"cmdline":{"auto-complete-config":"true","min-compute-capability":"6.000000","backend-directory":"/opt/tritonserver/backends","default-max-batch-size":"4"}}
I1123 13:49:09.846110 1 onnxruntime.cc:2560] TRITONBACKEND_ModelInitialize: transformer_onnx_model (version 1)
I1123 13:49:09.847111 1 onnxruntime.cc:666] skipping model configuration auto-complete for 'transformer_onnx_model': inputs and outputs already specified
I1123 13:49:09.851839 1 onnxruntime.cc:2603] TRITONBACKEND_ModelInstanceInitialize: transformer_onnx_model_0 (GPU device 0)
I1123 13:49:12.063610 1 onnxruntime.cc:2637] TRITONBACKEND_ModelInstanceFinalize: delete instance state
I1123 13:49:12.063688 1 onnxruntime.cc:2583] TRITONBACKEND_ModelFinalize: delete model state
E1123 13:49:12.063708 1 model_repository_manager.cc:1355] failed to load 'transformer_onnx_model' version 1: Invalid argument: unable to load model 'transformer_onnx_model', configuration expects 3 inputs, model provides 2
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
I1123 13:49:13.744756 1 python_be.cc:1767] TRITONBACKEND_ModelInstanceInitialize: transformer_onnx_tokenize_0 (GPU device 0)
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
I1123 13:49:14.298233 1 model_repository_manager.cc:1352] successfully loaded 'transformer_onnx_tokenize' version 1
E1123 13:49:14.298380 1 model_repository_manager.cc:1559] Invalid argument: ensemble 'transformer_onnx_inference' depends on 'transformer_onnx_model' which has no loaded version
I1123 13:49:14.298438 1 server.cc:559]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I1123 13:49:14.298487 1 server.cc:586]
+-------------+----------------------------------------------------------------+----------------------------------------------------------------+
| Backend     | Path                                                           | Config                                                         |
+-------------+----------------------------------------------------------------+----------------------------------------------------------------+
| onnxruntime | /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.s | {"cmdline":{"auto-complete-config":"true","min-compute-capabil |
|             | o                                                              | ity":"6.000000","backend-directory":"/opt/tritonserver/backend |
|             |                                                                | s","default-max-batch-size":"4"}}                              |
|             |                                                                |                                                                |
| python      | /opt/tritonserver/backends/python/libtriton_python.so          | {"cmdline":{"auto-complete-config":"true","min-compute-capabil |
|             |                                                                | ity":"6.000000","backend-directory":"/opt/tritonserver/backend |
|             |                                                                | s","default-max-batch-size":"4"}}                              |
+-------------+----------------------------------------------------------------+----------------------------------------------------------------+

I1123 13:49:14.298549 1 server.cc:629]
+---------------------------+---------+---------------------------------------------------------------------------------------------------------+
| Model                     | Version | Status                                                                                                  |
+---------------------------+---------+---------------------------------------------------------------------------------------------------------+
| transformer_onnx_model    | 1       | UNAVAILABLE: Invalid argument: unable to load model 'transformer_onnx_model', configuration expects 3 i |
|                           |         | nputs, model provides 2                                                                                 |
| transformer_onnx_tokenize | 1       | READY                                                                                                   |
+---------------------------+---------+---------------------------------------------------------------------------------------------------------+

I1123 13:49:14.351997 1 metrics.cc:650] Collecting metrics for GPU 0: NVIDIA GeForce RTX 3090
I1123 13:49:14.352405 1 tritonserver.cc:2176]
+----------------------------------+------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                      |
+----------------------------------+------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                     |
| server_version                   | 2.24.0                                                                                                     |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configu |
|                                  | ration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace                         |
| model_repository_path[0]         | /models                                                                                                    |
| model_control_mode               | MODE_NONE                                                                                                  |
| strict_model_config              | 0                                                                                                          |
| rate_limit                       | OFF                                                                                                        |
| pinned_memory_pool_byte_size     | 268435456                                                                                                  |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                   |
| response_cache_byte_size         | 0                                                                                                          |
| min_supported_compute_capability | 6.0                                                                                                        |
| strict_readiness                 | 1                                                                                                          |
| exit_timeout                     | 30                                                                                                         |
+----------------------------------+------------------------------------------------------------------------------------------------------------+

I1123 13:49:14.352443 1 server.cc:260] Waiting for in-flight requests to complete.
I1123 13:49:14.352453 1 server.cc:276] Timeout 30: Found 0 model versions that have in-flight inferences
I1123 13:49:14.352460 1 model_repository_manager.cc:1230] unloading: transformer_onnx_tokenize:1
I1123 13:49:14.352525 1 server.cc:291] All models are stopped, unloading models
I1123 13:49:14.352534 1 server.cc:298] Timeout 30: Found 1 live models and 0 in-flight non-inference requests
I1123 13:49:15.352620 1 server.cc:298] Timeout 29: Found 1 live models and 0 in-flight non-inference requests
I1123 13:49:15.444143 1 model_repository_manager.cc:1335] successfully unloaded 'transformer_onnx_tokenize' version 1
I1123 13:49:16.352790 1 server.cc:298] Timeout 28: Found 0 live models and 0 in-flight non-inference requests
error: creating server: Internal - failed to load all models

The proposed solution fixes this bug.
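The underlying idea is to feed the model only the tokenizer outputs that the exported graph actually declares as inputs. A minimal Python sketch of that filtering, run outside Triton (the ONNX path and test sentence below are assumptions for illustration, not taken from the PR):

import numpy as np
import onnxruntime
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base-mnli")

# Assumed location of the exported graph in the Triton repository created above.
session = onnxruntime.InferenceSession(
    "triton_models/transformer_onnx_model/1/model.onnx",
    providers=["CPUExecutionProvider"],
)

encoded = tokenizer("The cat sat on the mat.", return_tensors="np")

# Keep only the tensors the graph declares (token_type_ids is dropped for DeBERTa),
# casting each one to the integer dtype its graph input expects.
dtype_map = {"tensor(int64)": np.int64, "tensor(int32)": np.int32}
feed = {
    graph_input.name: encoded[graph_input.name].astype(dtype_map.get(graph_input.type, np.int64))
    for graph_input in session.get_inputs()
    if graph_input.name in encoded
}
outputs = session.run(None, feed)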

fursovia commented 1 year ago

@ayoub-louati please take a look)

ayoub-louati commented 1 year ago

Sorry for being late, I'll test it very soon. @fursovia Thanks for this PR.

ayoub-louati commented 1 year ago

@fursovia Thank you very much for this fix, and I'm really sorry for being late reviewing your PR; I was on another project.