Thanks @bbartlett-nv for the question. Just to clarify: it fails when you send texts of different sizes?
If you can provide an end-to-end reproducible example, that would be helpful.
Also, this page describes how to build a request to send to Triton: https://github.com/ELS-RD/transformer-deploy/blob/main/docs/run.md#query-the-inference-server
Thanks for the prompt response! Here is a reproducible example:
1) Run the following from within the ghcr.io/els-rd/transformer-deploy:0.4.0 container:
convert.py -m sentence-transformers/msmarco-distilbert-cos-v5 --backend onnx tensorrt --task embedding --seq-len 2 128 256 --batch-size 2 2 2
Output:
latencies:
[Pytorch (FP32)] mean=4.96ms, sd=0.17ms, min=4.83ms, max=6.84ms, median=4.91ms, 95p=5.18ms, 99p=5.39ms
[Pytorch (FP16)] mean=5.02ms, sd=0.19ms, min=4.86ms, max=7.80ms, median=4.98ms, 95p=5.25ms, 99p=5.44ms
[TensorRT (FP16)] mean=0.87ms, sd=0.19ms, min=0.76ms, max=2.75ms, median=0.77ms, 95p=1.19ms, 99p=1.20ms
[ONNX Runtime (FP32)] mean=1.98ms, sd=0.25ms, min=1.90ms, max=3.58ms, median=1.92ms, 95p=3.02ms, 99p=3.03ms
[ONNX Runtime (optimized)] mean=0.93ms, sd=0.01ms, min=0.91ms, max=1.03ms, median=0.93ms, 95p=0.96ms, 99p=0.97ms
Each inference engine output is within 0.3 tolerance compared to the PyTorch output.
2) Spin up the Triton Inference Server:
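(The same kind of check can be reproduced by hand on raw engine outputs with a max-absolute-difference comparison; a trivial sketch with placeholder arrays, not the tool's actual code:)

import numpy as np

# placeholder embeddings standing in for two engines' outputs on the same input
pytorch_out = np.array([[-0.0071, 0.0112, -0.0173]])
onnx_out = np.array([[-0.0070, 0.0113, -0.0175]])
print(np.max(np.abs(pytorch_out - onnx_out)) < 0.3)  # True -> within the 0.3 tolerance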
docker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size=16g -v $PWD/triton_models:/models els-rd/transformer:tritron-infer bash -c "tritonserver --model-repository=/models"
Output:
I0729 14:54:21.752279 1 grpc_server.cc:4195] Started GRPCInferenceService at 0.0.0.0:8001
I0729 14:54:21.752521 1 http_server.cc:2857] Started HTTPService at 0.0.0.0:8000
I0729 14:54:21.794053 1 http_server.cc:167] Started Metrics Service at 0.0.0.0:8002
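(Before querying, the client library can confirm that the server and model are ready; a quick sketch assuming the same URL and model name used in step 3 below:)

import tritonclient.http

client = tritonclient.http.InferenceServerClient(url="127.0.0.1:8000")
print(client.is_server_ready())                             # True once startup has finished
print(client.is_model_ready("transformer_onnx_inference"))  # True once the ensemble is loaded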
3) Query Triton with a simple HTTP request script (based on "triton_client.py" in the infinity demo):
import numpy as np
import tritonclient.http

if __name__ == "__main__":
    model_name = "transformer_onnx_inference"
    url = "127.0.0.1:8000"
    model_version = "1"
    batch_size = 2
    text = ["some text", "some text of different size"]
    triton_client = tritonclient.http.InferenceServerClient(url=url, verbose=False)
    query = tritonclient.http.InferInput(name="TEXT", shape=(batch_size,), datatype="BYTES")
    model_score = tritonclient.http.InferRequestedOutput(name="output", binary_data=False)
    time_buffer = list()
    query.set_data_from_numpy(np.asarray(text, dtype=object))
    response = triton_client.infer(
        model_name=model_name, model_version=model_version, inputs=[query], outputs=[model_score]
    )
    print(response.as_numpy("output"))
Python Script Output:
Traceback (most recent call last):
File "/root/Code/faiss_sandbox/triton/triton_test.py", line 23, in <module>
response = triton_client.infer(
File "/opt/conda/lib/python3.8/site-packages/tritonclient/http/__init__.py", line 1418, in infer
_raise_if_error(response)
File "/opt/conda/lib/python3.8/site-packages/tritonclient/http/__init__.py", line 65, in _raise_if_error
raise error
tritonclient.utils.InferenceServerException: in ensemble 'transformer_onnx_inference', Failed to process the request(s) for model instance 'transformer_onnx_tokenize_0', message: ValueError: setting an array element with a sequence.
At:
/models/transformer_onnx_tokenize/1/model.py(60): <dictcomp>
/models/transformer_onnx_tokenize/1/model.py(60): execute
Triton Server Output:
/usr/local/lib/python3.8/dist-packages/transformers/tokenization_utils_base.py:707: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
tensor = as_tensor(value)
0729 14:59:06.143282 360 pb_stub.cc:419] Failed to process the request(s) for model 'transformer_onnx_tokenize_0', message: ValueError: setting an array element with a sequence.
At:
/models/transformer_onnx_tokenize/1/model.py(60): <dictcomp>
/models/transformer_onnx_tokenize/1/model.py(60): execute
Please note that if the variable "text" in the above Python script is set to ["some text", "some text"], the request completes as expected:
Output:
[[-0.00714493 0.01116943 -0.01734924 ... -0.01394653 -0.03530884
0.05587769]
[-0.00714493 0.01116943 -0.01734924 ... -0.01394653 -0.03530884
0.05587769]]
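For reference, the same ValueError can be reproduced outside Triton with just the tokenizer and numpy (a minimal sketch, assuming the same checkpoint; not necessarily the exact code the generated model.py runs):

import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/msmarco-distilbert-cos-v5")

# equal-length texts -> rectangular token lists, conversion works
same = tokenizer(["some text", "some text"])["input_ids"]
print(np.asarray(same, dtype=np.int64).shape)

# different-length texts -> ragged token lists, conversion fails
ragged = tokenizer(["some text", "some text of different size"])["input_ids"]
np.asarray(ragged, dtype=np.int64)  # ValueError: setting an array element with a sequence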
It appears this is caused by a change of behavior in the HF tokenizer: padding is now disabled by default.
To fix it, modify the generated tokenizer script and add padding=True where the tokenizer is called:
tokens = self.tokenizer(text=query, return_tensors=TensorType.NUMPY, padding=True)
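For context, the relevant part of the generated tokenizer model.py ends up looking roughly like the sketch below (approximate: the Triton Python-backend boilerplate and names are illustrative, only the padding=True argument is the actual change):

import numpy as np
import triton_python_backend_utils as pb_utils
from transformers import AutoTokenizer, TensorType

class TritonPythonModel:
    def initialize(self, args):
        # checkpoint path is illustrative
        self.tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/msmarco-distilbert-cos-v5")

    def execute(self, requests):
        responses = []
        for request in requests:
            raw = pb_utils.get_input_tensor_by_name(request, "TEXT").as_numpy()
            query = [t.decode("utf-8") for t in raw.tolist()]
            # padding=True keeps the batch rectangular; without it, texts of different
            # lengths yield ragged arrays and the conversion below raises
            # "setting an array element with a sequence"
            tokens = self.tokenizer(text=query, return_tensors=TensorType.NUMPY, padding=True)
            tensors = {name: pb_utils.Tensor(name, np.asarray(arr)) for name, arr in tokens.items()}
            responses.append(pb_utils.InferenceResponse(output_tensors=list(tensors.values())))
        return responses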
Will push a fix in the updated version of this package.
This worked well, thanks!
When calling the embedding task, e.g. in the infinity demos, a batch is created via "[text] * batch_size", which works. However, this simply duplicates the same text "batch_size" times.
If a batch is created containing strings of different sizes:
input = grpcclient.InferInput(name="TEXT", shape=(batch_size,), datatype="BYTES")
input.set_data_from_numpy(np.asarray(["some text", "some longer text"], dtype=object))
Triton errors out with:
I0727 15:40:43.721558 1 grpc_server.cc:4195] Started GRPCInferenceService at 0.0.0.0:8001
I0727 15:40:43.721777 1 http_server.cc:2857] Started HTTPService at 0.0.0.0:8000
I0727 15:40:43.763422 1 http_server.cc:167] Started Metrics Service at 0.0.0.0:8002
/usr/local/lib/python3.8/dist-packages/transformers/tokenization_utils_base.py:707: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
tensor = as_tensor(value)
0727 15:40:46.686280 218 pb_stub.cc:419] Failed to process the request(s) for model 'transformer_tensorrt_tokenize_0', message: ValueError: setting an array element with a sequence.
At:
/models/transformer_tensorrt_tokenize/1/model.py(60): <dictcomp>
/models/transformer_tensorrt_tokenize/1/model.py(60): execute