ELS-RD / transformer-deploy

Efficient, scalable and enterprise-grade CPU/GPU inference server for 🤗 Hugging Face transformer models 🚀
https://els-rd.github.io/transformer-deploy/
Apache License 2.0

embedding task fails with batch_size > 1 given variable length inputs #117

Closed bbartlett-nv closed 2 years ago

bbartlett-nv commented 2 years ago

When calling the embedding task, e.g. in the infinity demos, a batch is created with "[text] * batch_size", which works. However, this simply duplicates the same text "batch_size" times.

If a batch is created containing strings of different sizes, e.g.:

    input = grpcclient.InferInput(name="TEXT", shape=(batch_size,), datatype="BYTES")
    input.set_data_from_numpy(np.asarray(["some text", "some longer text"], dtype=object))

Triton errors out with:

    I0727 15:40:43.721558 1 grpc_server.cc:4195] Started GRPCInferenceService at 0.0.0.0:8001
    I0727 15:40:43.721777 1 http_server.cc:2857] Started HTTPService at 0.0.0.0:8000
    I0727 15:40:43.763422 1 http_server.cc:167] Started Metrics Service at 0.0.0.0:8002
    /usr/local/lib/python3.8/dist-packages/transformers/tokenization_utils_base.py:707: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
      tensor = as_tensor(value)
    0727 15:40:46.686280 218 pb_stub.cc:419] Failed to process the request(s) for model 'transformer_tensorrt_tokenize_0', message: ValueError: setting an array element with a sequence.

    At:
      /models/transformer_tensorrt_tokenize/1/model.py(60): <dictcomp>
      /models/transformer_tensorrt_tokenize/1/model.py(60): execute
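
For completeness, a self-contained gRPC client along these lines; this is only a minimal sketch, assuming Triton is reachable on the default gRPC port 8001 and the ensemble is named "transformer_tensorrt_inference" (the actual model name depends on the deployment):

    import numpy as np
    import tritonclient.grpc as grpcclient

    # Assumptions (not from the report above): local Triton on port 8001 and an
    # ensemble model named "transformer_tensorrt_inference".
    client = grpcclient.InferenceServerClient(url="127.0.0.1:8001", verbose=False)

    # Two texts of different lengths: this is the case that fails.
    texts = ["some text", "some longer text"]
    batch_size = len(texts)

    query = grpcclient.InferInput(name="TEXT", shape=(batch_size,), datatype="BYTES")
    query.set_data_from_numpy(np.asarray(texts, dtype=object))
    output = grpcclient.InferRequestedOutput(name="output")

    response = client.infer(
        model_name="transformer_tensorrt_inference",
        model_version="1",
        inputs=[query],
        outputs=[output],
    )
    print(response.as_numpy("output"))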

pommedeterresautee commented 2 years ago

Thanks @bbartlett-nv for the question. Just to clarify: when the texts in a batch have different lengths, it doesn't work?

If you can provide an end-to-end reproducible example, that would be helpful.

Moreover, here is a page explaining how to build the request payload to send to Triton: https://github.com/ELS-RD/transformer-deploy/blob/main/docs/run.md#query-the-inference-server

bbartlett-nv commented 2 years ago

Thanks for the prompt response! I will try to provide a reproducible example:

1) Run the following from within ghcr.io/els-rd/transformer-deploy:0.4.0:

    convert.py -m sentence-transformers/msmarco-distilbert-cos-v5 --backend onnx tensorrt --task embedding --seq-len 2 128 256 --batch-size 2 2 2

Output:

    latencies:
    [Pytorch (FP32)] mean=4.96ms, sd=0.17ms, min=4.83ms, max=6.84ms, median=4.91ms, 95p=5.18ms, 99p=5.39ms
    [Pytorch (FP16)] mean=5.02ms, sd=0.19ms, min=4.86ms, max=7.80ms, median=4.98ms, 95p=5.25ms, 99p=5.44ms
    [TensorRT (FP16)] mean=0.87ms, sd=0.19ms, min=0.76ms, max=2.75ms, median=0.77ms, 95p=1.19ms, 99p=1.20ms
    [ONNX Runtime (FP32)] mean=1.98ms, sd=0.25ms, min=1.90ms, max=3.58ms, median=1.92ms, 95p=3.02ms, 99p=3.03ms
    [ONNX Runtime (optimized)] mean=0.93ms, sd=0.01ms, min=0.91ms, max=1.03ms, median=0.93ms, 95p=0.96ms, 99p=0.97ms
    Each infence engine output is within 0.3 tolerance compared to Pytorch output

2) Spin up the Triton inference server:

    docker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size=16g \
      -v $PWD/triton_models:/models els-rd/transformer:tritron-infer \
      bash -c "tritonserver --model-repository=/models"

Output:

    I0729 14:54:21.752279 1 grpc_server.cc:4195] Started GRPCInferenceService at 0.0.0.0:8001
    I0729 14:54:21.752521 1 http_server.cc:2857] Started HTTPService at 0.0.0.0:8000
    I0729 14:54:21.794053 1 http_server.cc:167] Started Metrics Service at 0.0.0.0:8002

3) Query Triton with a simple HTTP request script (based on "triton_client.py" from the infinity demo):

    import numpy as np
    import tritonclient.http

    if __name__ == "__main__":

        model_name = "transformer_onnx_inference"
        url = "127.0.0.1:8000"
        model_version = "1"
        batch_size = 2

        # Two texts of different lengths: this is the failing case.
        text = ["some text", "some text of different size"]

        triton_client = tritonclient.http.InferenceServerClient(url=url, verbose=False)
        query = tritonclient.http.InferInput(name="TEXT", shape=(batch_size,), datatype="BYTES")
        model_score = tritonclient.http.InferRequestedOutput(name="output", binary_data=False)

        query.set_data_from_numpy(np.asarray(text, dtype=object))
        response = triton_client.infer(
            model_name=model_name, model_version=model_version, inputs=[query], outputs=[model_score]
        )

        print(response.as_numpy("output"))

Python Script Output:

    Traceback (most recent call last):
      File "/root/Code/faiss_sandbox/triton/triton_test.py", line 23, in <module>
        response = triton_client.infer(
      File "/opt/conda/lib/python3.8/site-packages/tritonclient/http/__init__.py", line 1418, in infer
        _raise_if_error(response)
      File "/opt/conda/lib/python3.8/site-packages/tritonclient/http/__init__.py", line 65, in _raise_if_error
        raise error
    tritonclient.utils.InferenceServerException: in ensemble 'transformer_onnx_inference', Failed to process the request(s) for model instance 'transformer_onnx_tokenize_0', message: ValueError: setting an array element with a sequence.

    At:
      /models/transformer_onnx_tokenize/1/model.py(60): <dictcomp>
      /models/transformer_onnx_tokenize/1/model.py(60): execute

Triton Server Output:

    /usr/local/lib/python3.8/dist-packages/transformers/tokenization_utils_base.py:707: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
      tensor = as_tensor(value)
    0729 14:59:06.143282 360 pb_stub.cc:419] Failed to process the request(s) for model 'transformer_onnx_tokenize_0', message: ValueError: setting an array element with a sequence.

    At:
      /models/transformer_onnx_tokenize/1/model.py(60): <dictcomp>
      /models/transformer_onnx_tokenize/1/model.py(60): execute

Please note that if the variable "text" in the above Python script is set to ["some text", "some text"], the request completes as expected (a standalone tokenizer sketch illustrating the difference follows the output below):

Output:

    [[-0.00714493  0.01116943 -0.01734924 ... -0.01394653 -0.03530884
       0.05587769]
     [-0.00714493  0.01116943 -0.01734924 ... -0.01394653 -0.03530884
       0.05587769]]
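
For reference, the same behaviour can be reproduced with the tokenizer alone, outside Triton; a minimal sketch, assuming the same sentence-transformers model (whether the ragged case warns or raises depends on the transformers/NumPy versions):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/msmarco-distilbert-cos-v5")

    # Equal-length texts tokenize to a rectangular batch: regular int arrays, no issue.
    same = tokenizer(["some text", "some text"], return_tensors="np")
    print(same["input_ids"].shape)  # (2, seq_len)

    # Texts of different lengths, with padding off, produce a ragged batch. NumPy emits the
    # VisibleDeprecationWarning seen in the server log and returns object-dtype arrays, which
    # the downstream conversion in model.py cannot handle ("setting an array element with a
    # sequence"). Newer NumPy versions raise instead of warning.
    ragged = tokenizer(["some text", "some text of different size"], return_tensors="np")
    print(ragged["input_ids"].dtype)  # object on older NumPy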
pommedeterresautee commented 2 years ago

It appears this is caused by a change of behavior in the HF tokenizer: padding is now disabled by default. To fix it, you need to modify the generated tokenizer script and add padding=True when the tokenizer is called:

    tokens = self.tokenizer(text=query, return_tensors=TensorType.NUMPY, padding=True)
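
For context, here is roughly where that call sits in the generated tokenizer model; a minimal sketch of the Triton Python-backend script, with the structure and the tokenizer path approximated rather than copied from the generated file:

    import numpy as np
    import triton_python_backend_utils as pb_utils
    from transformers import AutoTokenizer, TensorType


    class TritonPythonModel:
        def initialize(self, args):
            # Hypothetical path: the generated script loads the tokenizer saved alongside the model.
            self.tokenizer = AutoTokenizer.from_pretrained("/models/transformer_onnx_tokenize/1/")

        def execute(self, requests):
            responses = []
            for request in requests:
                # TEXT arrives as a BYTES tensor; decode each element to a Python string.
                raw = pb_utils.get_input_tensor_by_name(request, "TEXT").as_numpy()
                query = [t.decode("utf-8") for t in raw]
                # padding=True is the fix: the batch becomes rectangular, so every value in
                # `tokens` is a regular int array instead of a ragged object array.
                tokens = self.tokenizer(text=query, return_tensors=TensorType.NUMPY, padding=True)
                outputs = [pb_utils.Tensor(name, np.asarray(array)) for name, array in tokens.items()]
                responses.append(pb_utils.InferenceResponse(output_tensors=outputs))
            return responses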

Will push a fix in the updated version of this package.

bbartlett-nv commented 2 years ago

This worked well, thanks!