NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

Concurrent inference failure with TensorRT 8.6.1 when running an open_clip visual model TensorRT engine on an A100 GPU #3967

Open AmazDeng opened 1 week ago

AmazDeng commented 1 week ago

Description

I compiled the image part of the open_clip model (a PyTorch model, https://github.com/mlfoundations/open_clip) in a Python environment using TensorRT 8.6.1 and obtained an engine. Then I developed a service that loads the TensorRT engine, accepts HTTP POST requests, performs inference, and returns results. The service is written in Python, not C++. Here are the phenomena I observed:

  1. When I send only one request at a time, the service functions normally, and the model can infer and return results correctly.
  2. When I send 5 requests at the same time (concurrent requests via a Python process pool), the model errors out. From what I've read, it seems that the TensorRT engine is not thread-safe during concurrent inference. What should I do to enable the model to support concurrent requests? (One possible pattern is sketched below.)
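
Since a single TensorRT execution context cannot run inference from multiple threads at the same time, one common workaround is to keep one engine/context per process and serialize access to it with a lock. The following is a minimal sketch, not the service's actual code: it assumes the HTTP server dispatches requests to worker threads inside one process, handle_request and its arguments are hypothetical, and TensorRTModel is the wrapper class posted later in this thread.

import threading

_model_lock = threading.Lock()
_model = None  # lazily created TensorRTModel instance shared by all handler threads

def handle_request(engine_path: str, inputs: dict):
    """Hypothetical per-request handler: inputs maps binding names to CUDA tensors."""
    global _model
    with _model_lock:  # only one thread may touch the engine/context at a time
        if _model is None:
            _model = TensorRTModel(engine_path)
        outputs = _model(inputs=inputs)
    # copy results to host before handing them back to the HTTP layer
    return {k: v.cpu().numpy() for k, v in outputs.items()}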


# post_request_process.py
from multiprocessing import Pool
from typing import Dict, List
from tqdm import tqdm
import json
import requests

def post_req(param_dict: Dict):
    url = param_dict["url"]
    json_data = param_dict["json_data"]
    headers = param_dict["headers"]
    res = requests.post(url, headers=headers, json=json_data).text
    return res

def multi_process(url: str, json_data: Dict, headers: Dict[str, str]):
    param_list = [{"url": url, "json_data": json_data, "headers": headers}] * 10

    pool = Pool(processes=5)
    tqdm_kwargs = dict(total=len(param_list), desc='cal video total time')

    res_list: List = []
    for res in tqdm(pool.imap_unordered(post_req, param_list), **tqdm_kwargs):
        res_list.append(res)
    pool.close()
    pool.join()
    return res_list

url = 'http://192.168.0.198:8001'
with open('/home/dengxiaoyu/PycharmProjects/rxzn/eas_demo/open_clip_trt_img/tests/img.json', 'r') as file:
    data = json.load(file)

headers={"content-type": "application/json"}
res_list = multi_process(url, data, headers)

error_cnt = 0
for i, res in enumerate(res_list):
    print(f"res={res}")

Environment

TensorRT Version: tensorrt==8.6.1 (Python 3.8)

NVIDIA GPU: A100, 80 GB

NVIDIA Driver Version: 535.54.03

CUDA Version: 12.2

CUDNN Version:

Operating System: Ubuntu 20.04

Python Version (if applicable): 3.8

Tensorflow Version (if applicable):

PyTorch Version (if applicable): 1.13.1+cu116

Baremetal or Container (if so, version):

Relevant Files

Model link: https://github.com/mlfoundations/open_clip

Steps To Reproduce

Commands or scripts:

Have you tried the latest release?: No

Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt):

lix19937 commented 6 days ago

What is your TRT infer code?

AmazDeng commented 5 days ago

What is your TRT infer code?

The inference code is like this; I didn't set up any multi-process or multi-threaded inference operations in it:

import open_clip  # provides the get_tokenizer call used below

from openclip_trt.tensorrt_utils import TensorRTModel

txt_trt_model_path = "/media/star/8T/model/clip/open_clip/tensorrt-8.6.1/a100/trt/python/batch_dynamic/ViT-bigG-14.txt.fp32.trt.engine"

txt_trt_model = TensorRTModel(txt_trt_model_path)
texts = [
                  "NVIDIA TensorRT is an SDK that facilitates high-performance machine learning inference",
                  "It is designed to work in a complementary fashion with training frameworks such as TensorFlow, PyTorch, and MXNet.",
                  "It focuses specifically on running an already-trained network quickly and efficiently on NVIDIA hardware.",
                  "hello world",
                  "Xi Jinping Thou  ght on Culture reveals the outstanding characteristics of Chinese civilization and discusses the theories, principles and philosophy of cultural exchanges.",
                  "According to Xi Jinping Thought on Culture, civilizational exchanges can transcend barriers and conflicts, and inter-civilizational interactions can boost the harmonious development of civilizations",
                  "No civilization can exist independently, or by refusing to interact with other civilizations",
                  "The coexistence of and exchanges between civilizations are the norm, with all civilizations moving toward a harmonious future.",
                  "Marxism reveals the characteristics of human civilization.",
                  "Science and technology play a fundamental role in transforming agriculture and enhancing food security",
                  "Sun said small-scale farming is common in both China and many African countries",
                  "The academy cooperates with 23 African countries and nine international organizations",
                  "By helping to build biogas facilities and conduct technology demonstrations in countries such as Tanzania, Mauritania and Angola, the academy has supported the adoption of renewable energy sources and promoted resource efficiency in agricultural production.",
                  "Rather than looking to other regions with different contexts, it would be more beneficial for African nations to glean insights and experience from China's journey, given the shared historical challenges and the success China has achieved"
        ]

tokenizer = open_clip.get_tokenizer("ViT-bigG-14")
texts_token = tokenizer(texts).cuda()

trt_text_features = txt_trt_model(inputs={'text': texts_token})['unnorm_text_features']

TensorRTModel


import os
from typing import Dict, List

import tensorrt as trt
import torch

# get_binding_idxs, get_output_tensors, track_infer_time and TRT_LOGGER are assumed
# to be defined in the same openclip_trt.tensorrt_utils module (not shown in this thread).


class TensorRTModel(object):
    def __init__(self, engine_path):
        print(f'load engine_path is {engine_path}')
        self.engine = self.load_engine(engine_path)
        assert self.engine
        profile_index = 0
        self.context = self.engine.create_execution_context()
        self.context.set_optimization_profile_async(
            profile_index=profile_index, stream_handle=torch.cuda.current_stream().cuda_stream
        )
        self.input_binding_idxs, self.output_binding_idxs = get_binding_idxs(self.engine, profile_index)

    def load_engine(self, engine_file_path):
        assert os.path.exists(engine_file_path)
        print("Reading engine from file {}".format(engine_file_path))
        with open(engine_file_path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
            return runtime.deserialize_cuda_engine(f.read())
        # with open(engine_file_path, "rb") as f, trt.Runtime(trt.Logger(trt.Logger.ERROR)) as runtime:
        #     engine = runtime.deserialize_cuda_engine(f.read())
        #     return engine

    def __call__(self, inputs, time_buffer=None):
        input_tensors: List[torch.Tensor] = list()
        for i in range(self.context.engine.num_bindings):
            if not self.context.engine.binding_is_input(index=i):
                continue
            tensor_name = self.context.engine.get_binding_name(i)
            assert tensor_name in inputs, f"input not provided: {tensor_name}"
            tensor = inputs[tensor_name]
            assert isinstance(tensor, torch.Tensor), f"unexpected tensor class: {type(tensor)}"
            assert tensor.device.type == "cuda", f"unexpected device type (trt only works on CUDA): {tensor.device.type}"
            # warning: small changes in output if int64 is used instead of int32
            if tensor.dtype in [torch.int64, torch.long]:
                # logging.warning(f"using {tensor.dtype} instead of int32 for {tensor_name}, will be casted to int32")
                tensor = tensor.type(torch.int32)
            input_tensors.append(tensor)

        # calculate input shape, bind it, allocate GPU memory for the output
        outputs: Dict[str, torch.Tensor] = get_output_tensors(
            self.context, input_tensors, self.input_binding_idxs, self.output_binding_idxs
        )
        bindings = [int(i.data_ptr()) for i in input_tensors + list(outputs.values())]
        if time_buffer is None:
            self.context.execute_v2(bindings=bindings)
        else:
            with track_infer_time(time_buffer):
                self.context.execute_v2(bindings=bindings)

        torch.cuda.current_stream().synchronize()  # sync all CUDA ops

        return outputs
lix19937 commented 4 days ago

If all of your requests are sent to one process (including the TRT inference), there is no problem.
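
To illustrate this suggestion (a sketch, not code from this thread): keep a single process that owns the engine and feed it through queues, so the HTTP workers never call TensorRT themselves. TensorRTModel is the wrapper class posted above; the queue protocol, function names, and the engine path below are assumptions.

import multiprocessing as mp
import torch

def inference_worker(engine_path, request_q, response_q):
    # The engine and its execution context are created and used only in this process.
    model = TensorRTModel(engine_path)  # wrapper class shown earlier in this thread
    while True:
        item = request_q.get()
        if item is None:  # shutdown sentinel
            break
        req_id, host_inputs = item  # host_inputs: binding name -> numpy/int array
        inputs = {k: torch.as_tensor(v).cuda() for k, v in host_inputs.items()}
        outputs = model(inputs=inputs)
        response_q.put((req_id, {k: v.cpu().numpy() for k, v in outputs.items()}))

if __name__ == "__main__":
    request_q, response_q = mp.Queue(), mp.Queue()
    worker = mp.Process(target=inference_worker,
                        args=("/path/to/visual.engine", request_q, response_q))  # hypothetical path
    worker.start()
    # The HTTP layer (any number of threads/processes) only enqueues (req_id, host_inputs)
    # on request_q and waits for the matching req_id on response_q; it never touches TensorRT.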