Open AmazDeng opened 1 week ago
What is your TRT inference code?
The inference code looks like this. I didn't set up any multi-process or multi-threaded inference in the code:
import open_clip
from openclip_trt.tensorrt_utils import TensorRTModel
txt_trt_model_path="/media/star/8T/model/clip/open_clip/tensorrt-8.6.1/a100/trt/python/batch_dynamic/ViT-bigG-14.txt.fp32.trt.engine"
txt_trt_model = TensorRTModel(txt_trt_model_path)
texts = [
"NVIDIA TensorRT is an SDK that facilitates high-performance machine learning inference",
"It is designed to work in a complementary fashion with training frameworks such as TensorFlow, PyTorch, and MXNet.",
"It focuses specifically on running an already-trained network quickly and efficiently on NVIDIA hardware.",
"hello world",
"Xi Jinping Thought on Culture reveals the outstanding characteristics of Chinese civilization and discusses the theories, principles and philosophy of cultural exchanges.",
"According to Xi Jinping Thought on Culture, civilizational exchanges can transcend barriers and conflicts, and inter-civilizational interactions can boost the harmonious development of civilizations",
"No civilization can exist independently, or by refusing to interact with other civilizations",
"The coexistence of and exchanges between civilizations are the norm, with all civilizations moving toward a harmonious future.",
"Marxism reveals the characteristics of human civilization.",
"Science and technology play a fundamental role in transforming agriculture and enhancing food security",
"Sun said small-scale farming is common in both China and many African countries",
"The academy cooperates with 23 African countries and nine international organizations",
"By helping to build biogas facilities and conduct technology demonstrations in countries such as Tanzania, Mauritania and Angola, the academy has supported the adoption of renewable energy sources and promoted resource efficiency in agricultural production.",
"Rather than looking to other regions with different contexts, it would be more beneficial for African nations to glean insights and experience from China's journey, given the shared historical challenges and the success China has achieved"
]
tokenizer = open_clip.get_tokenizer("ViT-bigG-14")
texts_token=tokenizer(texts).cuda()
trt_text_features = txt_trt_model(inputs={'text': texts_token})['unnorm_text_features']
TensorRTModel
import os
from typing import Dict, List

import tensorrt as trt
import torch

# TRT_LOGGER, get_binding_idxs, get_output_tensors and track_infer_time
# are used below but are not shown in this snippet.


class TensorRTModel(object):
    def __init__(self, engine_path):
        print(f'load engine_path is {engine_path}')
        self.engine = self.load_engine(engine_path)
        assert self.engine
        profile_index = 0
        self.context = self.engine.create_execution_context()
        self.context.set_optimization_profile_async(
            profile_index=profile_index, stream_handle=torch.cuda.current_stream().cuda_stream
        )
        self.input_binding_idxs, self.output_binding_idxs = get_binding_idxs(self.engine, profile_index)

    def load_engine(self, engine_file_path):
        assert os.path.exists(engine_file_path)
        print("Reading engine from file {}".format(engine_file_path))
        with open(engine_file_path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
            return runtime.deserialize_cuda_engine(f.read())
        # with open(engine_file_path, "rb") as f, trt.Runtime(trt.Logger(trt.Logger.ERROR)) as runtime:
        #     engine = runtime.deserialize_cuda_engine(f.read())
        #     return engine

    def __call__(self, inputs, time_buffer=None):
        # collect the input tensors in binding order
        input_tensors: List[torch.Tensor] = list()
        for i in range(self.context.engine.num_bindings):
            if not self.context.engine.binding_is_input(index=i):
                continue
            tensor_name = self.context.engine.get_binding_name(i)
            assert tensor_name in inputs, f"input not provided: {tensor_name}"
            tensor = inputs[tensor_name]
            assert isinstance(tensor, torch.Tensor), f"unexpected tensor class: {type(tensor)}"
            assert tensor.device.type == "cuda", f"unexpected device type (trt only works on CUDA): {tensor.device.type}"
            # warning: small changes in output if int64 is used instead of int32
            if tensor.dtype in [torch.int64, torch.long]:
                # logging.warning(f"using {tensor.dtype} instead of int32 for {tensor_name}, will be casted to int32")
                tensor = tensor.type(torch.int32)
            input_tensors.append(tensor)
        # calculate input shape, bind it, allocate GPU memory for the output
        outputs: Dict[str, torch.Tensor] = get_output_tensors(
            self.context, input_tensors, self.input_binding_idxs, self.output_binding_idxs
        )
        bindings = [int(i.data_ptr()) for i in input_tensors + list(outputs.values())]
        if time_buffer is None:
            self.context.execute_v2(bindings=bindings)
        else:
            with track_infer_time(time_buffer):
                self.context.execute_v2(bindings=bindings)
        torch.cuda.current_stream().synchronize()  # sync all CUDA ops
        return outputs
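The class above calls `track_infer_time(time_buffer)` without showing it. As a rough sketch only, assuming the helper is a context manager that appends the elapsed wall-clock time in seconds to the list passed in (the real helper in `openclip_trt.tensorrt_utils` may differ), it could look like this:

```python
import time
from contextlib import contextmanager
from typing import Iterator, List


@contextmanager
def track_infer_time(buffer: List[float]) -> Iterator[None]:
    """Assumed behavior: append the duration of the with-block, in seconds, to buffer."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Record the elapsed time even if the wrapped call raises.
        buffer.append(time.perf_counter() - start)
```

With a helper like this, passing `time_buffer=[]` to `__call__` would collect one timing entry per inference call.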
If all your requests are sent to a single process (including the TRT inference), there is no problem.
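If the service handles requests on several threads inside that one process, one way to keep inference calls from overlapping is to guard the model with a lock, since a TensorRT execution context is meant to be used by one thread at a time. A minimal sketch, assuming a hypothetical `SerializedModel` wrapper around any callable such as the `TensorRTModel` instance above; this is not code from the issue:

```python
import threading
from typing import Any, Callable


class SerializedModel:
    """Hypothetical wrapper: lets only one thread call the wrapped model at a time."""

    def __init__(self, model: Callable[..., Any]) -> None:
        self._model = model
        self._lock = threading.Lock()

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        # Hold the lock for the whole call so the wrapped model
        # (and its single execution context) is never entered concurrently.
        with self._lock:
            return self._model(*args, **kwargs)
```

Under these assumptions, usage would be `txt_trt_model = SerializedModel(TensorRTModel(txt_trt_model_path))`, and the call site `txt_trt_model(inputs={'text': texts_token})` stays unchanged. The trade-off is that requests are processed one at a time; creating one execution context per thread is an alternative this sketch does not cover.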
Description
I compiled the image part of the open_clip model (a PyTorch model, https://github.com/mlfoundations/open_clip) in a Python environment using TensorRT 8.6.1 and obtained an engine. I then developed a service that loads the TensorRT engine, accepts HTTP POST requests, performs inference, and returns the results. The service is written in Python, not C++. Here are the phenomena I observed:
Environment
TensorRT Version: python==3.8, tensorrt==8.6.1
NVIDIA GPU: A100, 80 GB
NVIDIA Driver Version: 535.54.03
CUDA Version: 12.2
CUDNN Version:
Operating System: Ubuntu 20.04
Python Version (if applicable): 3.8
Tensorflow Version (if applicable):
PyTorch Version (if applicable): 1.13.1+cu116
Baremetal or Container (if so, version):
Relevant Files
Model link: https://github.com/mlfoundations/open_clip
Steps To Reproduce
Commands or scripts:
Have you tried the latest release?: No
Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt):