HKUST-Aerial-Robotics / OmniNxt

[IROS 2024 Oral] A Fully Open-source and Compact Aerial Robot with Omnidirectional Visual Perception
https://hkust-aerial-robotics.github.io/OmniNxt/
GNU General Public License v3.0

Multi stream #6

Open ylab604 opened 1 month ago

ylab604 commented 1 month ago

Thank you for the great work!

I have a question about multi-stream.

Is it similar to NVIDIA Triton, or something different?

Jason-xy commented 3 weeks ago

Yes, actually they share the same technology.

ylab604 commented 3 weeks ago

> Yes, actually they share the same technology.

I attempted to use NVIDIA Triton on the Jetson Orin and only achieved a 10% speed improvement. I am curious about how to achieve the threefold speed increase mentioned in the paper.

Jason-xy commented 3 weeks ago

The extent of acceleration depends on your specific task; not every scenario benefits. For best practices in using GPUs, you can refer to NVIDIA's documentation.

To illustrate with an example from our paper, the previous method involved sequentially executing depth estimation inference tasks on four virtual stereo images. In this case, it required four sequential inferences to obtain a complete depth map for one frame. With the introduction of multi-stream processing, images are fed continuously into the depth estimation task without waiting for the previous image to be processed. This can be compared to the operation of a CPU instruction pipeline, though they are not entirely the same. As a result, this approach completes the processing of four images in less than the time it previously took to process 1.5 images.
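For reference, here is a minimal PyCUDA + TensorRT sketch of that pattern, not the exact OmniNxt code: `model.trt`, the binding order (input at index 0, output at index 1), and FP32 I/O are placeholder assumptions. Each image gets its own execution context, CUDA stream, and buffers, so the copies and inferences can be enqueued back to back instead of waiting for each other.

```python
import numpy as np
import pycuda.autoinit  # creates the CUDA context
import pycuda.driver as cuda
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Placeholder engine file; deserialize it once and share it across contexts.
with open("model.trt", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

NUM_STREAMS = 4  # one virtual stereo pair per stream

# One execution context, stream, and buffer set per image, so the four
# inferences can overlap on the GPU instead of running sequentially.
contexts, streams, host_in, dev_in, host_out, dev_out, bindings = ([] for _ in range(7))
for _ in range(NUM_STREAMS):
    contexts.append(engine.create_execution_context())
    streams.append(cuda.Stream())
    # Assumes binding 0 is the input and binding 1 is the output (model dependent).
    h_in = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(0)), np.float32)
    h_out = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(1)), np.float32)
    d_in, d_out = cuda.mem_alloc(h_in.nbytes), cuda.mem_alloc(h_out.nbytes)
    host_in.append(h_in)
    host_out.append(h_out)
    dev_in.append(d_in)
    dev_out.append(d_out)
    bindings.append([int(d_in), int(d_out)])


def infer_batch(images):
    """images: list of NUM_STREAMS preprocessed float32 arrays, one per stereo pair."""
    for i, img in enumerate(images):
        np.copyto(host_in[i], img.ravel())
        # Enqueue copy-in, inference, and copy-out on stream i without waiting
        # for the previous image; synchronization happens only at the end.
        cuda.memcpy_htod_async(dev_in[i], host_in[i], streams[i])
        contexts[i].execute_async_v2(bindings=bindings[i], stream_handle=streams[i].handle)
        cuda.memcpy_dtoh_async(host_out[i], dev_out[i], streams[i])
    # Only now wait for all streams to finish.
    for s in streams:
        s.synchronize()
    return [out.copy() for out in host_out]
```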

If you want to understand the principles of acceleration more clearly, you can use NVIDIA Nsight to analyze the resource utilization timelines for both methods. This will help you identify performance bottlenecks and understand the acceleration mechanism.
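For example, if you capture the run with Nsight Systems (`nsys profile --trace=cuda,nvtx ...`), wrapping each stage in an NVTX range makes the per-stream overlap easy to see on the timeline. A rough sketch (the preprocessing helper is hypothetical, and `infer_batch` refers to the snippet above):

```python
import torch  # used here only for its NVTX bindings

# Label the stages so they show up as named ranges on the Nsight Systems timeline.
torch.cuda.nvtx.range_push("preprocess")
images = load_and_preprocess_images()   # hypothetical helper returning 4 arrays
torch.cuda.nvtx.range_pop()

torch.cuda.nvtx.range_push("multi_stream_depth")
depth_maps = infer_batch(images)        # the multi-stream sketch above
torch.cuda.nvtx.range_pop()
```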

ylab604 commented 3 weeks ago

> The extent of acceleration depends on your specific task; not every scenario benefits. For best practices in using GPUs, you can refer to NVIDIA's documentation.
>
> To illustrate with an example from our paper, the previous method involved sequentially executing depth estimation inference tasks on four virtual stereo images. In this case, it required four sequential inferences to obtain a complete depth map for one frame. With the introduction of multi-stream processing, images are fed continuously into the depth estimation task without waiting for the previous image to be processed. This can be compared to the operation of a CPU instruction pipeline, though they are not entirely the same. As a result, this approach completes the processing of four images in less than the time it previously took to process 1.5 images.
>
> If you want to understand the principles of acceleration more clearly, you can use NVIDIA Nsight to analyze the resource utilization timelines for both methods. This will help you identify performance bottlenecks and understand the acceleration mechanism.

I attempted to implement multi-stream with PyCUDA. Could you check if this attempt is correct? Thank you for your kind response.


ylab604 commented 3 weeks ago

```python
import time

import cv2
import numpy as np
import tensorrt as trt
import torch
from cuda import cudart
from multiprocessing import Process, Queue

import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

import trt_common

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
trt.init_libnvinfer_plugins(TRT_LOGGER, "")


def get_engine(engine_file_path):
    print(f"\033[32mReading engine from file {engine_file_path}\033[0m")
    with open(engine_file_path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
        return runtime.deserialize_cuda_engine(f.read())


def main(engine_path: str, input_height: int, input_width: int,
         left_image: str, right_image: str, output_path: str):
    engine1 = get_engine(engine_path)
    engine2 = get_engine(engine_path)
    engine3 = get_engine(engine_path)
    engine4 = get_engine(engine_path)

    context1 = engine1.create_execution_context()
    context2 = engine2.create_execution_context()
    context3 = engine3.create_execution_context()
    context4 = engine4.create_execution_context()

    left = cv2.resize(cv2.cvtColor(cv2.imread(left_image), cv2.COLOR_BGR2RGB),
                      (input_width, input_height)).astype(np.float32) / 255.0
    right = cv2.resize(cv2.cvtColor(cv2.imread(right_image), cv2.COLOR_BGR2RGB),
                       (input_width, input_height)).astype(np.float32) / 255.0

    left = np.transpose(left, (2, 0, 1))[np.newaxis, :, :, :]
    right = np.transpose(right, (2, 0, 1))[np.newaxis, :, :, :]

    input_data1 = np.concatenate([left, right], 1)
    input_data2 = np.concatenate([left, right], 1)
    input_data3 = np.concatenate([left, right], 1)
    input_data4 = np.concatenate([left, right], 1)
    input_data = [input_data1, input_data2, input_data3, input_data4]

    stream1 = cuda.Stream()
    stream2 = cuda.Stream()
    stream3 = cuda.Stream()
    stream4 = cuda.Stream()

    inputs1, outputs1, bindings1, _ = trt_common.allocate_buffers(engine1)

    for _ in range(100):
        t = time.time()
        # Memory allocation
        inputs1[0].host = np.ascontiguousarray(input_data)
        a = inputs1[0].host

        [cuda.memcpy_htod_async(inp.device, a[0], stream1) for inp in inputs1]
        [cuda.memcpy_htod_async(inp.device, a[1], stream2) for inp in inputs1]
        [cuda.memcpy_htod_async(inp.device, a[2], stream3) for inp in inputs1]
        [cuda.memcpy_htod_async(inp.device, a[3], stream4) for inp in inputs1]

        # Run inference.
        context1.execute_async_v2(bindings=bindings1, stream_handle=stream1.handle)
        context2.execute_async_v2(bindings=bindings1, stream_handle=stream2.handle)
        context3.execute_async_v2(bindings=bindings1, stream_handle=stream3.handle)
        context4.execute_async_v2(bindings=bindings1, stream_handle=stream4.handle)

        # Transfer predictions back from the GPU.
        [cuda.memcpy_dtoh_async(out.host, out.device, stream1) for out in outputs1]
        [cuda.memcpy_dtoh_async(out.host, out.device, stream2) for out in outputs1]
        [cuda.memcpy_dtoh_async(out.host, out.device, stream3) for out in outputs1]
        [cuda.memcpy_dtoh_async(out.host, out.device, stream4) for out in outputs1]

        outputs0 = [out.host for out in outputs1]
        outputs2 = [out.host for out in outputs1]
        outputs3 = [out.host for out in outputs1]
        outputs4 = [out.host for out in outputs1]

        dt = time.time() - t
        print(f"\033[34mElapsed: {dt:.3f} sec, {1/dt:.3f} FPS\033[0m")

        stream4.synchronize()

        dt = time.time() - t
        # dev = pycuda.autoinit.device
        # print('Concurrent Kernels:',
        #       bool(dev.get_attribute(cuda.device_attribute.CONCURRENT_KERNELS)))
        print(f"\033[34mElapsed: {dt:.3f} sec, {1/dt:.3f} FPS\033[0m")

    disp1 = outputs0[0].reshape(2, input_height, input_width)
    disp2 = outputs2[0].reshape(2, input_height, input_width)
    disp3 = outputs3[0].reshape(2, input_height, input_width)
    disp4 = outputs4[0].reshape(2, input_height, input_width)

    disp1 = disp1[0]
    disp2 = disp2[0]
    disp3 = disp3[0]
    disp4 = disp4[0]

    norm1 = ((disp1 - disp1.min()) / (disp1.max() - disp1.min()) * 255).astype(np.uint8)
    norm2 = ((disp2 - disp2.min()) / (disp2.max() - disp2.min()) * 255).astype(np.uint8)
    norm3 = ((disp3 - disp3.min()) / (disp3.max() - disp3.min()) * 255).astype(np.uint8)
    norm4 = ((disp4 - disp4.min()) / (disp4.max() - disp4.min()) * 255).astype(np.uint8)

    colored1 = cv2.applyColorMap(norm1, cv2.COLORMAP_PLASMA)
    colored2 = cv2.applyColorMap(norm2, cv2.COLORMAP_PLASMA)
    colored3 = cv2.applyColorMap(norm3, cv2.COLORMAP_PLASMA)
    colored4 = cv2.applyColorMap(norm4, cv2.COLORMAP_PLASMA)

    cv2.imwrite("output1.png", colored1)
    cv2.imwrite("output2.png", colored2)
    cv2.imwrite("output3.png", colored3)
    cv2.imwrite("output4.png", colored4)
    print(f"\033[32moutput: {output_path}\033[0m")


if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument("-e", "--engine_path", type=str, default="model.trt",
                        help="TensorRT engine file path.")
    parser.add_argument("-ih", "--input_height", type=int, default=240,
                        help="Model input height.")
    parser.add_argument("-iw", "--input_width", type=int, default=320,
                        help="Model input width.")
    parser.add_argument("-l", "--left_image", type=str, default="data/left.png",
                        help="Input left image.")
    parser.add_argument("-r", "--right_image", type=str, default="data/right.png",
                        help="Input right image.")
    parser.add_argument("-o", "--output_path", type=str, default="output.png",
                        help="Output colored disparity image path.")
    args = parser.parse_args()

    main(
        args.engine_path, args.input_height, args.input_width,
        args.left_image, args.right_image, args.output_path
    )
```

ylab604 commented 3 weeks ago

trt_common.py

```python
import argparse
import os

import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
import tensorrt as trt

try:
    # Sometimes python does not understand FileNotFoundError
    FileNotFoundError
except NameError:
    FileNotFoundError = IOError

EXPLICIT_BATCH = 1 << (int)(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)


def GiB(val):
    return val * 1 << 30


def add_help(description):
    parser = argparse.ArgumentParser(description=description, formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    args, _ = parser.parse_known_args()


def find_sample_data(description="Runs a TensorRT Python sample", subfolder="", find_files=[], err_msg=""):
    '''
    Parses sample arguments.

    Args:
        description (str): Description of the sample.
        subfolder (str): The subfolder containing data relevant to this sample
        find_files (str): A list of filenames to find. Each filename will be replaced with an absolute path.

    Returns:
        str: Path of data directory.
    '''
    # Standard command-line arguments for all samples.
    kDEFAULT_DATA_ROOT = os.path.join(os.sep, "usr", "src", "tensorrt", "data")
    parser = argparse.ArgumentParser(description=description, formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument("-d", "--datadir", help="Location of the TensorRT sample data directory, and any additional data directories.", action="append", default=[kDEFAULT_DATA_ROOT])
    args, _ = parser.parse_known_args()

    def get_data_path(data_dir):
        # If the subfolder exists, append it to the path, otherwise use the provided path as-is.
        data_path = os.path.join(data_dir, subfolder)
        if not os.path.exists(data_path):
            if data_dir != kDEFAULT_DATA_ROOT:
                print("WARNING: " + data_path + " does not exist. Trying " + data_dir + " instead.")
            data_path = data_dir
        # Make sure data directory exists.
        if not (os.path.exists(data_path)) and data_dir != kDEFAULT_DATA_ROOT:
            print("WARNING: {:} does not exist. Please provide the correct data path with the -d option.".format(data_path))
        return data_path

    data_paths = [get_data_path(data_dir) for data_dir in args.datadir]
    return data_paths, locate_files(data_paths, find_files, err_msg)


def locate_files(data_paths, filenames, err_msg=""):
    """
    Locates the specified files in the specified data directories.
    If a file exists in multiple data directories, the first directory is used.

    Args:
        data_paths (List[str]): The data directories.
        filename (List[str]): The names of the files to find.

    Returns:
        List[str]: The absolute paths of the files.

    Raises:
        FileNotFoundError if a file could not be located.
    """
    found_files = [None] * len(filenames)
    for data_path in data_paths:
        # Find all requested files.
        for index, (found, filename) in enumerate(zip(found_files, filenames)):
            if not found:
                file_path = os.path.abspath(os.path.join(data_path, filename))
                if os.path.exists(file_path):
                    found_files[index] = file_path

    # Check that all files were found.
    for f, filename in zip(found_files, filenames):
        if not f or not os.path.exists(f):
            raise FileNotFoundError("Could not find {:}. Searched in data paths: {:}\n{:}".format(filename, data_paths, err_msg))
    return found_files


# Simple helper data class that's a little nicer to use than a 2-tuple.
class HostDeviceMem(object):
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem

    def __str__(self):
        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

    def __repr__(self):
        return self.__str__()


def allocate_buffers(engine):
    inputs = []
    outputs = []
    bindings = []
    stream = cuda.Stream()
    for binding in engine:
        size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
        # size = trt.volume(engine.get_binding_shape(binding)) * 4
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        # Allocate host and device buffers.
        host_mem = cuda.pagelocked_empty(size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        # Append the device buffer to device bindings.
        bindings.append(int(device_mem))
        # Append to the appropriate list.
        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))
    return inputs, outputs, bindings, stream


def do_inference_v2(context, bindings, inputs, outputs, stream):
    # Transfer input data to the GPU.
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
    # Run inference.
    context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    # Transfer predictions back from the GPU.
    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
    # Synchronize the stream.
    # stream.synchronize()
    # Return only the host outputs.
    return [out.host for out in outputs]
```