IRCVLab / Depth-Anything-for-Jetson-Orin

Real-time Depth Estimation for Jetson Orin

Slow TensorRT Inference Speed on Jetson Orin NX #1

Open zzzzzyh111 opened 3 weeks ago

zzzzzyh111 commented 3 weeks ago

Thank you for your excellent work! :satisfied: :satisfied: :satisfied:

Recently, I have been trying to use TensorRT to accelerate Depth Anything on a Jetson Orin NX. However, I found that inference with the converted TRT engine is not significantly faster than with the ONNX file; it is actually slightly slower. Specifically:

ONNX Inference Time: 2.7s per image
TRT Inference Time: 3.0s per image

The library versions are as follows:

- JetPack: 5.1
- CUDA: 11.4.315
- cuDNN: 8.6.0.166
- TensorRT: 8.5.2.2
- VPI: 2.2.4
- Vulkan: 1.3.204
- OpenCV: 4.5.4 - with CUDA: NO
- torch: 2.1.0
- torchvision: 0.16.0
- onnx: 1.16.1
- onnxruntime: 1.8.0

The code to convert the PyTorch checkpoint to an ONNX file is as follows:

import torch

# ZoeDepth helpers (from the isl-org/ZoeDepth repository)
from zoedepth.utils.config import get_config
from zoedepth.models.builder import build_model

model_name = "zoedepth"
pretrained_resource = "local::./checkpoints/ZoeDepthIndoor_05-Jun_15-11-ebbebc6c1002_best.pt"
dataset = None
overwrite = {"pretrained_resource": pretrained_resource}
config = get_config(model_name, "eval", dataset, **overwrite)
model = build_model(config)
model.eval()

# Run a dummy input of the expected resolution through the model once
dummy_input = torch.randn(1, 3, 392, 518)
_ = model(dummy_input)

# Export for inspection (verbose graph dump)
torch.onnx.export(model, dummy_input, "ZoeDepth_indoor.onnx", verbose=True)

# Export the version used on the Jetson, with named inputs/outputs
torch.onnx.export(
    model,
    dummy_input,
    "./checkpoints/ZoeDepth_indoor_jetson.onnx",
    opset_version=11,
    input_names=["input"],
    output_names=["output"],
)

The function to convert the ONNX file to a TRT file is as follows:

import tensorrt as trt
from pathlib import Path

def build_engine(onnx_file_path):
    onnx_file_path = Path(onnx_file_path)

    # Parse the ONNX model into a TensorRT network definition
    logger = trt.Logger(trt.Logger.VERBOSE)
    builder = trt.Builder(logger)
    network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, logger)

    with open(onnx_file_path, "rb") as model:
        if not parser.parse(model.read()):
            for error in range(parser.num_errors):
                print(parser.get_error(error))
            raise ValueError("Failed to parse the ONNX model.")

    # Set up the builder config: FP16 and a 2 GB workspace
    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 2 << 30)

    # Build and serialize the engine, then write it next to the ONNX file
    serialized_engine = builder.build_serialized_network(network, config)

    with open(onnx_file_path.with_suffix(".trt"), "wb") as f:
        f.write(serialized_engine)
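
A quick usage sketch (the engine path mirrors the export above; deserialization uses the standard TensorRT Python runtime):

import tensorrt as trt

build_engine("./checkpoints/ZoeDepth_indoor_jetson.onnx")

# Load the serialized engine back to confirm it deserializes cleanly
logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open("./checkpoints/ZoeDepth_indoor_jetson.trt", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())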

The function to perform inference using the TRT file is as follows:

import numpy as np
import torch
import tensorrt as trt
import pycuda.autoinit  # creates a CUDA context
import pycuda.driver as cuda

def infer_trt(engine, input_image):
    input_image = input_image.cpu().numpy().astype(np.float32)

    # NOTE: the execution context, host/device buffers, and stream are
    # (re)created on every call
    context = engine.create_execution_context()
    height, width = input_image.shape[2], input_image.shape[3]
    output_shape = (1, 1, height, width)

    # Allocate page-locked host memory
    h_input = cuda.pagelocked_empty(trt.volume((1, 3, height, width)), dtype=np.float32)
    h_output = cuda.pagelocked_empty(trt.volume((1, 1, height, width)), dtype=np.float32)

    # Allocate device memory
    d_input = cuda.mem_alloc(h_input.nbytes)
    d_output = cuda.mem_alloc(h_output.nbytes)

    bindings = [int(d_input), int(d_output)]
    stream = cuda.Stream()

    def perform_inference(images_np):
        np.copyto(h_input, images_np.ravel())
        cuda.memcpy_htod_async(d_input, h_input, stream)
        context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
        cuda.memcpy_dtoh_async(h_output, d_output, stream)
        stream.synchronize()
        return torch.tensor(h_output).view(output_shape)

    # Run inference on the original images
    pred1 = perform_inference(input_image)

    # Run inference on horizontally flipped images and average the two predictions
    flipped_images_np = np.flip(input_image, axis=3)
    pred2 = perform_inference(flipped_images_np)
    pred2 = torch.flip(pred2, [3])
    mean_pred = 0.5 * (pred1 + pred2)
    return mean_pred
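
A minimal timing sketch for infer_trt (illustrative only: a random tensor stands in for a preprocessed frame, and the warm-up and iteration counts are arbitrary):

import time
import torch
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open("./checkpoints/ZoeDepth_indoor_jetson.trt", "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())

image = torch.randn(1, 3, 392, 518)  # stand-in for a preprocessed frame

# Warm-up, then average over a few runs
infer_trt(engine, image)
n_runs = 10
start = time.perf_counter()
for _ in range(n_runs):
    depth = infer_trt(engine, image)  # (1, 1, 392, 518)
print(f"TensorRT: {(time.perf_counter() - start) / n_runs:.3f} s per image")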

The code runs without any issues, apart from some warnings during the ONNX conversion, but the final inference speed is still not satisfactory. Looking forward to your response! :heart: :heart: :heart:

aarchiiive commented 2 weeks ago

I am aware that Python-based TensorRT implementations are significantly slower than those based on C++. I have already made considerable progress on a C++ implementation and plan to open-source it on GitHub soon. Initially it will start with YOLOv8, though I'm not yet certain whether it will cover other models. Models with 10M parameters or fewer are expected to have an inference time under 10 ms; currently, YOLOv8n measures approximately 2 ms.

From what I can see, there don't appear to be any issues with your code. If you have cv2.imshow enabled or any other unnecessary operations, try commenting those out and running it again.

Thank you for your interest, and I appreciate the effort you've put into customizing the code! Sorry for the delayed response; I will make sure to keep a closer eye on issues going forward.
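
For illustration, one kind of per-frame work that can be hoisted out on the Python side is the context/buffer/stream setup that infer_trt above repeats on every call. A rough sketch of a reuse pattern (not from this repo; the class name and fixed input size are just illustrative):

import numpy as np
import torch
import tensorrt as trt
import pycuda.autoinit  # creates a CUDA context
import pycuda.driver as cuda

class TRTDepthRunner:
    """Holds one execution context and one set of buffers for a fixed input size."""

    def __init__(self, engine, height=392, width=518):
        self.context = engine.create_execution_context()
        self.output_shape = (1, 1, height, width)
        self.h_input = cuda.pagelocked_empty(trt.volume((1, 3, height, width)), dtype=np.float32)
        self.h_output = cuda.pagelocked_empty(trt.volume(self.output_shape), dtype=np.float32)
        self.d_input = cuda.mem_alloc(self.h_input.nbytes)
        self.d_output = cuda.mem_alloc(self.h_output.nbytes)
        self.bindings = [int(self.d_input), int(self.d_output)]
        self.stream = cuda.Stream()

    def __call__(self, images_np):
        # Reuses the pre-allocated buffers; only the copies and the kernel launch happen per frame
        np.copyto(self.h_input, images_np.ravel())
        cuda.memcpy_htod_async(self.d_input, self.h_input, self.stream)
        self.context.execute_async_v2(bindings=self.bindings, stream_handle=self.stream.handle)
        cuda.memcpy_dtoh_async(self.h_output, self.d_output, self.stream)
        self.stream.synchronize()
        return torch.tensor(self.h_output).view(self.output_shape)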

If you are interested in TensorRT, I recommend checking out the following repositories:

zzzzzyh111 commented 1 week ago

Thanks for your prompt reply! I will try deploying it with C++ and look for anything else that might be affecting my inference speed. I will update you as soon as I have new information.