NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

Facing issue while running Flask app with TensorRT model on Jetson Nano #475

Closed Akshaysharma29 closed 4 years ago

Akshaysharma29 commented 4 years ago

Description

I have inference code written with the TensorRT Python API. I want to run this code in Flask, but I get the error below when trying to allocate buffers:

Debugging middleware caught exception in streamed response at a point where response headers were already sent.
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/werkzeug/wsgi.py", line 506, in __next__
    return self._next()
  File "/usr/local/lib/python3.6/dist-packages/werkzeug/wrappers/base_response.py", line 45, in _iter_encoded
    for item in iterable:
  File "/home/jetson-alpha/Desktop/video_streamming_tensorRt/video.py", line 89, in gen
    inputs, outputs, bindings, stream = allocate_buffers(engine)  # input, output: host # bindings
  File "/home/jetson-alpha/Desktop/video_streamming_tensorRt/config.py", line 23, in allocate_buffers
    stream = cuda.Stream()
pycuda._driver.LogicError: explicit_context_dependent failed: invalid device context - no currently active context?

The code works fine in a Jupyter notebook.
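For reference, the allocate_buffers in the traceback is presumably along the lines of the helper used in NVIDIA's TensorRT Python samples. The sketch below is an assumption, not the author's actual config.py; it just shows where the failure surfaces when the calling thread has no active CUDA context:

import pycuda.driver as cuda
import tensorrt as trt

def allocate_buffers(engine):
    """Allocate host/device buffers and a CUDA stream for a TensorRT engine."""
    inputs, outputs, bindings = [], [], []
    # This call raises "explicit_context_dependent failed" when the calling
    # thread has no active CUDA context (e.g. a Flask/werkzeug worker thread).
    stream = cuda.Stream()
    for binding in engine:
        size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        host_mem = cuda.pagelocked_empty(size, dtype)   # page-locked host buffer
        device_mem = cuda.mem_alloc(host_mem.nbytes)    # matching device buffer
        bindings.append(int(device_mem))
        if engine.binding_is_input(binding):
            inputs.append((host_mem, device_mem))
        else:
            outputs.append((host_mem, device_mem))
    return inputs, outputs, bindings, stream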

Environment

TensorRT Version: 6.0.1.10
Device Type: Jetson Nano
CUDA Version: 10
Python Version (if applicable): 3.6
PyTorch Version (if applicable): 1.1.0

Relevant Files

Steps To Reproduce

Akshaysharma29 commented 4 years ago

Solution: https://stackoverflow.com/questions/61056832/facing-issue-while-running-flask-app-with-tensorrt-model-on-jetson-nano#comment108055562_61061543

weixiaolian21 commented 4 years ago

Solution: https://stackoverflow.com/questions/61056832/facing-issue-while-running-flask-app-with-tensorrt-model-on-jetson-nano#comment108055562_61061543

There is nothing at that URL when I open it. I have the same issue; could you share how you solved it?

Akshaysharma29 commented 4 years ago

Hi @weixiaolian21, I have not completely solved this issue. There is a threading issue with the worker thread, which can be worked around using a callback function as in this link: https://stackoverflow.com/questions/61223028/flask-app-is-keep-on-loading-at-the-time-of-predictiontensorrt

But then a new issue occurs. If you are able to solve this, please share your approach. Thanks.
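One common pattern for this kind of worker-thread problem is to keep all CUDA/TensorRT work in a single dedicated thread that owns the context, and hand it jobs that carry a callback for the result. The sketch below is only an illustration under assumptions (model.trt, do_inference and the callback shape are placeholders, and it is not necessarily the fix from the linked answer); allocate_buffers is the helper sketched above:

import threading, queue
import pycuda.driver as cuda
import tensorrt as trt

job_queue = queue.Queue()

def inference_worker(engine_path):
    cuda.init()
    ctx = cuda.Device(0).make_context()          # context lives in this thread only
    try:
        logger = trt.Logger(trt.Logger.WARNING)
        with open(engine_path, 'rb') as f, trt.Runtime(logger) as runtime:
            engine = runtime.deserialize_cuda_engine(f.read())
        trt_ctx = engine.create_execution_context()
        inputs, outputs, bindings, stream = allocate_buffers(engine)
        while True:
            frame, callback = job_queue.get()    # blocks until Flask posts a job
            result = do_inference(trt_ctx, frame, inputs, outputs, bindings, stream)  # placeholder helper
            callback(result)                     # hand the result back to the caller
    finally:
        ctx.pop()

threading.Thread(target=inference_worker, args=('model.trt',), daemon=True).start()

A Flask handler would then put (frame, callback) onto job_queue and wait (e.g. on a threading.Event set by the callback) instead of touching CUDA itself.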

nik13 commented 4 years ago

I'm facing the same issue. Did you find any solution?

bobbilichandu commented 3 years ago

@Akshaysharma29 @weixiaolian21 @nik13 , were you able to solve this issue?

jkjung-avt commented 3 years ago

I think this issue could be resolved by wrapping the TensorRT inference function (i.e., execute_async or execute_async_v2) with pushing/popping of the default CUDA context.

Reference: https://github.com/jkjung-avt/tensorrt_demos/issues/213#issuecomment-691868942

https://github.com/jkjung-avt/tensorrt_demos/blob/899770162fecc21475db38d670b5b3467874046e/utils/yolo_with_plugins.py#L312-L321
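The referenced lines boil down to roughly the following pattern. This is a paraphrased sketch, not a copy of that file: TrtModel is a hypothetical wrapper name and do_inference stands in for the usual copy-in / execute_async_v2 / copy-out sequence.

class TrtModel:
    """Hypothetical wrapper; mirrors the push/pop pattern in the referenced lines."""

    def __init__(self, engine, cuda_ctx=None):
        self.engine = engine                    # deserialized TensorRT engine
        self.cuda_ctx = cuda_ctx                # pycuda context from Device.make_context()
        self.context = engine.create_execution_context()

    def infer(self, batch):
        if self.cuda_ctx:
            self.cuda_ctx.push()                # make the CUDA context current in this thread
        try:
            return do_inference(self.context, batch)   # placeholder for the buffer copies + execute_async_v2
        finally:
            if self.cuda_ctx:
                self.cuda_ctx.pop()             # release the context for other threads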

bobbilichandu commented 3 years ago

Thank you. I have one doubt: won't pushing and popping the context for every inference increase the inference time? Also, can this be used for Scaled-YOLOv4?

jkjung-avt commented 3 years ago

I have one doubt: won't pushing and popping the context for every inference increase the inference time?

Based on my tests on Jetson Nano, the overhead of CUDA context pushing/popping is negligible.

Also, can this be used for Scaled-YOLOv4?

My TensorRT YOLOv4 implementation does support Scaled-YOLOv4. More specifically, the code supports darknet "yolov4-csp" and "yolov4x-mish" models out of the box.

CoinCheung commented 3 years ago

Hi,

I have seen the above solution: we should push the CUDA context that we created at the beginning of the program, and pop it after we run inference with the execution context created by the engine.

I have another question about this: what if I have multiple GPUs and need to run inference on them in parallel? How should I configure my program?

I mean:

## The following four inferences are executed in parallel on 4 GPUs
out1 = trt_infer_on_gpu1(inp)
out2 = trt_infer_on_gpu2(inp)
out3 = trt_infer_on_gpu3(inp)
out4 = trt_infer_on_gpu4(inp)
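The thread does not answer this, but extending the push/pop pattern above, one conceivable layout is one CUDA context and one engine per GPU, each driven from its own thread. This is only a sketch under assumptions: the engine file names, do_inference, and inp (the input from the snippet above) are placeholders, each GPU needs an engine built on that device, and nothing here is tested advice from this thread.

import threading
import pycuda.driver as cuda
import tensorrt as trt

def run_on_gpu(gpu_id, engine_path, inp, results):
    ctx = cuda.Device(gpu_id).make_context()    # separate CUDA context per GPU
    try:
        logger = trt.Logger(trt.Logger.WARNING)
        with open(engine_path, 'rb') as f, trt.Runtime(logger) as runtime:
            engine = runtime.deserialize_cuda_engine(f.read())
        trt_ctx = engine.create_execution_context()
        results[gpu_id] = do_inference(trt_ctx, inp)   # placeholder helper
    finally:
        ctx.pop()

cuda.init()
results = {}
threads = [threading.Thread(target=run_on_gpu, args=(i, f'model_gpu{i}.trt', inp, results))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
out1, out2, out3, out4 = (results[i] for i in range(4))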