NVIDIA-AI-IOT / cuDLA-samples

YOLOv5 on Orin DLA

cudla import external semaphore FAILED #15

Open WangFengtu1996 opened 6 months ago

WangFengtu1996 commented 6 months ago
2yjia commented 6 months ago

I can run both modes, but the inference time for each image is 20 ms, which differs from what the experiment reports. Could you tell me the latency of your hybrid mode? @WangFengtu1996
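
One variable worth ruling out when comparing latency numbers is the board's power mode and clock state; the two setups reported later in this thread use different NV power modes (MAXN vs MODE_30W). A minimal sketch with the standard JetPack tools for pinning clocks before benchmarking (run on the Jetson itself; the mode numbering is assumed to match the AGX Orin defaults):

sudo nvpmodel -m 0   # select MAXN
sudo jetson_clocks   # lock clocks to their maximum for the current power mode
sudo nvpmodel -q     # confirm the active power mode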

WangFengtu1996 commented 6 months ago

@2yjia I cannot understand why I cannot run standalone mode successfully. The inference time is about 17-20 ms, and it shortens once warmup is finished. My platform is an NVIDIA Jetson AGX Orin Developer Kit. Could you give me some guidance on running inference in standalone mode? Thanks.
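
For reference, the sample builds the same targets in either mode through make flags; a minimal sketch of the two invocations, using only the target and flag names that appear later in this thread (not re-checked against the current README):

# hybrid mode (default): DLA tasks submitted through cuDLA on a CUDA stream
make validate_cudla_int8 -j

# standalone mode, with the deterministic-semaphore option discussed in issue #7
make validate_cudla_int8 USE_DLA_STANDALONE_MODE=1 USE_DETERMINISTIC_SEMAPHORE=1 -j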

WangFengtu1996 commented 6 months ago

@2yjia I followed this issue https://github.com/NVIDIA-AI-IOT/cuDLA-samples/issues/7, but now I am hitting a new problem on my side:

(py310) orin@orin-root:~/workspace/cuDLA-samples$ make validate_cudla_int8 USE_DLA_STANDALONE_MODE=1 USE_DETERMINISTIC_SEMAPHORE=1 -j
/usr/local/cuda/bin/nvcc -I /usr/local/cuda/include -I ./src/matx_reformat/ -I /usr/local/include/opencv4/ -I /usr/include/jsoncpp/ -I /usr/include -gencode arch=compute_87,code=sm_87 -c -o build/decode_nms.o src/decode_nms.cu
g++ -I /usr/local/cuda/include -I ./src/matx_reformat/ -I /usr/local/include/opencv4/ -I /usr/include/jsoncpp/ -I /usr/include --std=c++14 -Wno-deprecated-declarations -Wall -DUSE_DLA_STANDALONE_MODE -DUSE_DETERMINISTIC_SEMAPHORE -O2 -c -o build/validate_coco.o src/validate_coco.cpp
g++ -I /usr/local/cuda/include -I ./src/matx_reformat/ -I /usr/local/include/opencv4/ -I /usr/include/jsoncpp/ -I /usr/include --std=c++14 -Wno-deprecated-declarations -Wall -DUSE_DLA_STANDALONE_MODE -DUSE_DETERMINISTIC_SEMAPHORE -O2 -c -o build/yolov5.o src/yolov5.cpp
g++ -I /usr/local/cuda/include -I ./src/matx_reformat/ -I /usr/local/include/opencv4/ -I /usr/include/jsoncpp/ -I /usr/include --std=c++14 -Wno-deprecated-declarations -Wall -DUSE_DLA_STANDALONE_MODE -DUSE_DETERMINISTIC_SEMAPHORE -O2 -c -o build/cudla_context_hybrid.o src/cudla_context_hybrid.cpp
g++ -I /usr/local/cuda/include -I ./src/matx_reformat/ -I /usr/local/include/opencv4/ -I /usr/include/jsoncpp/ -I /usr/include --std=c++14 -Wno-deprecated-declarations -Wall -DUSE_DLA_STANDALONE_MODE -DUSE_DETERMINISTIC_SEMAPHORE -O2 -c -o build/cudla_context_standalone.o src/cudla_context_standalone.cpp
src/cudla_context_standalone.cpp: In member function ‘void cuDLAContextStandalone::initialize()’:
src/cudla_context_standalone.cpp:324:19: error: ‘NvSciSyncFenceUpdateFence’ was not declared in this scope; did you mean ‘NvSciSyncObjGenerateFence’?
  324 |     m_nvsci_err = NvSciSyncFenceUpdateFence(m_WaitEventContext.sync_obj, m_WaiterID, m_WaiterValue, m_WaitEventContext.nvsci_fence_ptr);
      |                   ^~~~~~~~~~~~~~~~~~~~~~~~~
      |                   NvSciSyncObjGenerateFence
src/cudla_context_standalone.cpp:326:19: error: ‘NvSciSyncFenceExtractFence’ was not declared in this scope; did you mean ‘NvSciSyncIpcExportFence’?
  326 |     m_nvsci_err = NvSciSyncFenceExtractFence(m_WaitEventContext.nvsci_fence_ptr,&m_WaiterID,&m_WaiterValue);
      |                   ^~~~~~~~~~~~~~~~~~~~~~~~~~
      |                   NvSciSyncIpcExportFence
src/cudla_context_standalone.cpp: In member function ‘int cuDLAContextStandalone::submitDLATask(cudaStream_t)’:
src/cudla_context_standalone.cpp:443:19: error: ‘NvSciSyncFenceExtractFence’ was not declared in this scope; did you mean ‘NvSciSyncIpcExportFence’?
  443 |     m_nvsci_err = NvSciSyncFenceExtractFence(m_WaitEventContext.nvsci_fence_ptr ,&m_WaiterID, &m_WaiterValue);
      |                   ^~~~~~~~~~~~~~~~~~~~~~~~~~
      |                   NvSciSyncIpcExportFence
src/cudla_context_standalone.cpp:445:19: error: ‘NvSciSyncFenceUpdateFence’ was not declared in this scope; did you mean ‘NvSciSyncObjGenerateFence’?
  445 |     m_nvsci_err = NvSciSyncFenceUpdateFence(m_WaitEventContext.sync_obj,
      |                   ^~~~~~~~~~~~~~~~~~~~~~~~~
      |                   NvSciSyncObjGenerateFence
make: *** [Makefile:69: build/cudla_context_standalone.o] Error 1
make: *** Waiting for unfinished jobs....
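
The "was not declared in this scope" errors suggest the NvSciSync headers being compiled against may be older than what cudla_context_standalone.cpp expects: the compiler can see NvSciSyncObjGenerateFence and NvSciSyncIpcExportFence, but not NvSciSyncFenceUpdateFence or NvSciSyncFenceExtractFence. A quick check, as a sketch (the header file name and library path are assumptions based on a stock L4T layout and on the nvsci_headers.tbz2 archive mentioned later in this thread):

# from the directory where the NvSci headers were extracted
grep -nE "NvSciSyncFenceUpdateFence|NvSciSyncFenceExtractFence" nvscisync.h

# check which NvSciSync runtime the board actually ships
dpkg -l | grep -i nvsci
ls -l /usr/lib/aarch64-linux-gnu/tegra/libnvscisync.so*
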
WangFengtu1996 commented 6 months ago

@2yjia I tried to follow the repo's README to fine-tune the model and export a new one. Did you get that whole workflow to work? I hit a problem at the QAT -> PTQ step: the scale information for the output tensors is missing.

(py310) orin@orin-root:~/workspace/cuDLA-samples$ python export/qdq_translator/qdq_translator.py --input_onnx_models=yolov5_trimmed_qat.onnx --output_dir=data/model/ --infer_concat_scales --infer_mul_scales 
INFO:root:Parsing yolov5_trimmed_qat.onnx...
INFO:root:No tensor scales for /model.24/m.0/Conv's output tensor s8
INFO:root:No tensor scales for /model.24/m.1/Conv's output tensor s16
INFO:root:No tensor scales for /model.24/m.2/Conv's output tensor s32
WangFengtu1996 commented 6 months ago

@2yjia Here is my device info. Does yours match?

(base) orin@orin-root:/usr/lib/aarch64-linux-gnu/tegra$ jetson_release
Software part of jetson-stats 4.2.4 - (c) 2024, Raffaello Bonghi
Model: Jetson AGX Orin Developer Kit - Jetpack 5.1.2 [L4T 35.4.1]
NV Power Mode[0]: MAXN
Serial Number: [XXX Show with: jetson_release -s XXX]
Hardware:
 - P-Number: p3701-0005
 - Module: NVIDIA Jetson AGX Orin (64GB ram)
Platform:
 - Distribution: Ubuntu 20.04 focal
 - Release: 5.10.120-tegra
jtop:
 - Version: 4.2.4
 - Service: Active
Libraries:
 - CUDA: 11.4.315
 - cuDNN: 8.6.0.166
 - TensorRT: 5.1.2
 - VPI: 2.3.9
 - Vulkan: 1.3.204
 - OpenCV: 4.6.0 - with CUDA: YES
2yjia commented 6 months ago

Software part of jetson-stats 4.2.4 - (c) 2024, Raffaello Bonghi
Model: Jetson AGX Orin - Jetpack 5.1 [L4T 35.2.1]
NV Power Mode[2]: MODE_30W
Serial Number: [XXX Show with: jetson_release -s XXX]
Hardware:

2yjia commented 6 months ago

@2yjia I tried to follow the repo's README to fine-tune the model and export a new one. Did you get that whole workflow to work? I hit a problem at the QAT -> PTQ step: the scale information for the output tensors is missing.

I hit the same problem. Running the script produces the noqdq.onnx, but I run into some issues when deploying inference with that onnx. I don't know how the author generated the fp16 and int8 onnx files.

WangFengtu1996 commented 6 months ago

@2yjia Regarding my problem above: could you go into the directory where you extracted the nvsci* archive, grep for these two functions, and share the results? Thanks a lot!

# run inside the directory where nvsci_headers.tbz2 was extracted
grep -nr "NvSciSyncFenceUpdateFence"

grep -nr "NvSciSyncObjGenerateFence"
ou525 commented 6 months ago

I encountered the same problem (see attached screenshots). Has it been solved?

mchi-zg commented 2 months ago

Hi all, could you try this on JetPack 6.0 DP or later? Thanks!
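
For anyone checking which release a board is currently on before trying the upgrade, a minimal sketch using standard Jetson commands (assuming the usual nvidia-jetpack meta-package is installed):

cat /etc/nv_tegra_release                         # installed L4T release
apt-cache show nvidia-jetpack | grep -i version   # JetPack version of the meta-package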