Tencent / ncnn

ncnn is a high-performance neural network inference framework optimized for the mobile platform

Vulkan raises segmentation fault #2354

Closed Ca0L closed 9 months ago

Ca0L commented 3 years ago

Bug

The following code prints OK and then raises a segmentation fault when the program exits.

#include "net.h"
#include <iostream>
#include <string>

int main(int argc, char* argv[])
{
    ncnn::Net model;
    model.opt.use_vulkan_compute = true;
    model.set_vulkan_device(1);
    std::cout << "OK" << std::endl;
    return 0;
}

This is the Makefile I used; test-softmax.cpp contains the code above.

repo_home=/data1/home/cailinchao/repos/ncnn/
build_dir=$(repo_home)/build
glslang_build=$(build_dir)/glslang

main: test-softmax.cpp
    g++ test-softmax.cpp -g -o main -lncnnd -lvulkan -lSPIRV -lglslang -lOGLCompiler -lOSDependent -fopenmp -lgomp -I$(repo_home)/src -I$(build_dir)/src -L$(build_dir)/src -L$(glslang_build)/glslang -L$(glslang_build)/OGLCompilersDLL -L$(glslang_build)/SPIRV -L$(glslang_build)/OSDependent/Unix -L$(glslang_build)/glslang/OSDependent/Unix/

.PHONY: clean
clean:
    rm main

This is the output.

[0 GeForce RTX 2080 Ti]  queueC=2[8]  queueG=0[16]  queueT=1[2]
[0 GeForce RTX 2080 Ti]  bugsbn1=0  bugcopc=0  bugihfa=0
[0 GeForce RTX 2080 Ti]  fp16p=1  fp16s=1  fp16a=1  int8s=1  int8a=1
[0 GeForce RTX 2080 Ti]  subgroup=32  basic=1  vote=1  ballot=1  shuffle=1
[1 GeForce RTX 2080 Ti]  queueC=2[8]  queueG=0[16]  queueT=1[2]
[1 GeForce RTX 2080 Ti]  bugsbn1=0  bugcopc=0  bugihfa=0
[1 GeForce RTX 2080 Ti]  fp16p=1  fp16s=1  fp16a=1  int8s=1  int8a=1
[1 GeForce RTX 2080 Ti]  subgroup=32  basic=1  vote=1  ballot=1  shuffle=1
[2 GeForce GTX 1080 Ti]  queueC=2[8]  queueG=0[16]  queueT=1[2]
[2 GeForce GTX 1080 Ti]  bugsbn1=0  bugcopc=0  bugihfa=0
[2 GeForce GTX 1080 Ti]  fp16p=1  fp16s=1  fp16a=0  int8s=1  int8a=1
[2 GeForce GTX 1080 Ti]  subgroup=32  basic=1  vote=1  ballot=1  shuffle=1
[3 GeForce RTX 2080 Ti]  queueC=2[8]  queueG=0[16]  queueT=1[2]
[3 GeForce RTX 2080 Ti]  bugsbn1=0  bugcopc=0  bugihfa=0
[3 GeForce RTX 2080 Ti]  fp16p=1  fp16s=1  fp16a=1  int8s=1  int8a=1
[3 GeForce RTX 2080 Ti]  subgroup=32  basic=1  vote=1  ballot=1  shuffle=1
[4 GeForce RTX 2080 Ti]  queueC=2[8]  queueG=0[16]  queueT=1[2]
[4 GeForce RTX 2080 Ti]  bugsbn1=0  bugcopc=0  bugihfa=0
[4 GeForce RTX 2080 Ti]  fp16p=1  fp16s=1  fp16a=1  int8s=1  int8a=1
[4 GeForce RTX 2080 Ti]  subgroup=32  basic=1  vote=1  ballot=1  shuffle=1
[5 GeForce GTX 1080 Ti]  queueC=2[8]  queueG=0[16]  queueT=1[2]
[5 GeForce GTX 1080 Ti]  bugsbn1=0  bugcopc=0  bugihfa=0
[5 GeForce GTX 1080 Ti]  fp16p=1  fp16s=1  fp16a=0  int8s=1  int8a=1
[5 GeForce GTX 1080 Ti]  subgroup=32  basic=1  vote=1  ballot=1  shuffle=1
OK
Segmentation fault (core dumped)

This is the backtrace:

Reading symbols from ./main...done.
[New LWP 10147]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `./main'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  __GI___pthread_mutex_lock (mutex=0x8) at ../nptl/pthread_mutex_lock.c:65
65      ../nptl/pthread_mutex_lock.c: No such file or directory.
(gdb) bt
#0  __GI___pthread_mutex_lock (mutex=0x8) at ../nptl/pthread_mutex_lock.c:65
#1  0x00007effdf454ae5 in ?? () from /usr/lib/x86_64-linux-gnu/libEGL_mesa.so.0
#2  0x00007effe23406fb in eglReleaseThread () from /usr/lib/x86_64-linux-gnu/libEGL.so.1
#3  0x00007effe5562337 in ?? () from /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.0
#4  0x00007effe5561149 in ?? () from /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.0
#5  0x00007effe6dbfcf0 in _dl_close_worker (map=map@entry=0x5626a3378480, force=force@entry=false)
    at dl-close.c:293
#6  0x00007effe6dc0afa in _dl_close_worker (force=false, map=0x5626a3378480) at dl-close.c:125
#7  _dl_close (_map=0x5626a3378480) at dl-close.c:842
#8  0x00007effe5b4451f in __GI__dl_catch_exception (exception=exception@entry=0x7ffddeef81a0, 
    operate=operate@entry=0x7effe57da070 <dlclose_doit>, args=args@entry=0x5626a3378480)
    at dl-error-skeleton.c:196
#9  0x00007effe5b445af in __GI__dl_catch_error (objname=objname@entry=0x5626a3289b50, 
    errstring=errstring@entry=0x5626a3289b58, mallocedp=mallocedp@entry=0x5626a3289b48, 
    operate=operate@entry=0x7effe57da070 <dlclose_doit>, args=args@entry=0x5626a3378480)
    at dl-error-skeleton.c:215
#10 0x00007effe57da745 in _dlerror_run (operate=operate@entry=0x7effe57da070 <dlclose_doit>, 
    args=0x5626a3378480) at dlerror.c:162
#11 0x00007effe57da0b3 in __dlclose (handle=<optimized out>) at dlclose.c:46
#12 0x00007effe6b7aa28 in ?? () from /usr/lib/x86_64-linux-gnu/libvulkan.so.1
#13 0x00007effe6b84d3f in vkDestroyInstance () from /usr/lib/x86_64-linux-gnu/libvulkan.so.1
#14 0x00005626a1ae16cb in ncnn::destroy_gpu_instance ()
    at /data1/home/cailinchao/repos/ncnn/src/gpu.cpp:1025
#15 0x00005626a1aebc0d in ncnn::__ncnn_vulkan_instance_holder::~__ncnn_vulkan_instance_holder (
    this=0x5626a23c6768 <ncnn::g_instance>, __in_chrg=<optimized out>)
    at /data1/home/cailinchao/repos/ncnn/src/gpu.cpp:50
#16 0x00007effe5a200f1 in __run_exit_handlers (status=0, listp=0x7effe5dc8718 <__exit_funcs>, 
    run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:108
#17 0x00007effe5a201ea in __GI_exit (status=<optimized out>) at exit.c:139
#18 0x00007effe59feb9e in __libc_start_main (main=0x5626a1aad64a <main(int, char**)>, argc=1, 
    argv=0x7ffddeef8428, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, 
    stack_end=0x7ffddeef8418) at ../csu/libc-start.c:344
#19 0x00005626a1aad56a in _start ()

Environment

OS: Ubuntu 18.04.4 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: 10.0.0 
CMake version: version 3.18.2

GPU models and configuration: 
GPU 0: GeForce RTX 2080 Ti
GPU 1: GeForce RTX 2080 Ti
GPU 2: GeForce GTX 1080 Ti

Nvidia driver version: 450.80.02

ncnn version: commit 60df2740a7a1eb57a2817aca3385c3153d7a5445
ncnn build command: cmake -DCMAKE_BUILD_TYPE=Debug -DNCNN_VULKAN=ON -DNCNN_BUILD_EXAMPLES=ON .. && make -j$(nproc)

Eleanor456 commented 3 years ago

I got the same error. How did you solve the problem?

Ca0L commented 3 years ago

> I got the same error. How did you solve the problem?

Sorry, I haven't solved it yet.

kulicuu commented 3 years ago

I got a segfault too. I thought it was caused by malformed vertices.

UNASSIGNED-khronos-validation-createinstance-status-message(INFO / SPEC): msgNum: -671457468 - Validation Information: [ UNASSIGNED-khronos-validation-createinstance-status-message ] Object 0: handle = 0x15cce443550, type = VK_OBJECT_TYPE_INSTANCE; | MessageID = 0xd7fa5f44 | Khronos Validation Layer Active:
    Settings File: Found at C:\Users\wylie\AppData\Local\LunarG\vkconfig\override\vk_layer_settings.txt specified by VkConfig application override.
    Current Enables: None.
    Current Disables: VK_VALIDATION_FEATURE_DISABLE_THREAD_SAFETY_EXT.

    Objects: 1
        [0] 0x15cce443550, type: 1, name: NULL
INFO:
GENERAL [Loader Message (0)] : Inserted device layer VK_LAYER_KHRONOS_validation (C:\VulkanSDK\1.2.182.0\Bin\\.\VkLayer_khronos_validation.dll)

INFO:
GENERAL [Loader Message (0)] : Inserted device layer VK_LAYER_OBS_HOOK (C:\ProgramData\obs-studio-hook\.\graphics-hook64.dll)

INFO:
GENERAL [Loader Message (0)] : Inserted device layer VK_LAYER_NV_optimus (C:\WINDOWS\System32\DriverStore\FileRepository\nvltwi.inf_amd64_62c6fe9661e469e3\.\nvoglv64.dll)

error: process didn't exit successfully: `target\debug\peregrine.exe` (exit code: 0xc0000005, STATUS_ACCESS_VIOLATION)
Segmentation fault

JujuDel commented 2 years ago

If this topic is still open, here is how to fix it.

Solution

#include "net.h"
#include <iostream>
#include <string>

int main(int argc, char* argv[])
{
    ncnn::Net model;
    model.opt.use_vulkan_compute = true;
    model.set_vulkan_device(1);
    ncnn::destroy_gpu_instance(); // <--- Add this
    std::cout << "OK" << std::endl;
    return 0;
}

Short explanation

Until this is fixed upstream, I believe you should call destroy_gpu_instance() explicitly whenever your code calls create_gpu_instance() at some point (in your case, set_vulkan_device does that).

In theory, the same cleanup already happens inside ~__ncnn_vulkan_instance_holder() (see gpu.h and gpu.cpp) when the static __ncnn_vulkan_instance_holder g_instance is destroyed. The backtrace above shows that destructor running from the exit handlers, after main has returned, and that is where vkDestroyInstance crashes.

Commit used

I tested this on tag 20210720.

lblbk commented 1 year ago

@JujuDel Hello. After searching for this issue I found your solution, but it has no effect on my current problem. Do you have any other solution? Thanks.

nihui commented 9 months ago

https://github.com/Tencent/ncnn/pull/5234

nihui commented 9 months ago

https://github.com/Tencent/ncnn/commit/ded0b78bb2926d3fd116c94d988c8e3d836d191a