AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet )
http://pjreddie.com/darknet/

Segmentation fault => opencv 4.3.0 + cuda10.1 + cudnn7.6 + darknet latest codebase (till May 8) #5534

Closed tomingliu closed 4 years ago

tomingliu commented 4 years ago

Dear @AlexeyAB, thanks for your great work! Could you please help me with the issue below? Thanks in advance.

I'm building the newest darknet tree with the command below on CentOS 7. The build is successful.

make -j8 GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1 LIBSO=1 AVX=1 DEBUG=1

I also built opencv 4.3.0 with the command below:

cmake -D CMAKE_BUILD_TYPE=DEBUG \
      -D CMAKE_INSTALL_PREFIX=/usr \
      -D INSTALL_PYTHON_EXAMPLES=ON \
      -D INSTALL_C_EXAMPLES=OFF \
      -D OPENCV_ENABLE_NONFREE=ON \
      -D WITH_CUDA=ON \
      -D WITH_CUDNN=ON \
      -D WITH_CUFFT=ON \
      -D OPENCV_DNN_CUDA=ON \
      -D ENABLE_FAST_MATH=1 \
      -D CUDA_FAST_MATH=1 \
      -D CUDA_ARCH_BIN=7.0,7.5 \
      -D CUDA_ARCH_PTX= \
      -D WITH_CUBLAS=1 \
      -D OPENCV_EXTRA_MODULES_PATH=../../opencv_contrib-4.3.0/modules \
      -DBUILD_opencv_freetype=ON \
      -D HAVE_opencv_python3=ON \
      -D BUILD_TESTS=OFF \
      -D OPENCV_GENERATE_PKGCONFIG=ON \
      -D BUILD_EXAMPLES=OFF ..

My GPU is an NVIDIA Tesla T4.

When I try to train a v4 model with my private dataset, I always get a segmentation fault, even with mosaic disabled.

So I decided to use gdb to debug this issue; here is the gdb output. Could you give me more hints to locate the problem?

Somehow, if I switch off OpenCV support, the error goes away.

gdb ./darknet core.27227

GNU gdb (GDB) Red Hat Enterprise Linux 8.2-3.el7
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/.
Find the GDB manual and other documentation resources online at:
http://www.gnu.org/software/gdb/documentation/.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./darknet...done.
[New LWP 27238]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib64/libthread_db.so.1".
Core was generated by `./darknet detector train work_dir/0507/obj_vehicle_det.data work_dir/0507/yolo4'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f803234e426 in __memcpy_ssse3_back () from /usr/lib64/libc.so.6
[Current thread is 1 (Thread 0x7f8072f7a580 (LWP 27227))]
Missing separate debuginfos, use: debuginfo-install atk-2.28.1-2.el7.x86_64
(gdb) quit

AlexeyAB commented 4 years ago

Compile with this command

make clean
make GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1 LIBSO=0 OPENMP=0 AVX=0 DEBUG=0 -j 8

Run training and show a screenshot of the error screen.

tomingliu commented 4 years ago

The issue still happens with your instructions. Here is the error:

[yolo] params: iou loss: ciou (4), iou_norm: 0.07, cls_norm: 1.00, scale_x_y: 1.05
nms_kind: greedynms (1), beta = 0.600000
Total BFLOPS 127.248
avg_outputs = 1046494
Allocate additional workspace_size = 118.88 MB
Loading weights from work_dir/0507/weights/yolo4_vehicle_det_last.weights...
seen 64, trained: 332 K-images (5 Kilo-batches_64)
Done! Loaded 162 layers from weights-file
Learning Rate: 0.001, Momentum: 0.949, Decay: 0.0005
Resizing, random_coef = 1.40

896 x 896
Create 6 permanent cpu-threads
try to allocate additional workspace_size = 258.14 MB
CUDA allocate done!
Loaded: 9.742663 seconds
Segmentation fault (core dumped)

gdb ./darknet core.28634

GNU gdb (GDB) Red Hat Enterprise Linux 8.2-3.el7
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/.
Find the GDB manual and other documentation resources online at:
http://www.gnu.org/software/gdb/documentation/.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./darknet...done.
[New LWP 28634]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib64/libthread_db.so.1".
Core was generated by `./darknet detector train work_dir/0507/obj_vehicle_det.data work_dir/0507/yolo4'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f4455220a35 in __memcpy_ssse3_back () from /usr/lib64/libc.so.6
[Current thread is 1 (Thread 0x7f4495e4e580 (LWP 28634))]
Missing separate debuginfos, use: debuginfo-install atk-2.28.1-2.el7.x86_64

AlexeyAB commented 4 years ago
tomingliu commented 4 years ago

[root@a006f2fd7e3a darknet]# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
[root@a006f2fd7e3a darknet]# gcc --version
gcc (GCC) 8.3.1 20190311 (Red Hat 8.3.1-3)
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

[root@a006f2fd7e3a darknet]# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Gold 6151 CPU @ 3.00GHz
Stepping:              4
CPU MHz:               3000.000
BogoMIPS:              6000.00
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              25344K
NUMA node0 CPU(s):     0-7
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 arat md_clear spec_ctrl intel_stibp flush_l1d
[root@a006f2fd7e3a darknet]# nvidia-smi
Fri May 8 10:49:35 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:21:01.0 Off |                    0 |
| N/A   59C    P0    31W /  70W |      0MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
[root@a006f2fd7e3a darknet]#

tomingliu commented 4 years ago

Show content of files bad.list and bad_label.list if they exist
[Toming]: I don't have such list files.

Do you get the same issue if you compile with OPENCV=0?
[Toming]: No. As I mentioned before, if I turn off OpenCV support (OPENCV=0), the error goes away.

If you can run it successfully without OpenCV, then try to compile OpenCV with -D CMAKE_BUILD_TYPE=RELEASE -D ENABLE_FAST_MATH=0
[Toming]: Okay, I'll try this setting and let you know the result.

tomingliu commented 4 years ago

If you can run it successfully without OpenCV, then try to compile OpenCV with -D CMAKE_BUILD_TYPE=RELEASE -D ENABLE_FAST_MATH=0
[Toming]: Unfortunately, it doesn't work :-(

v3 (iou loss, Normalizer: (iou: 0.07, cls: 1.00) Region 161 Avg (IOU: 0.868231, GIOU: 0.867077), Class: 0.999611, Obj: 0.741153, No Obj: 0.006386, .5R: 1.000000, .75R: 0.882353, count: 17, class_loss = 1.773138, iou_loss = 8.108455, total_loss = 9.881593
Segmentation fault (core dumped)

AlexeyAB commented 4 years ago

Something is wrong with your OpenCV.

Try to build it with

cmake -D CMAKE_BUILD_TYPE=RELEASE \
      -D CMAKE_INSTALL_PREFIX=/usr \
      -D INSTALL_C_EXAMPLES=OFF \
      -D BUILD_TESTS=OFF \
      -D BUILD_EXAMPLES=OFF ..

then do

make -j8
make install

Then compile Darknet:

make clean
make GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1 LIBSO=0 OPENMP=0 AVX=0 DEBUG=0 -j 8

tomingliu commented 4 years ago

You mean I cannot build OpenCV with CUDA and cuDNN support?

AlexeyAB commented 4 years ago

I don't know. I suggest you find the cause of this error.

cenit commented 4 years ago

A couple of suggestions. First, cmake is case sensitive for defines, so you have to use “Debug” and “Release”, not the uppercase versions; otherwise they are unrecognized and “Debug” is applied. Second, if you want a proper gdb experience, you should also build Darknet with symbols: either with cmake using similar commands, or with make, manually editing the Makefile and adding -g almost everywhere.

tomingliu commented 4 years ago

Dear @cenit @AlexeyAB, thanks for your suggestions! Yes, I'm trying to locate where the error comes from using the gdb debugger, so I have modified the Makefile in the darknet tree to include symbols. I also checked the OpenCV CMake files and found that the build type is converted to the correct case:

./build/CMakeCache.txt:2367:CMAKE_BUILD_TYPE-STRINGS:INTERNAL=Debug;Release

AlexeyAB commented 4 years ago
tomingliu commented 4 years ago

The issue only happens while training a model. Here is my output using the similar command. BTW, my Linux host doesn't have X Window.

darknet detector demo cfg/coco.data cfg/yolov4.cfg yolov4.weights ../yolact/869455049187012-1585776913091.mp4 -out_filename out.mp4 -dont_show

CUDA-version: 10010 (10010), cuDNN: 7.6.5, CUDNN_HALF=1, GPU count: 1
CUDNN_HALF=1
OpenCV version: 4.3.0
Demo
0 : compute_capability = 750, cudnn_half = 1, GPU: Tesla T4
net.optimized_memory = 0
mini_batch = 1, batch = 8, time_steps = 1, train = 0
layer filters size/strd(dil) input output

tomingliu commented 4 years ago

nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243

nvidia-smi

Sat May 9 11:50:05 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:21:01.0 Off |                    0 |
| N/A   37C    P0    27W /  70W |      0MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

tomingliu commented 4 years ago

The program crashes in this function (src/data.c, line 1918), shown here with my debug printf calls added:

void get_next_batch(data d, int n, int offset, float *X, float *y)
{
    int j;
    for(j = 0; j < n; ++j){
        int index = offset + j;
        /* debug traces around each copy; %zu matches the size_t byte count */
        printf("==========> %d@%s: %d, %zu\n", __LINE__, __FILE__, index, d.X.cols*sizeof(float));
        memcpy(X+j*d.X.cols, d.X.vals[index], d.X.cols*sizeof(float));
        printf("==========> %d@%s: %d, %zu\n", __LINE__, __FILE__, index, d.y.cols*sizeof(float));
        memcpy(y+j*d.y.cols, d.y.vals[index], d.y.cols*sizeof(float));
        printf("==========> %d@%s: %d, %d\n", __LINE__, __FILE__, index, offset);
    }
}
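
For anyone debugging the same crash site, here is a minimal sketch of a checked variant of the copy loop above. The `matrix_sketch` type and the `copy_rows_checked` name are hypothetical simplifications, not Darknet's API. It only catches a NULL or out-of-range row; a row pointer that was already freed or overwritten would still crash inside memcpy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a row-pointer matrix. */
typedef struct {
    int rows;
    int cols;
    float **vals;
} matrix_sketch;

/* Copies n rows starting at `offset` into dst, validating each row first.
 * Returns the number of rows copied; stops at the first bad row and
 * reports it on stderr instead of faulting inside memcpy. */
int copy_rows_checked(const matrix_sketch *m, int n, int offset, float *dst)
{
    assert(m != NULL && dst != NULL);
    for (int j = 0; j < n; ++j) {
        int index = offset + j;
        if (index < 0 || index >= m->rows || m->vals[index] == NULL) {
            fprintf(stderr, "bad row %d (rows=%d)\n", index, m->rows);
            return j;
        }
        memcpy(dst + (size_t)j * m->cols, m->vals[index],
               (size_t)m->cols * sizeof(float));
    }
    return n;
}
```

A return value smaller than n points at the first row index that could not be copied.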

AlexeyAB commented 4 years ago

What batch= and subdivisions= values do you use in the cfg-file? Check your training dataset, and remove any empty lines in the train.txt file.
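
A minimal sketch of the empty-line check suggested above, assuming train.txt is a plain text list with one image path per line. `count_blank_lines` is a hypothetical helper, not part of Darknet.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Returns how many lines in fp are empty or contain only whitespace.
 * A final line without a trailing newline is counted only if it holds
 * at least one character and all of them are whitespace. */
int count_blank_lines(FILE *fp)
{
    assert(fp != NULL);
    int blank = 0;
    int line_has_text = 0;   /* saw a non-whitespace character */
    int line_has_chars = 0;  /* saw any character on this line */
    int ch;
    while ((ch = fgetc(fp)) != EOF) {
        if (ch == '\n') {
            if (!line_has_text) ++blank;
            line_has_text = 0;
            line_has_chars = 0;
        } else {
            line_has_chars = 1;
            if (!isspace((unsigned char)ch)) line_has_text = 1;
        }
    }
    /* Handle a last line that has no newline terminator. */
    if (line_has_chars && !line_has_text) ++blank;
    return blank;
}
```

Called on a handle from fopen("train.txt", "r"), a non-zero result means the list still has lines to remove.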

tomingliu commented 4 years ago

Okay, let me check train.txt right now, just a second. Please see my cfg file:

batch=64
subdivisions=32
width=608
height=608
channels=3
momentum=0.949
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1
mixup=1

learning_rate=0.001
burn_in=1000
max_batches = 20000
policy=steps
steps=4800,8000
scales=.1,.1

cutmix=1
mosaic=1

tomingliu commented 4 years ago

My train.txt doesn't have any empty lines. It has many long filenames, but I don't think that is the root cause:

/home/guest/toming/darknet/JPEGImages/7d276e0b420f82e0bd635370e66bd9c52ebba514d84b689b8d69552218e83d77-1584333158408.jpg

AlexeyAB commented 4 years ago

Remove the cutmix=1 line from the cfg-file.

tomingliu commented 4 years ago

It works now after removing cutmix. As far as I know, cutmix is a good data augmentation method, so this sounds like a bug related to OpenCV.

AlexeyAB commented 4 years ago

There is no bug in CutMix. CutMix is supported only for Classifier, not for Detector. But for some reason this error message isn't shown in some cases: https://github.com/AlexeyAB/darknet/blob/da3de2bdf90f21e847536f1ead7c1df14f83f3eb/src/data.c#L955-L958

Try adding an fflush(stdout); line between these two lines and recompile: https://github.com/AlexeyAB/darknet/blob/da3de2bdf90f21e847536f1ead7c1df14f83f3eb/src/data.c#L956-L957

Do you see such an error message? [image]
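
A minimal sketch of the print-then-flush pattern suggested above. `report_and_flush` is a hypothetical helper written for illustration; it is not a Darknet function, and the call to exit is left to the caller so the helper can be exercised on its own.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: writes msg to out and flushes immediately, so the
 * text leaves the stdio buffer even if the process dies right afterwards.
 * Returns 0 on success, EOF on failure. */
int report_and_flush(FILE *out, const char *msg)
{
    assert(out != NULL && msg != NULL);
    if (fputs(msg, out) == EOF) return EOF;
    return fflush(out);
}
```

In the data.c check this would correspond to the printf call, followed by fflush(stdout), followed by exit(0).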

tomingliu commented 4 years ago

No, I don't see such an error message; otherwise I would have removed cutmix. I will try fflush and let you know.

tomingliu commented 4 years ago

I tried printing use_mixup. This is what I get:

==========> 954@./src/data.c: 4
==========> 954@./src/data.c: 4
==========> 954@./src/data.c: 4
==========> 954@./src/data.c: 4
...

data load_data_detection(int n, char **paths, int m, int w, int h, int c, int boxes, int classes, int use_flip, int use_gaussian_noise, int use_blur, int use_mixup,
    float jitter, float hue, float saturation, float exposure, int mini_batch, int track, int augment_speed, int letter_box, int show_imgs)
{
    const int random_index = random_gen();
    c = c ? c : 3;
    printf("==========> %d@%s: %d\n", __LINE__, __FILE__, use_mixup);
    if (use_mixup == 2) {
        printf("\n cutmix=1 - isn't supported for Detector \n");
        exit(0);
    }
    if (use_mixup == 3 && letter_box) {
        printf("\n Combination: letter_box=1 & mosaic=1 - isn't supported, use only 1 of these parameters \n");
        exit(0);
    }

tomingliu commented 4 years ago

I modified the code as below:

    if (use_mixup == 2 || use_mixup == 4) {
        printf("\n cutmix=1 - isn't supported for Detector \n");
        exit(0);
    }

The error message is similar:

==========> 954@./src/data.c: 4

cutmix=1 - isn't supported for Detector
==========> 954@./src/data.c: 4

cutmix=1 - isn't supported for Detector
==========> 954@./src/data.c: 4

cutmix=1 - isn't supported for Detector
Error in `./darknet': double free or corruption (!prev): 0x0000000002ceafe0
Error in `./darknet': corrupted double-linked list: 0x0000000002ceafd0
==========> 954@./src/data.c: 4

cutmix=1 - isn't supported for Detector
Segmentation fault (core dumped)
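
The traces above show use_mixup arriving as 4, which the posted `use_mixup == 2` check does not match. Here is a minimal sketch of why, assuming an encoding inferred only from this thread: 2 for cutmix, 3 for mosaic, 4 when both are set. `encode_use_mixup` and the two check helpers are hypothetical names; the real parsing code may differ.

```c
#include <assert.h>

/* Assumed encoding, inferred from this thread only. */
enum { MIX_CUTMIX = 2, MIX_MOSAIC = 3, MIX_CUTMIX_MOSAIC = 4 };

/* Hypothetical: folds the three cfg flags into one use_mixup value. */
int encode_use_mixup(int mixup, int cutmix, int mosaic)
{
    assert(mixup >= 0);
    if (cutmix && mosaic) return MIX_CUTMIX_MOSAIC;
    if (cutmix) return MIX_CUTMIX;
    if (mosaic) return MIX_MOSAIC;
    return mixup;
}

/* The check as posted: only the value 2 is rejected. */
int original_check_rejects(int use_mixup)
{
    return use_mixup == 2;
}

/* The widened check tried later in the thread: 2 or 4 is rejected. */
int widened_check_rejects(int use_mixup)
{
    return use_mixup == 2 || use_mixup == 4;
}
```

With the cfg posted earlier (mixup=1, cutmix=1, mosaic=1), this sketch yields 4, so the original check lets the unsupported combination through and only the widened one stops it.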

AlexeyAB commented 4 years ago

I fixed it: https://github.com/AlexeyAB/darknet/commit/6ad485c0c163798a882f64c71d013dc80460b696

tomingliu commented 4 years ago

Okay, merged into my local base. Here is the patch for your reference:

diff --git a/src/image_opencv.cpp b/src/image_opencv.cpp
index 4e852db..e635b30 100644
--- a/src/image_opencv.cpp
+++ b/src/image_opencv.cpp
@@ -1015,6 +1015,7 @@ extern "C" mat_cv* draw_train_chart(char *windows_name, float max_img_loss, int
     try {
         // load chart from file
         if (chart_path != NULL && chart_path[0] != '\0') {
+            release_mat((mat_cv**)&img_ptr);
             *img_ptr = cv::imread(chart_path);
         }
         else {

tomingliu commented 4 years ago

Issue closed after applying this patch.

tomingliu commented 4 years ago

My mistake, sorry! Yes, I see, it will seg-fault. Thanks again!

Tianyi-THU commented 4 years ago

I got the same error message ("double free or corruption") when using mosaic=1. Is there any update on this issue, please?

AlexeyAB commented 4 years ago

Use the latest version of Darknet.