tomingliu closed this issue 4 years ago
Compile with these commands:
make clean
make GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1 LIBSO=0 OPENMP=0 AVX=0 DEBUG=0 -j 8
Run training and show screenshot of the error screen.
The issue still occurs after following your instructions. Here is the error:
[yolo] params: iou loss: ciou (4), iou_norm: 0.07, cls_norm: 1.00, scale_x_y: 1.05
nms_kind: greedynms (1), beta = 0.600000
Total BFLOPS 127.248
avg_outputs = 1046494
Allocate additional workspace_size = 118.88 MB
Loading weights from work_dir/0507/weights/yolo4_vehicle_det_last.weights...
seen 64, trained: 332 K-images (5 Kilo-batches_64)
Done! Loaded 162 layers from weights-file
Learning Rate: 0.001, Momentum: 0.949, Decay: 0.0005
Resizing, random_coef = 1.40
896 x 896
Create 6 permanent cpu-threads
try to allocate additional workspace_size = 258.14 MB
CUDA allocate done!
Loaded: 9.742663 seconds
Segmentation fault (core dumped)
GNU gdb (GDB) Red Hat Enterprise Linux 8.2-3.el7
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see: http://www.gnu.org/software/gdb/bugs/.
Find the GDB manual and other documentation resources online at: http://www.gnu.org/software/gdb/documentation/.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./darknet...done.
[New LWP 28634]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib64/libthread_db.so.1".
Core was generated by `./darknet detector train work_dir/0507/obj_vehicle_det.data work_dir/0507/yolo4'.
Program terminated with signal SIGSEGV, Segmentation fault.
[Current thread is 1 (Thread 0x7f4495e4e580 (LWP 28634))]
Missing separate debuginfos, use: debuginfo-install atk-2.28.1-2.el7.x86_64
What CPU do you use?
Show output of commands:
nvcc --version
gcc --version
lscpu
nvidia-smi
Show content of files bad.list and bad_label.list if they exist
Do you get the same issue if you compile with OPENCV=0?
If you can run it successfully without OpenCV, then try to compile OpenCV with
-D CMAKE_BUILD_TYPE=RELEASE -D ENABLE_FAST_MATH=0
[root@a006f2fd7e3a darknet]# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
[root@a006f2fd7e3a darknet]# gcc --version
gcc (GCC) 8.3.1 20190311 (Red Hat 8.3.1-3)
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
[root@a006f2fd7e3a darknet]# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6151 CPU @ 3.00GHz
Stepping: 4
CPU MHz: 3000.000
BogoMIPS: 6000.00
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 25344K
NUMA node0 CPU(s): 0-7
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 arat md_clear spec_ctrl intel_stibp flush_l1d
[root@a006f2fd7e3a darknet]# nvidia-smi
Fri May 8 10:49:35 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67 Driver Version: 418.67 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:21:01.0 Off | 0 |
| N/A 59C P0 31W / 70W | 0MiB / 15079MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
[root@a006f2fd7e3a darknet]#
Show content of files bad.list and bad_label.list if they exist [Toming]: I don't have any such list files.
Do you get the same issue if you compile with OPENCV=0? [Toming]: No. As I mentioned before, if I turn off OpenCV support (OPENCV=0), the error goes away.
If you can run it successfully without OpenCV, then try to compile OpenCV with -D CMAKE_BUILD_TYPE=RELEASE -D ENABLE_FAST_MATH=0 [Toming]: Okay, I'll try this setting and let you know the result.
If you can run it successfully without OpenCV, then try to compile OpenCV with -D CMAKE_BUILD_TYPE=RELEASE -D ENABLE_FAST_MATH=0 [Toming]: Unfortunately, it doesn't work. :(
v3 (iou loss, Normalizer: (iou: 0.07, cls: 1.00) Region 161 Avg (IOU: 0.868231, GIOU: 0.867077), Class: 0.999611, Obj: 0.741153, No Obj: 0.006386, .5R: 1.000000, .75R: 0.882353, count: 17, class_loss = 1.773138, iou_loss = 8.108455, total_loss = 9.881593
Segmentation fault (core dumped)
Something is wrong with your OpenCV build.
Try to build it with
cmake -D CMAKE_BUILD_TYPE=RELEASE \
  -D CMAKE_INSTALL_PREFIX=/usr \
  -D INSTALL_C_EXAMPLES=OFF \
  -D BUILD_TESTS=OFF \
  -D BUILD_EXAMPLES=OFF ..
then do
make -j8
make install
Then compile Darknet:
make clean
make GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1 LIBSO=0 OPENMP=0 AVX=0 DEBUG=0 -j 8
Do you mean I cannot build OpenCV with CUDA and cuDNN support?
I don't know. I suggest you find the cause of this error.
A couple of suggestions. First, CMake is case-sensitive for defines, so you have to use “Debug” and “Release” rather than the uppercase versions; otherwise they are unrecognized and “Debug” is applied. Second, if you want a proper gdb experience, you should also build Darknet with symbols, either through cmake with similar options or through make, by manually editing the Makefile and adding -g almost everywhere.
Dear @cenit @AlexeyAB, thanks for your suggestions! Yes, I'm trying to locate where the error comes from using the gdb debugger, so I have modified the Makefile in the Darknet tree to include symbols. I also checked the OpenCV CMake cache and found that it converts the build type to the correct case:
./build/CMakeCache.txt:2367:CMAKE_BUILD_TYPE-STRINGS:INTERNAL=Debug;Release
Show such screenshot:
Show output of
nvcc --version
nvidia-smi
The issue only happens while training a model. Here is my output from a similar command. By the way, my Linux host doesn't have the X Window System.
darknet detector demo cfg/coco.data cfg/yolov4.cfg yolov4.weights ../yolact/869455049187012-1585776913091.mp4 -out_filename out.mp4 -dont_show
CUDA-version: 10010 (10010), cuDNN: 7.6.5, CUDNN_HALF=1, GPU count: 1
CUDNN_HALF=1
OpenCV version: 4.3.0
Demo
0 : compute_capability = 750, cudnn_half = 1, GPU: Tesla T4
net.optimized_memory = 0
mini_batch = 1, batch = 8, time_steps = 1, train = 0
layer filters size/strd(dil) input output
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
nvidia-smi
Sat May 9 11:50:05 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67 Driver Version: 418.67 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:21:01.0 Off | 0 |
| N/A 37C P0 27W / 70W | 0MiB / 15079MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
The program crashes in this function (src/data.c, line 1918), shown here with my debug prints added:
void get_next_batch(data d, int n, int offset, float *X, float *y)
{
    int j;
    for(j = 0; j < n; ++j){
        int index = offset + j;
        /* debug trace; %zu is used because sizeof yields a size_t */
        printf("==========> %d@%s: %d, %zu\n", __LINE__, __FILE__, index, d.X.cols*sizeof(float));
        memcpy(X+j*d.X.cols, d.X.vals[index], d.X.cols*sizeof(float));
        printf("==========> %d@%s: %d, %zu\n", __LINE__, __FILE__, index, d.y.cols*sizeof(float));
        memcpy(y+j*d.y.cols, d.y.vals[index], d.y.cols*sizeof(float));
        printf("==========> %d@%s: %d, %d\n", __LINE__, __FILE__, index, offset);
    }
}
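To narrow down which row triggers the crash inside memcpy, one option is a defensive variant of the copy loop that validates the range and the row pointers before copying. The sketch below is only an illustration: `matrix_sketch`, `data_sketch` and `get_next_batch_checked` are hypothetical, simplified stand-ins and not Darknet's actual definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for Darknet's matrix and data structs. */
typedef struct { int rows, cols; float **vals; } matrix_sketch;
typedef struct { matrix_sketch X, y; } data_sketch;

/* Copies n samples starting at row `offset` into X and y.
 * Returns 0 on success, or -1 without copying anything if the requested
 * rows are out of range or any row pointer is NULL. */
int get_next_batch_checked(data_sketch d, int n, int offset, float *X, float *y)
{
    assert(X != NULL && y != NULL);
    if (n < 0 || offset < 0) return -1;
    if (offset > d.X.rows - n || offset > d.y.rows - n) return -1;

    /* First pass: make sure every source row exists. */
    for (int j = 0; j < n; ++j) {
        if (!d.X.vals[offset + j] || !d.y.vals[offset + j]) return -1;
    }
    /* Second pass: the same copies as the original loop. */
    for (int j = 0; j < n; ++j) {
        int index = offset + j;
        memcpy(X + (size_t)j * d.X.cols, d.X.vals[index], (size_t)d.X.cols * sizeof(float));
        memcpy(y + (size_t)j * d.y.cols, d.y.vals[index], (size_t)d.y.cols * sizeof(float));
    }
    return 0;
}
```

A caller could log the offending `offset` and `n` when the function returns -1 instead of letting memcpy dereference an invalid pointer.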
What batch= and subdivisions= do you use in cfg-file? Check your training dataset, and remove empty lines in train.txt file
Okay, let me check train.txt right now, just a second. Please see my cfg file:
batch=64
subdivisions=32
width=608
height=608
channels=3
momentum=0.949
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1
mixup=1
learning_rate=0.001
burn_in=1000
max_batches = 20000
policy=steps
steps=4800,8000
scales=.1,.1
cutmix=1
mosaic=1
My train.txt doesn't have any empty lines. It has many long filenames, but I don't think this is the root cause. For example:
/home/guest/toming/darknet/JPEGImages/7d276e0b420f82e0bd635370e66bd9c52ebba514d84b689b8d69552218e83d77-1584333158408.jpg
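For a large list file, checking for empty lines by eye is error-prone. The following is a minimal sketch of a checker; `is_blank_line` and `count_blank_lines` are hypothetical helper names, and the code assumes every line in the list is shorter than 4096 bytes.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if the line is empty or holds only spaces, tabs, CR or LF. */
int is_blank_line(const char *line)
{
    assert(line != NULL);
    return line[strspn(line, " \t\r\n")] == '\0';
}

/* Counts blank lines in an already opened list file such as train.txt.
 * Assumes each line fits into the 4096-byte buffer. */
int count_blank_lines(FILE *fp)
{
    assert(fp != NULL);
    char line[4096];
    int blank = 0;
    while (fgets(line, sizeof line, fp)) {
        if (is_blank_line(line)) ++blank;
    }
    return blank;
}
```

Opening the list with `fopen(path, "r")` and passing the handle to `count_blank_lines` gives a quick yes/no answer; a result of 0 means blank lines can be ruled out as the cause.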
Remove the cutmix=1 line from the cfg-file.
It works now after removing cutmix. As far as I know, cutmix is a good data augmentation method, so this sounds like a bug related to OpenCV.
There is no bug in CutMix. CutMix is supported only for the Classifier, not for the Detector. But for some reason this error message isn't shown in some cases: https://github.com/AlexeyAB/darknet/blob/da3de2bdf90f21e847536f1ead7c1df14f83f3eb/src/data.c#L955-L958
Try to add a fflush(stdout); line between these two lines and recompile:
https://github.com/AlexeyAB/darknet/blob/da3de2bdf90f21e847536f1ead7c1df14f83f3eb/src/data.c#L956-L957
Do you see such error message?
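The reason for suggesting fflush is that stdio output can sit in a buffer, so text written just before a process dies abnormally may never reach the terminal or log. A minimal sketch of the idea follows; `print_and_flush` is a hypothetical helper, not a Darknet function.

```c
#include <assert.h>
#include <stdio.h>

/* Writes msg to `out` and flushes immediately, so the text is not left
 * in the stdio buffer if the process terminates shortly afterwards.
 * Returns 0 on success, -1 if writing or flushing failed. */
int print_and_flush(FILE *out, const char *msg)
{
    assert(out != NULL && msg != NULL);
    if (fputs(msg, out) == EOF) return -1;
    return fflush(out) == 0 ? 0 : -1;
}
```

In the check discussed here, this would correspond to calling `print_and_flush(stdout, "\n cutmix=1 - isn't supported for Detector \n")` right before `exit(0)`.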
No, I don't see that error message; if I had, I would probably have removed cutmix already. I will try fflush and let you know.
I tried printing use_mixup, and this is what I get:
==========> 954@./src/data.c: 4
==========> 954@./src/data.c: 4
==========> 954@./src/data.c: 4
==========> 954@./src/data.c: 4
...
data load_data_detection(int n, char **paths, int m, int w, int h, int c, int boxes, int classes, int use_flip, int use_gaussian_noise, int use_blur, int use_mixup,
    float jitter, float hue, float saturation, float exposure, int mini_batch, int track, int augment_speed, int letter_box, int show_imgs)
{
    const int random_index = random_gen();
    c = c ? c : 3;
    printf("==========> %d@%s: %d\n", __LINE__, __FILE__, use_mixup);
    if (use_mixup == 2) {
        printf("\n cutmix=1 - isn't supported for Detector \n");
        exit(0);
    }
    if (use_mixup == 3 && letter_box) {
        printf("\n Combination: letter_box=1 & mosaic=1 - isn't supported, use only 1 of these parameters \n");
        exit(0);
    }
I modified the code as below:
    if (use_mixup == 2 || use_mixup == 4) {
        printf("\n cutmix=1 - isn't supported for Detector \n");
        exit(0);
    }
The message is now printed, but the program still crashes:
==========> 954@./src/data.c: 4
cutmix=1 - isn't supported for Detector ==========> 954@./src/data.c: 4
cutmix=1 - isn't supported for Detector ==========> 954@./src/data.c: 4
cutmix=1 - isn't supported for Detector Error in `./darknet': double free or corruption (!prev): 0x0000000002ceafe0 Error in `./darknet': corrupted double-linked list: 0x0000000002ceafd0 ==========> 954@./src/data.c: 4
cutmix=1 - isn't supported for Detector Segmentation fault (core dumped)
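One possible reading of the output above is that several data-loading threads reach the exit(0) call at about the same time, which would explain the repeated message followed by heap corruption. Under that assumption, a cleaner approach is to validate the augmentation settings once, before any loader thread starts. The sketch below is hypothetical: the function name is invented, and the use_mixup codes (2 = cutmix, 3 = mosaic, 4 = cutmix together with mosaic) are inferred from the values printed in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* use_mixup codes as inferred from this thread (an assumption):
 * 2 = cutmix, 3 = mosaic, 4 = cutmix together with mosaic. */
enum { MIX_CUTMIX = 2, MIX_MOSAIC = 3, MIX_CUTMIX_MOSAIC = 4 };

/* Returns NULL when the settings are usable for detector training,
 * otherwise a static string describing the unsupported combination. */
const char *detector_augmentation_error(int use_mixup, int letter_box)
{
    assert(use_mixup >= 0);
    if (use_mixup == MIX_CUTMIX || use_mixup == MIX_CUTMIX_MOSAIC)
        return "cutmix=1 - isn't supported for Detector";
    if (use_mixup == MIX_MOSAIC && letter_box)
        return "Combination: letter_box=1 & mosaic=1 - isn't supported, use only 1 of these parameters";
    return NULL;
}
```

The training entry point could call this once after parsing the cfg-file, print the returned message, and stop before creating any threads, so the process exits from a single thread only.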
Okay, I merged it into my local base. Here is the patch for your reference:
diff --git a/src/image_opencv.cpp b/src/image_opencv.cpp
index 4e852db..e635b30 100644
--- a/src/image_opencv.cpp
+++ b/src/image_opencv.cpp
@@ -1015,6 +1015,7 @@ extern "C" mat_cv* draw_train_chart(char *windows_name, float max_img_loss, int
try {
// load chart from file
if (chart_path != NULL && chart_path[0] != '\0') {
+ release_mat((mat_cv**)&img_ptr);
*img_ptr = cv::imread(chart_path);
}
else {
The issue was closed after applying this patch.
My mistake, sorry! Yes, I see, it will cause a segfault. Thanks again!
I got the same error message ("double free or corruption") when using mosaic=1. Is there any update on this issue, please?
Use the latest version of Darknet.
Dear @AlexeyAB, thanks for your great work! Could you please help me with the issue below? Thanks in advance.
I'm building the newest Darknet tree on CentOS 7 using the command below. The build is successful.
make -j8 GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1 LIBSO=1 AVX=1 DEBUG=1
I also built OpenCV 4.3.0 using the command below:
cmake -D CMAKE_BUILD_TYPE=DEBUG \
  -D CMAKE_INSTALL_PREFIX=/usr \
  -D INSTALL_PYTHON_EXAMPLES=ON \
  -D INSTALL_C_EXAMPLES=OFF \
  -D OPENCV_ENABLE_NONFREE=ON \
  -D WITH_CUDA=ON \
  -D WITH_CUDNN=ON \
  -D WITH_CUFFT=ON \
  -D OPENCV_DNN_CUDA=ON \
  -D ENABLE_FAST_MATH=1 \
  -D CUDA_FAST_MATH=1 \
  -D CUDA_ARCH_BIN=7.0,7.5 \
  -D CUDA_ARCH_PTX= \
  -D WITH_CUBLAS=1 \
  -D OPENCV_EXTRA_MODULES_PATH=../../opencv_contrib-4.3.0/modules -DBUILD_opencv_freetype=ON \
  -D HAVE_opencv_python3=ON \
  -D BUILD_TESTS=OFF \
  -D OPENCV_GENERATE_PKGCONFIG=ON \
  -D BUILD_EXAMPLES=OFF ..
My GPU is an NVIDIA Tesla T4.
When I try to train a v4 model with my private dataset, I always get a segmentation fault, even with mosaic disabled.
So I decided to use gdb to debug this issue; here is the gdb output. Could you give me more hints for locating the problem?
Somehow, if I switch off OpenCV support, the error goes away.
gdb ./darknet core.27227
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./darknet...done.
[New LWP 27238]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib64/libthread_db.so.1".
Core was generated by `./darknet detector train work_dir/0507/obj_vehicle_det.data work_dir/0507/yolo4'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f803234e426 in __memcpy_ssse3_back () from /usr/lib64/libc.so.6
[Current thread is 1 (Thread 0x7f8072f7a580 (LWP 27227))]
Missing separate debuginfos, use: debuginfo-install atk-2.28.1-2.el7.x86_64
(gdb) quit