AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet )
http://pjreddie.com/darknet/

core dump / crash while training... #2073

Open kooscode opened 5 years ago

kooscode commented 5 years ago
 200299: 0.646003, 0.611384 avg loss, 0.000010 rate, 1.725003 seconds, 25638272 images
Loaded: 0.000021 seconds
Region Avg IOU: 0.603030, Class: 0.983126, Obj: 0.497180, No Obj: 0.004199, Avg Recall: 0.800000,  count: 5
Region Avg IOU: 0.350961, Class: 0.967752, Obj: 0.126551, No Obj: 0.002759, Avg Recall: 0.400000,  count: 25
Region Avg IOU: 0.699325, Class: 0.821471, Obj: 0.241048, No Obj: 0.003465, Avg Recall: 1.000000,  count: 7
Region Avg IOU: 0.643683, Class: 0.881017, Obj: 0.222286, No Obj: 0.003888, Avg Recall: 0.916667,  count: 12
Region Avg IOU: 0.582219, Class: 0.998859, Obj: 0.344068, No Obj: 0.005522, Avg Recall: 0.651163,  count: 43
Region Avg IOU: 0.521191, Class: 0.777599, Obj: 0.138935, No Obj: 0.003946, Avg Recall: 0.666667,  count: 12
Region Avg IOU: 0.632815, Class: 0.871974, Obj: 0.264270, No Obj: 0.005056, Avg Recall: 0.888889,  count: 18
Region Avg IOU: 0.625067, Class: 0.954754, Obj: 0.362314, No Obj: 0.003551, Avg Recall: 0.857143,  count: 7
Region Avg IOU: 0.721166, Class: 0.711777, Obj: 0.359061, No Obj: 0.003997, Avg Recall: 0.888889,  count: 9
Region Avg IOU: 0.740996, Class: 0.967526, Obj: 0.148962, No Obj: 0.004024, Avg Recall: 0.666667,  count: 3
Region Avg IOU: 0.477621, Class: 0.888341, Obj: 0.162392, No Obj: 0.004112, Avg Recall: 0.562500,  count: 16
Region Avg IOU: 0.685298, Class: 0.928768, Obj: 0.221955, No Obj: 0.003680, Avg Recall: 0.800000,  count: 5
Region Avg IOU: 0.653281, Class: 0.863670, Obj: 0.436116, No Obj: 0.004676, Avg Recall: 0.916667,  count: 12
Region Avg IOU: 0.611492, Class: 0.989244, Obj: 0.459554, No Obj: 0.005161, Avg Recall: 0.777778,  count: 18
Region Avg IOU: 0.762800, Class: 0.954189, Obj: 0.176579, No Obj: 0.004138, Avg Recall: 1.000000,  count: 9
Region Avg IOU: 0.621170, Class: 0.640607, Obj: 0.185495, No Obj: 0.003709, Avg Recall: 0.700000,  count: 10

 200300: 0.466167, 0.596862 avg loss, 0.000010 rate, 1.728509 seconds, 25638400 images
terminate called after throwing an instance of 'cv::Exception'
  what():  OpenCV(3.4.3) /data/sources/opencv/modules/core/src/matrix_wrap.cpp:800: error: (-215:Assertion failed) (flags & FIXED_TYPE) != 0 in function 'type'

kooscode commented 5 years ago

gdb back trace:

(gdb) bt
#0  0x00007f800649fe97 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007f80064a1801 in __GI_abort () at abort.c:79
#2  0x00007f80072dd8b7 in  () at /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007f80072e3a06 in  () at /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f80072e3a41 in  () at /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007f80072e3c74 in  () at /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f8025312dca in cv::error(cv::Exception const&) () at /usr/local/lib/libopencv_core.so.3.4
#7  0x00007f8025313c6f in cv::error(int, cv::String const&, char const*, char const*, int) ()
    at /usr/local/lib/libopencv_core.so.3.4
#8  0x00007f8025245579 in _ZN2cvL13errorNoReturnEiRKNS_6StringEPKcS4_i () at /usr/local/lib/libopencv_core.so.3.4
#9  0x00007f8025247fa7 in cv::_InputArray::type(int) const () at /usr/local/lib/libopencv_core.so.3.4
#10 0x00007f802511251c in cv::Mat::copyTo(cv::_OutputArray const&) const () at /usr/local/lib/libopencv_core.so.3.4
#11 0x00007f80290be7c4 in cvSaveImage () at /usr/local/lib/libopencv_imgcodecs.so.3.4
#12 0x0000557b27b8826f in draw_train_loss ()
#13 0x0000557b27bc0318 in train_detector ()
#14 0x0000557b27bc1f20 in run_detector ()
#15 0x0000557b27b51191 in main ()

kooscode commented 5 years ago

FYI: there is a bug in OpenCV's retired C API.

Here is a fix. It writes a PNG instead of a JPG, but at least it no longer crashes.

I created a pull request as well.

In the "draw_train_loss" function in "image.c", replace:

cvSaveImage("chart.jpg", img, 0);

with

stbi_write_png("chart.png", img->width, img->height, 3,  (char *)img->imageData, 0);
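
One caveat with the replacement call may be worth sketching. An `IplImage` stores pixels in BGR order with rows `widthStep` bytes apart, while `stbi_write_png` expects RGB data and, when its last argument (`stride_in_bytes`) is 0, assumes tightly packed rows. If `widthStep` differs from `width * 3`, or if the chart colors matter, repacking the buffer first might help. The helper below is only a minimal sketch under those assumptions; `bgr_rows_to_packed_rgb` is a hypothetical name, not a darknet or stb function.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not part of darknet or stb): copies a 3-channel BGR
 * image whose rows start `src_stride` bytes apart into a newly allocated,
 * tightly packed RGB buffer of width * height * 3 bytes.
 * Returns NULL on invalid arguments or allocation failure.
 * The caller owns the result and must free() it.
 */
unsigned char *bgr_rows_to_packed_rgb(const unsigned char *src,
                                      int width, int height, int src_stride)
{
    if (src == NULL || width <= 0 || height <= 0 || src_stride <= 0)
        return NULL;

    size_t row_bytes = (size_t)width * 3;
    if ((size_t)src_stride < row_bytes)
        return NULL; /* a row cannot be shorter than its pixel data */

    unsigned char *dst = malloc(row_bytes * (size_t)height);
    if (dst == NULL)
        return NULL;

    for (int y = 0; y < height; ++y) {
        const unsigned char *s = src + (size_t)y * (size_t)src_stride;
        unsigned char *d = dst + (size_t)y * row_bytes;
        for (int x = 0; x < width; ++x) {
            d[3 * x + 0] = s[3 * x + 2]; /* R taken from the B,G,R triple */
            d[3 * x + 1] = s[3 * x + 1]; /* G stays in the middle */
            d[3 * x + 2] = s[3 * x + 0]; /* B */
        }
    }
    return dst;
}
```

Assuming `img` is the `IplImage *` used in the line above, the result could be passed as `stbi_write_png("chart.png", img->width, img->height, 3, rgb, 0)`, where `rgb` is the buffer returned for `(unsigned char *)img->imageData`, `img->width`, `img->height`, and `img->widthStep`, followed by `free(rgb)`.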

Sephirot1st commented 5 years ago

This repository does not support OpenCV versions higher than 3.4.0; maybe you should recompile without OpenCV.

kooscode commented 5 years ago

@Sephirot1st Like I said, I fixed the code and it's working now.

lvshuaigg commented 5 years ago

@kooscode I have the same problem. Does your change get rid of the errors during training?

kooscode commented 5 years ago

The only change is that the graph is saved as a PNG file.

AlexeyAB commented 5 years ago

@lvshuaigg

Try to change this line: https://github.com/AlexeyAB/darknet/blob/95773cfb423266b9ac6aeea54e862db5817b5447/src/image.c#L773 to this: stbi_write_png("chart.png", img->width, img->height, 3, (char *)img->imageData, 0);

Does it solve your problem?

lvshuaigg commented 5 years ago

I also hit this problem after thousands of training iterations. After changing the line to yours, I had no problems in the training that followed.

AlexeyAB commented 5 years ago

@lvshuaigg I added this fix: https://github.com/AlexeyAB/darknet/commit/dc827f4c1c49907d7061c63fdc9a634cc82a43d7#diff-d63433b66e54cd65ded01be20041119cR776

Try to update your code from GitHub. Does it solve your problem?

lvshuaigg commented 5 years ago

@AlexeyAB [screenshot: qq 20181226195139]

Hello, I have added another 8 GB memory module, for a total of 16 GB, but when training runs past about 20,000 iterations, there is still a problem. I am using the latest version of the repository.

AlexeyAB commented 5 years ago

@lvshuaigg I think your message is related to this Issue: https://github.com/AlexeyAB/darknet/issues/2110