AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet )
http://pjreddie.com/darknet/

VOC training crash at the very beginning on Windows #2124

Open syyao98 opened 5 years ago

syyao98 commented 5 years ago

Hardware

Intel i7-6700HQ, Nvidia 970M (6 GB), 16 GB RAM

Software

Windows 10, VS2015, CUDA 10, cuDNN 7.4.1, OpenCV 3.2

Build option

Opened darknet.sln and built it successfully (Release, x64).

Problem

I followed the instructions in "Train a Classifier on CIFAR-10" and trained the classifier successfully, so I guess classifier.c works well. But when I followed the VOC training instructions, either AlexeyAB's or pjreddie's, training crashed at the very beginning.

Troubleshooting

I read some related issues and did some troubleshooting, but couldn't locate the error.

File cfg/voc.data:

classes=20
train  = data/voc/train.txt
valid  = data/voc/2007_test.txt
names = data/voc.names
backup = backup/

AlexeyAB commented 5 years ago

@syyao98 Hi,

Using this command:

darknet.exe detector map data/voc.data cfg/yolov2-voc.cfg yolo-voc.weights

opencv_world340.dll, opencv_ffmpeg340_64.dll, cudnn64_7.dll, cublas64_100.dll, cudart64_100.dll, cufft64_100.dll, cusolver64_100.dll, cusparse64_100.dll, pthreadVC2.dll, pthreadGC2.dll

I can successfully train yolov2-voc.cfg on the Pascal VOC dataset:

darknet.exe detector train data/voc.data cfg/yolov2-voc.cfg darknet19_448.conv.23 -dont_show

File data/voc.data:

classes= 20
train  = data/train_voc.txt
valid  = data/2007_test.txt
#difficult = data/difficult_2007_test.txt
names = data/voc.names
backup = backup/
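
To compare two voc.data files like the ones in this thread, it can help to read them as plain key = value pairs. The following is a minimal sketch in Python, not Darknet's own parser: parse_data_file is a hypothetical helper, and treating lines that start with # as comments is an assumption based on the commented-out difficult entry above.

```python
def parse_data_file(text):
    """Parse 'key = value' lines, skipping blank lines and '#' comments."""
    options = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or commented-out option
        key, separator, value = line.partition("=")
        if not separator:
            continue  # not a key = value line
        options[key.strip()] = value.strip()
    return options


# The voc.data content shown above, used as sample input.
SAMPLE = """classes= 20
train  = data/train_voc.txt
valid  = data/2007_test.txt
#difficult = data/difficult_2007_test.txt
names = data/voc.names
backup = backup/
"""

options = parse_data_file(SAMPLE)
for key in ("train", "valid", "names", "backup"):
    print(key, "->", options[key])
```

Printing the parsed paths for both the working and the crashing setup makes it easy to spot a train or valid entry that points to the wrong place.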

File data/train_voc.txt:

E:\VOC2007_2012\VOCtrainval_11-May-2012/VOCdevkit/VOC2007/JPEGImages/000012.jpg
E:\VOC2007_2012\VOCtrainval_11-May-2012/VOCdevkit/VOC2007/JPEGImages/000017.jpg
E:\VOC2007_2012\VOCtrainval_11-May-2012/VOCdevkit/VOC2007/JPEGImages/000023.jpg
E:\VOC2007_2012\VOCtrainval_11-May-2012/VOCdevkit/VOC2007/JPEGImages/000026.jpg
E:\VOC2007_2012\VOCtrainval_11-May-2012/VOCdevkit/VOC2007/JPEGImages/000032.jpg
E:\VOC2007_2012\VOCtrainval_11-May-2012/VOCdevkit/VOC2007/JPEGImages/000033.jpg
...
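
Since the list file holds one absolute image path per line, a quick sanity check is to confirm that every listed file really exists on the machine that runs the training. A minimal sketch, assuming the list is UTF-8 text; find_missing_images is a hypothetical helper and not part of Darknet:

```python
from pathlib import Path


def find_missing_images(list_file):
    """Return the entries of an image list file that do not exist on disk."""
    missing = []
    with open(list_file, encoding="utf-8") as handle:
        for line in handle:
            entry = line.strip()
            if not entry:
                continue  # skip blank lines
            if not Path(entry).is_file():
                missing.append(entry)
    return missing
```

For example, calling find_missing_images("data/train_voc.txt") returns an empty list when every image in the list can be found. This only checks that the files exist, not that they decode as valid images.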

[image]

syyao98 commented 5 years ago

@AlexeyAB

With nothing changed

Using the script calc_mAP_voc_py.cmd: [image]

It looks like the evaluation procedure works well. [image]

Using darknet.exe detector map cfg/voc.data cfg/yolov2-voc.cfg weights/yolo-voc.weights (my weights files are all stored in x64/weights, and I use cfg/voc.data instead of data/voc.data), I get the result below: [image]

My OpenCV version may be 3.2? I have opencv_ffmpeg320_64 and opencv_world320.dll next to darknet.exe. [image]

I can use darknet.exe detector demo with yolov2 or yolov3 weights to do real-time detection (about 12 FPS), and this may use OpenCV? So I guess OpenCV may not be the issue?

Also, I don't have cudnn64_7.dll, cublas64_100.dll, cudart64_100.dll, cufft64_100.dll, cusolver64_100.dll, cusparse64_100.dll in the root dir ./x64; I just set the environment variable for CUDA.

With the library files copied

[image] [image] It just crashed again... :cry:

AlexeyAB commented 5 years ago

@syyao98 This is very strange.

[image]


  1. Try to set train= the same as the current valid= in the voc.data file:

    train  = data/voc/2007_test.txt
    valid  = data/voc/2007_test.txt

    And try to train.

  2. If it doesn't help, try to compile Darknet without OpenCV: open darknet.sln in MSVS, (right click on project) -> Properties -> C/C++ -> Preprocessor -> Preprocessor Definitions, then remove OPENCV; from there.

    Recompile and try to train.

  3. If it doesn't help, try to compile darknet_no_gpu.sln and try to train.

Do these 3 steps, and we will see where the problem is (in the training dataset, OpenCV, CUDA/GPU, something else...).

syyao98 commented 5 years ago

@AlexeyAB First, thank you for the reply! I am sure that I use the latest version of this repository and didn't change anything in darknet/src. After building, I only modified some cfg files for training. So far the situation is:

  1. Build without OpenCV [image]

Now I have 2 folders:

e:/darknet/build/darknet/x64 for the complete version
e:/darkdebug/build/darknet/x64 for the version without OpenCV

Both versions use:

data/voc.data specifying the image paths
yolo-voc.cfg with random=0, batch=64, subdivisions=8
darknet19_448.conv.23 in the root dir (./x64)

Complete version:

Crash: darknet.exe detector train data/voc.data yolo-voc.cfg darknet19_448.conv.23 -dont_show
Crash: darknet.exe detector train data/voc.data yolo-voc.cfg darknet19_448.conv.23 [image]

Version without OpenCV:

Success: darknet.exe detector train data/voc.data yolo-voc.cfg darknet19_448.conv.23 [image]

I observed it for 100 iterations and, thank goodness, no crash occurred. I got this: [image]

I also set random=1. It seems 6 GB of GPU memory is enough for training with random resizing? [image]

Additional

I notice that in https://github.com/AlexeyAB/darknet/blob/master/src/detector.c#L146-L154, if OPENCV is defined, args.threads will always be 3, whether or not the -dont_show flag is used. I'm not familiar with the Darknet source code or with threading. Is args.threads the number of threads on the CPU or the GPU? If so, how do we use fewer threads when OPENCV is defined? Would it at least be slower with fewer threads?

If this is not the core of the problem, then again: I use OpenCV 3.2, CUDA 10, and cuDNN 7.4.1.5. Maybe there is some incompatibility between these versions, or with my Nvidia 970M mobile graphics card?

I'd also like to know what I lose without OpenCV. What I know of is the camera demo and the real-time learning curve. Anything else? Maybe I should try OpenCV 3.4.0 tomorrow.

Finally, thank you for being so kind and helpful! return Happy New Year~ :confetti_ball: ;

AlexeyAB commented 5 years ago

@syyao98 When did you download your Darknet code? Try to download the latest version. Also I tested Darknet on GCC and MSVS, but I didn't test it on MinGW.

syyao98 commented 5 years ago

> @syyao98 When did you download your Darknet code? Try to download the latest version. Also I tested Darknet on GCC and MSVS, but I didn't test it on MinGW.

It's the OpenCV issue! I'll update my last comment later!

syyao98 commented 5 years ago

> @syyao98 When did you download your Darknet code? Try to download the latest version. Also I tested Darknet on GCC and MSVS, but I didn't test it on MinGW.

I use MinGW for convenience, for tools like wget, tar, and cat, and so that I can run things like get_coco_dataset.sh directly on Windows. Everything above was built with MSVS2015.

BTW, the font in your command prompt is awesome. May I ask what it is?

AlexeyAB commented 5 years ago

It seems that you installed OpenCV incorrectly, or you used different CUDA versions for OpenCV and Darknet, or you have several versions of OpenCV, or something else. Try using exactly this OpenCV 3.3.0 installer. Does it solve your problem? https://sourceforge.net/projects/opencvlibrary/files/opencv-win/3.3.0/opencv-3.3.0-vc14.exe/download
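
One way to check the "several versions of OpenCV" possibility is to list every OpenCV DLL reachable through PATH. A minimal sketch, assuming the DLLs follow the opencv_world*.dll naming seen in this thread; find_dlls_on_path is a hypothetical helper, and it does not look at the folder that contains darknet.exe, which Windows searches before PATH:

```python
import os
from pathlib import Path


def find_dlls_on_path(pattern="opencv_world*.dll", path_value=None):
    """List files matching pattern in each PATH folder, in search order."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    hits = []
    for folder in path_value.split(os.pathsep):
        if not folder:
            continue  # empty PATH entry
        directory = Path(folder)
        if directory.is_dir():
            hits.extend(sorted(str(match) for match in directory.glob(pattern)))
    return hits
```

If find_dlls_on_path() returns more than one entry with different version numbers (for example 320 and 340), the copy that gets loaded may not be the one Darknet was built against.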

These lines don't cause an issue: https://github.com/AlexeyAB/darknet/blob/08f0f80b66f02b7892a05f4c2ff3dc7c1c6d128b/src/detector.c#L146-L155

> https://github.com/AlexeyAB/darknet/blob/master/src/detector.c#L146-L154 If OPENCV is defined, args.threads will always be 3, whether or not the -dont_show flag is used. I'm not familiar with the Darknet source code or with threading. Is args.threads the number of threads on the CPU or the GPU? If so, how do we use fewer threads when OPENCV is defined? Would it at least be slower with fewer threads?

This is the number of CPU threads.

If you use the original Darknet https://github.com/pjreddie/darknet on Amazon EC2 (p3.2xlarge - Tesla V100) or DGX-2, then CPU utilization will be ~100% on all cores but GPU utilization will be < 50%, and training will be very slow. When I added Tensor Cores support, training did not get faster, because data augmentation on the CPU was the limit. So I added data augmentation using OpenCV.

If this repo of Darknet https://github.com/AlexeyAB/darknet is compiled with OpenCV, then data augmentation will use OpenCV's optimized (AVX/multi-threading) functions that are ~4x faster, so it removes the CPU bottleneck even if you use the fastest GPU with Tensor Cores (CUDNN_HALF=1) like Tesla V100 or DGX-2. About Tensor Cores: https://github.com/AlexeyAB/darknet/issues/407

So you can build this repository with GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1 for faster training and detection.


More about improvements: https://github.com/AlexeyAB/darknet#improvements-in-this-repository

> improved performance of data augmentation for training by 3.5x (using OpenCV SSE/AVX functions instead of hand-written functions), which removes the bottleneck for training on multi-GPU or GPU Volta

It doesn't require more than 3 CPU threads per Tesla V100 if you use an Intel CPU. And training goes faster with 3 threads than with 2 or 4 threads on Amazon EC2: https://github.com/AlexeyAB/darknet/issues/1380#issuecomment-412333942

Only if you use an AMD Ryzen CPU should you use more CPU threads: args.threads = 12 * ngpus;
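
The rule of thumb above can be written down as a tiny helper. This is only an illustrative sketch in Python, not the code in detector.c: loader_threads and the cpu_vendor labels are hypothetical, and scaling the Intel figure of 3 threads by the number of GPUs is an assumption based on the "per Tesla V100" wording.

```python
def loader_threads(ngpus, cpu_vendor="intel"):
    """Suggest a data-loading thread count from the rule of thumb above."""
    # 12 threads per GPU on AMD Ryzen, 3 per GPU otherwise (assumed scaling).
    threads_per_gpu = 12 if cpu_vendor == "amd_ryzen" else 3
    return threads_per_gpu * ngpus


print(loader_threads(1))               # one GPU, Intel CPU
print(loader_threads(2, "amd_ryzen"))  # two GPUs, AMD Ryzen CPU
```

In Darknet itself this value is set in the source, so changing it means editing args.threads and recompiling.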


> BTW, your font of your command prompt is awesome. May i ask what is that?

It is something like: Raster fonts (dot fonts)


> Finally, thank you for being so kind and helpful! return Happy New Year~ :confetti_ball: ;

Thanks! Happy New Year!