VOC training crashes at the very beginning on Windows

syyao98 commented 5 years ago

Hardware

Intel i7-6700HQ, Nvidia 970M 6G, 16G RAM

Software

Windows 10, VS2015, CUDA 10, cuDNN 7.4.1, OpenCV 3.2

Build option

Opened darknet.sln -> build succeeded (Release and x64)

Problem

I followed the instructions in Train a Classifier on CIFAR-10 and trained the classifier successfully, so classifier.c works well, I guess. But when I followed the VOC training instructions, either AlexeyAB's or pjreddie's, training crashed at the very beginning.

Troubleshooting

I read some related issues and did some troubleshooting, but couldn't locate the error.

cfg/voc.data:

classes=20
train  = data/voc/train.txt
valid  = data/voc/2007_test.txt
names = data/voc.names
backup = backup/

[x] UTF-8 and LF in all txt files, checked
[x] Use absolute paths in cfg/voc.data
[x] Use a subset of VOC with 100 images
[x] No empty lines in the txt files
[x] Set all random to 0
[x] Change batch and subdivisions to different combinations
[x] No 0.0 bounding boxes

All failed, and it still crashes at the very beginning.
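For anyone repeating this checklist, here is a minimal Python sketch of the text-file checks. It is not part of Darknet: the function names `check_list_file` and `has_zero_box` are made up for illustration, and it assumes label files use rows of the form `class x y w h`.

```python
from pathlib import Path


def check_list_file(raw):
    """Return a list of problems found in the raw bytes of a train/valid list file."""
    problems = []
    if b"\r" in raw:
        problems.append("file contains CR characters (CRLF line endings?)")
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("file is not valid UTF-8")
        return problems
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # a single trailing newline is fine
    for number, line in enumerate(lines, start=1):
        if line.strip() == "":
            problems.append(f"line {number} is empty")
    return problems


def has_zero_box(label_text):
    """True if any 'class x y w h' row has a width or height of exactly 0."""
    for row in label_text.splitlines():
        fields = row.split()
        if len(fields) == 5 and (float(fields[3]) == 0.0 or float(fields[4]) == 0.0):
            return True
    return False


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): python check_lists.py data/voc/train.txt
    for name in sys.argv[1:]:
        for problem in check_list_file(Path(name).read_bytes()):
            print(f"{name}: {problem}")
```

An empty result only says the list file looks clean; it does not rule out other causes of the crash.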
Screenshot

I followed the instructions and used the pre-trained weights, and it crashed.

The free space on my C: disk dropped to 7G (from 20G), and the training got stuck without any GPU usage.

Using the bash script (MinGW):

./darknet.exe detector train cfg/voc.data cfg/yolov2-voc.cfg weights/darknet19_448.conv.23

it crashed and threw a segmentation fault.

My Guess

I guess there might be something wrong when loading the data (it fails at the 1st iteration).

classifier.c works well, and a large net works well too: I modified darknet53 (last conv filters: 1000 -> 10) and used it to train on the CIFAR-10 dataset, and it works. But when it comes to detector.c, it just crashes at the very beginning.

I also noticed that the training process consumes my C: disk space and then gets stuck or crashes with a segmentation fault. So something may be going wrong when fetching the data or allocating memory (or CUDA memory?) for the data.

Also, darknet.exe detector demo data/voc.data cfg/yolov2-voc.cfg weights/yolo-voc.weights and the other test or demo commands work well and give good results (using the official trained weights file).

AlexeyAB commented 5 years ago

@syyao98 Hi,

Can you successfully get mAP (accuracy) for the default trained model https://pjreddie.com/media/files/yolov2-voc.weights on the Pascal VOC dataset? https://github.com/AlexeyAB/darknet#how-to-calculate-map-on-pascalvoc-2007

Using this command: darknet.exe detector map data/voc.data cfg/yolov2-voc.cfg yolo-voc.weights

Try to copy these files to the same directory as darknet.exe:

opencv_world340.dll, opencv_ffmpeg340_64.dll, cudnn64_7.dll, cublas64_100.dll, cudart64_100.dll, cufft64_100.dll, cusolver64_100.dll, cusparse64_100.dll, pthreadVC2.dll, pthreadGC2.dll
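A quick way to see which of these files are missing next to darknet.exe is a small Python sketch like the one below. It is only an illustration: the helper name `missing_dlls` is made up, and the DLL names are copied from the list above, so they need adjusting for other OpenCV or CUDA versions.

```python
import os

# DLL names copied from the list above; adjust them to your own OpenCV/CUDA versions.
REQUIRED_DLLS = [
    "opencv_world340.dll", "opencv_ffmpeg340_64.dll", "cudnn64_7.dll",
    "cublas64_100.dll", "cudart64_100.dll", "cufft64_100.dll",
    "cusolver64_100.dll", "cusparse64_100.dll", "pthreadVC2.dll", "pthreadGC2.dll",
]


def missing_dlls(present_names, required=REQUIRED_DLLS):
    """Return the required DLL names not found in present_names (case-insensitive)."""
    present = {name.lower() for name in present_names}
    return [name for name in required if name.lower() not in present]


if __name__ == "__main__":
    import sys

    # Pass the directory that contains darknet.exe; defaults to the current directory.
    exe_dir = sys.argv[1] if len(sys.argv) > 1 else "."
    for name in missing_dlls(os.listdir(exe_dir)):
        print("missing:", name)
```

A DLL reported as missing here may still be found through PATH, so this only checks the "same directory" suggestion.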

I can successfully train yolov2-voc.cfg on the Pascal VOC dataset: darknet.exe detector train data/voc.data cfg/yolov2-voc.cfg darknet19_448.conv.23 -dont_show

Windows 7 x64
MSVS 2015
CUDA 10.0
cuDNN 7.3.1 for CUDA 10.0
OpenCV 3.3.0

File data/voc.data:

classes= 20
train  = data/train_voc.txt
valid  = data/2007_test.txt
#difficult = data/difficult_2007_test.txt
names = data/voc.names
backup = backup/

File data/train_voc.txt:

E:\VOC2007_2012\VOCtrainval_11-May-2012/VOCdevkit/VOC2007/JPEGImages/000012.jpg
E:\VOC2007_2012\VOCtrainval_11-May-2012/VOCdevkit/VOC2007/JPEGImages/000017.jpg
E:\VOC2007_2012\VOCtrainval_11-May-2012/VOCdevkit/VOC2007/JPEGImages/000023.jpg
E:\VOC2007_2012\VOCtrainval_11-May-2012/VOCdevkit/VOC2007/JPEGImages/000026.jpg
E:\VOC2007_2012\VOCtrainval_11-May-2012/VOCdevkit/VOC2007/JPEGImages/000032.jpg
E:\VOC2007_2012\VOCtrainval_11-May-2012/VOCdevkit/VOC2007/JPEGImages/000033.jpg
...
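To check a list like this before training, a small Python sketch can confirm that every image and its label file exist. This assumes the usual VOC layout, in which the label path is obtained by replacing `JPEGImages` with `labels` and the image extension with `.txt`; the function names are hypothetical and not part of Darknet.

```python
import os


def label_path_for(image_path):
    """Guess the label file for an image (assumed JPEGImages -> labels convention)."""
    stem, _ext = os.path.splitext(image_path.replace("JPEGImages", "labels"))
    return stem + ".txt"


def find_missing(list_text, exists=os.path.exists):
    """Return (missing_images, missing_labels) for the paths in a train list."""
    missing_images, missing_labels = [], []
    for line in list_text.splitlines():
        path = line.strip()
        if not path:
            continue  # empty lines are reported by neither list
        if not exists(path):
            missing_images.append(path)
        if not exists(label_path_for(path)):
            missing_labels.append(path)
    return missing_images, missing_labels


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): python check_paths.py data/train_voc.txt
    for name in sys.argv[1:]:
        with open(name, encoding="utf-8") as handle:
            images, labels = find_missing(handle.read())
        print(f"{name}: {len(images)} missing images, {len(labels)} missing labels")
```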

syyao98 commented 5 years ago

@AlexeyAB

With nothing changed

Using the script calc_mAP_voc_py.cmd, it looks like the eval procedure works well.

Using darknet.exe detector map cfg/voc.data cfg/yolov2-voc.cfg weights/yolo-voc.weights (my weights files are all stored in x64/weights, and I use cfg/voc.data instead of data/voc.data), I get the result below:

My OpenCV version may be 3.2? I have opencv_ffmpeg320_64.dll and opencv_world320.dll next to darknet.exe. I can use darknet.exe detector demo with the yolov2 or yolov3 weights to do real-time detection (about 12 FPS), and this probably uses OpenCV? So I guess OpenCV may not be the issue? Also, I don't have cudnn64_7.dll, cublas64_100.dll, cudart64_100.dll, cufft64_100.dll, cusolver64_100.dll, cusparse64_100.dll in the root dir ./x64; I just set the environment variables for CUDA.

With the library files copied

It just crashed again... :cry:

AlexeyAB commented 5 years ago

@syyao98 This is very strange.

Do you use the latest version from this repository?
What message do you get each time?

Did you change anything in the source code?

Try to set train= the same as the current valid= in the voc.data file:
```
train  = data/voc/2007_test.txt
valid  = data/voc/2007_test.txt
```
And try to train.
If it doesn't help, try to compile Darknet without OpenCV: open darknet.sln in MSVS, (right click on project) -> properties -> C/C++ -> Preprocessor -> Preprocessor Definitions

Then remove OPENCV; from there.

Recompile.

And try to train.

If it doesn't help, try to compile darknet_no_gpu.sln and try to train.

Do these 3 steps, and we will see where the problem is (in the training dataset, OpenCV, CUDA/GPU, or something else...).

syyao98 commented 5 years ago

@AlexeyAB First, thank you for the reply! I am sure that I use the latest version of this repository and didn't change anything in darknet/src; after the build I only modified some cfg files to train. So far the situation is:

[x] Train a small classifier on CIFAR-10, using an absolute path, a relative path or a symlinked dir: all succeeded
[x] Train a large classifier on CIFAR-10 (modified darknet53): succeeded
[x] Use trained weights to do real-time detection via webcam (OpenCV API?): succeeded
[x] Use the trained weights to do the evaluation: succeeded
[ ] Use the same .data file and modify the cfg file (batch settings): failed on VOC (both with pre-trained weights and from scratch)
[ ] Failed on a very small subset of VOC
[ ] Failed on COCO training

Very strange, and it almost drives me mad :anger: Since evaluation passed while training failed, it seems the detector can't deal with a batch greater than 1?

Note that I have MinGW installed. In the Windows command prompt, it just crashed with no error information. But in git bash, it says I got a segmentation fault at line 1. Since I only have 1 line in my bash script, and it's the line that runs ./darknet detector train, I guess I get a segmentation fault when fetching batches of data. (And she didn't tell me that and just crashed again and again.)

As you advised:
1. Set train= the same as the current valid= in the voc.data file:
```
train = data/voc/2007_test.txt
valid  = data/voc/2007_test.txt
```
  And in cfg file:
```
[net]
# Testing
#batch=1
#subdivisions=1
# Training
batch=64
subdivisions=64
```
  Train: darknet.exe detector train cfg/voc.data cfg/yolov2-voc.cfg -dont_show
  Crashed after printing the learning rate (as usual).
  And with the cfg file modified:
```
batch=1
subdivisions=1
```
  Doing the same training again gives me an error, and then it gets stuck. (So there may be something wrong in the train_function() in detector.c?)
  After pressing CTRL+C, the command prompt just shows the learning rate output and then quits.
  Then I ran the evaluation, with nothing changed, in the same prompt window: darknet.exe detector map cfg/voc.data cfg/yolov2-voc.cfg weights/yolo-voc.weights
  It works well, as before.
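For reference, the two cfg variants above may not differ as much as they look. The sketch below assumes Darknet pushes batch / subdivisions images (integer division) through the network per forward/backward pass; the helper name `images_per_pass` is made up for illustration.

```python
def images_per_pass(batch, subdivisions):
    """Images per forward/backward pass, assuming batch // subdivisions."""
    if batch < 1 or subdivisions < 1:
        raise ValueError("batch and subdivisions must be positive")
    return batch // subdivisions


if __name__ == "__main__":
    # The combinations mentioned in this thread.
    for batch, subdivisions in [(64, 64), (64, 8), (1, 1)]:
        per_pass = images_per_pass(batch, subdivisions)
        print(f"batch={batch} subdivisions={subdivisions} -> {per_pass} image(s) per pass")
```

Under that assumption, batch=64 subdivisions=64 and batch=1 subdivisions=1 both process one image per pass, and differ mainly in how many passes are accumulated before a weight update.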

Build without OpenCV

Now I have 2 folders:
e:/darknet/build/darknet/x64 for the complete version
e:/darkdebug/build/darknet/x64 for the version without OpenCV

Both versions have:
data/voc.data specifying the image paths
yolo-voc.cfg with random=0, batch=64, subdivisions=8
darknet19_448.conv.23 in the root dir (./x64)

Complete version:
Crash: darknet.exe detector train data/voc.data yolo-voc.cfg darknet19_448.conv.23 -dont_show
Crash: darknet.exe detector train data/voc.data yolo-voc.cfg darknet19_448.conv.23

No-OpenCV version:
Success: darknet.exe detector train data/voc.data yolo-voc.cfg darknet19_448.conv.23

I observed it for 100 iterations and, for God's sake, nothing went wrong. I also set random=1, and it seems 6G of GPU memory is enough for training with random resizing?

Additional

I noticed this: https://github.com/AlexeyAB/darknet/blob/master/src/detector.c#L146-L154

If OPENCV is defined, args.threads will always be 3, whether or not the -dont_show flag is used. I'm not familiar with the Darknet source code or with threading. Is args.threads the number of CPU or GPU threads? If so, how do we use fewer threads when OPENCV is defined? Would it at least be slower with fewer threads?

If this is not the core of the problem, then again: I use OpenCV 3.2 && CUDA 10 && cuDNN 7.4.1.5. Maybe there is some incompatibility between these versions, or with my Nvidia 970M mobile graphics card?

I also want to know what I lose without OpenCV. What I know of is the webcam demo and the real-time learning curve. Anything else? Maybe I should try OpenCV 3.4.0 tomorrow.

Finally, thank you for being so kind and helpful! return Happy New Year~ :confetti_ball: ;

AlexeyAB commented 5 years ago

@syyao98 When did you download your Darknet code? Try to download the latest version. Also I tested Darknet on GCC and MSVS, but I didn't test it on MinGW.

syyao98 commented 5 years ago

> @syyao98 When did you download your Darknet code? Try to download the latest version. Also I tested Darknet on GCC and MSVS, but I didn't test it on MinGW.

It's the opencv issue! Update my last comment later!

syyao98 commented 5 years ago

> @syyao98 When did you download your Darknet code? Try to download the latest version. Also I tested Darknet on GCC and MSVS, but I didn't test it on MinGW.

I use MinGW for convenience, for tools like wget, tar and cat, and so that I can run things like get_coco_dataset.sh directly on Windows. Everything above was built with MSVS 2015.

BTW, the font in your command prompt is awesome. May I ask what it is?

AlexeyAB commented 5 years ago

It seems that you installed OpenCV incorrectly, or you used different CUDA versions for OpenCV and Darknet, or you have several versions of OpenCV, or something else. Try to use exactly this OpenCV 3.3.0 installer. Does it solve your problem? https://sourceforge.net/projects/opencvlibrary/files/opencv-win/3.3.0/opencv-3.3.0-vc14.exe/download
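One way to check the "several versions of OpenCV" possibility is to look for opencv_world DLLs in every directory on PATH. The Python sketch below is only an illustration under that assumption: the function names are made up, and it only recognizes files named like opencv_world320.dll (optionally with a trailing d for debug builds).

```python
import os
import re

# Matches e.g. opencv_world320.dll or opencv_world340d.dll and captures the version digits.
OPENCV_DLL = re.compile(r"opencv_world(\d+)d?\.dll$", re.IGNORECASE)


def opencv_versions(dir_listing):
    """Map version suffix (e.g. '320') to the directories that contain that DLL.

    dir_listing is a dict of {directory: [file names]}.
    """
    found = {}
    for directory, names in dir_listing.items():
        for name in names:
            match = OPENCV_DLL.match(name)
            if match:
                found.setdefault(match.group(1), []).append(directory)
    return found


def listing_from_path(path_value, sep=os.pathsep):
    """List the files of every readable directory named in a PATH-style string."""
    listing = {}
    for directory in path_value.split(sep):
        if directory and os.path.isdir(directory):
            try:
                listing[directory] = os.listdir(directory)
            except OSError:
                continue  # skip directories we are not allowed to read
    return listing


if __name__ == "__main__":
    versions = opencv_versions(listing_from_path(os.environ.get("PATH", "")))
    for suffix, dirs in sorted(versions.items()):
        print(suffix, dirs)
    if len(versions) > 1:
        print("more than one OpenCV version is reachable through PATH")
```

DLLs sitting next to darknet.exe are not covered unless that directory is passed in as well.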

These lines don't cause an issue: https://github.com/AlexeyAB/darknet/blob/08f0f80b66f02b7892a05f4c2ff3dc7c1c6d128b/src/detector.c#L146-L155

> https://github.com/AlexeyAB/darknet/blob/master/src/detector.c#L146-L154 If OPENCV is defined, args.threads will always be 3, whether or not the -dont_show flag is used. I'm not familiar with the Darknet source code or with threading. Is args.threads the number of CPU or GPU threads? If so, how do we use fewer threads when OPENCV is defined? Would it at least be slower with fewer threads?

This is the number of CPU threads.

If you use the original Darknet https://github.com/pjreddie/darknet on Amazon EC2 (p3.2xlarge - Tesla V100) or DGX-2, then CPU utilization will be ~100% for all cores, but GPU utilization will be < 50%, and training will go very slowly. When I added Tensor Cores support, training was not faster, due to the data augmentation (on CPU) limitation. So I added data augmentation using OpenCV.

If this repo of Darknet https://github.com/AlexeyAB/darknet is compiled with OpenCV, then the data augmentation will use OpenCV's optimized (AVX/multi-threading) functions, which are ~4x faster, so it will remove the CPU bottleneck even if you use the fastest GPU with Tensor Cores (CUDNN_HALF=1) like Tesla V100 or DGX-2. About Tensor Cores: https://github.com/AlexeyAB/darknet/issues/407

So you can build this repository with GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1 for faster training and detection:

on Windows: open \darknet.sln -> (right click on project) -> properties -> C/C++ -> Preprocessor -> Preprocessor Definitions, and add at the beginning of line: CUDNN_HALF;
on Linux: open Makefile and set CUDNN_HALF=1: https://github.com/AlexeyAB/darknet/blob/140333977cea0ba9e384cd38fd01013a8915ef60/Makefile#L3

More about improvements: https://github.com/AlexeyAB/darknet#improvements-in-this-repository

> improved performance 3.5 X times of data augmentation for training (using OpenCV SSE/AVX functions instead of hand-written functions) - removes bottleneck for training on multi-GPU or GPU Volta

It doesn't require more than 3 CPU-threads per 1 Tesla V100 if you use Intel CPU. And training goes faster with 3 threads than with 2 or 4 threads on Amazon EC2: https://github.com/AlexeyAB/darknet/issues/1380#issuecomment-412333942

Only if you use an AMD Ryzen CPU should you use more CPU threads: args.threads = 12 * ngpus;

> BTW, the font in your command prompt is awesome. May I ask what it is?

It is something like: Raster fonts (dot fonts)

> Finally, thank you for being so kind and helpful! return Happy New Year~ :confetti_ball: ;

Thanks! Happy New Year!
