Open syyao98 opened 5 years ago
@syyao98 Hi,
Using this command:
darknet.exe detector map data/voc.data cfg/yolov2-voc.cfg yolo-voc.weights
With darknet.exe and these DLLs: opencv_world340.dll, opencv_ffmpeg340_64.dll, cudnn64_7.dll, cublas64_100.dll, cudart64_100.dll, cufft64_100.dll, cusolver64_100.dll, cusparse64_100.dll, pthreadVC2.dll, pthreadGC2.dll,
I can successfully train yolov2-voc.cfg on the PascalVOC dataset:
darknet.exe detector train data/voc.data cfg/yolov2-voc.cfg darknet19_448.conv.23 -dont_show
File data/voc.data:
classes= 20
train = data/train_voc.txt
valid = data/2007_test.txt
#difficult = data/difficult_2007_test.txt
names = data/voc.names
backup = backup/
File data/train_voc.txt:
E:\VOC2007_2012\VOCtrainval_11-May-2012/VOCdevkit/VOC2007/JPEGImages/000012.jpg
E:\VOC2007_2012\VOCtrainval_11-May-2012/VOCdevkit/VOC2007/JPEGImages/000017.jpg
E:\VOC2007_2012\VOCtrainval_11-May-2012/VOCdevkit/VOC2007/JPEGImages/000023.jpg
E:\VOC2007_2012\VOCtrainval_11-May-2012/VOCdevkit/VOC2007/JPEGImages/000026.jpg
E:\VOC2007_2012\VOCtrainval_11-May-2012/VOCdevkit/VOC2007/JPEGImages/000032.jpg
E:\VOC2007_2012\VOCtrainval_11-May-2012/VOCdevkit/VOC2007/JPEGImages/000033.jpg
...
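Since a crash at the first iteration can come from the data-loading step, one quick sanity check is to confirm that every image in the list file exists and has a label file. The following is a minimal sketch, not part of Darknet. The function names are hypothetical, and it assumes the usual VOC layout in which the label for .../JPEGImages/000012.jpg is .../labels/000012.txt; adjust that mapping if your labels live elsewhere.

```python
import os


def label_path_for(image_path):
    """Guess the label file for an image (assumed VOC-style convention)."""
    # Assumption: 'JPEGImages' -> 'labels' and the image extension -> '.txt'.
    normalized = image_path.replace("\\", "/")
    root, _ext = os.path.splitext(normalized)
    return root.replace("/JPEGImages/", "/labels/") + ".txt"


def check_image_list(list_file):
    """Return (missing_images, missing_labels) for a train/valid list file."""
    missing_images, missing_labels = [], []
    with open(list_file, encoding="utf-8") as handle:
        for line in handle:
            path = line.strip()
            if not path:
                continue  # skip blank lines
            if not os.path.isfile(path):
                missing_images.append(path)
            if not os.path.isfile(label_path_for(path)):
                missing_labels.append(path)
    return missing_images, missing_labels
```

Calling check_image_list("data/train_voc.txt") and finding both returned lists empty would suggest the list file itself is not the cause of the crash.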
@AlexeyAB
Using the script calc_mAP_voc_py.cmd, it looks like the eval procedure functions well.
Using darknet.exe detector map cfg/voc.data cfg/yolov2-voc.cfg weights/yolo-voc.weights (my weights files are all stored in x64/weights, and I use cfg/voc.data instead of data/voc.data), I get the result below:
My OpenCV version may be 3.2? I have opencv_ffmpeg320_64 and opencv_world320.dll with darknet.exe.
I can use darknet.exe detector demo with yolov2 or yolov3 weights for real-time detection (about 12 FPS), and this may use OpenCV? So I guess OpenCV may not be the issue?
And I don't have cudnn64_7.dll, cublas64_100.dll, cudart64_100.dll, cufft64_100.dll, cusolver64_100.dll, cusparse64_100.dll in the root dir ./x64; I just set the environment variable for CUDA.
It just crashed again... :cry:
@syyao98 This is very strange.
Try to set train= the same as the current valid= in the voc.data file:
train = data/voc/2007_test.txt
valid = data/voc/2007_test.txt
And try to train.
If it doesn't help - try to compile Darknet without OpenCV:
Open darknet.sln in MSVS, (right click on project) -> properties -> C/C++ -> Preprocessor -> Preprocessor Definitions
Then remove OPENCV; there.
Recompile.
And try to train.
If it doesn't help - try to compile Darknet without GPU using darknet_no_gpu.sln
And try to train.
Do these 3 steps, and we will see where the problem is (in the training dataset, OpenCV, CUDA/GPU, something else...).
@AlexeyAB
First, thank you for the reply!
I am sure that I use the latest version of this repository and didn't change anything in darknet/src, and after the build I modified some cfg files to train.
So far the situation is:
.data file, modify the cfg file (batch things), failed on VOC (either with pre-trained weights or from scratch).
./darknet detector train thing. I guess I get a segmentation fault when fetching batches of data. (And she didn't tell me that and just crashed again and again.)
Set train= the same as the current valid= in the voc.data file:
train = data/voc/2007_test.txt
valid = data/voc/2007_test.txt
And in the cfg file:
[net]
# Testing
#batch=1
#subdivisions=1
# Training
batch=64
subdivisions=64
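To make the effect of these two settings concrete: Darknet loads batch images per iteration but pushes them through the network in batch / subdivisions chunks, so batch=64 with subdivisions=64 means one image per forward pass. The sketch below is an illustration only. The helper names are hypothetical, and the parser covers just the [net] section of a cfg, skipping lines that start with # or ;.

```python
def parse_net_section(cfg_text):
    """Read key=value pairs from the first [net] section of a Darknet cfg."""
    options = {}
    in_net = False
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":
            continue  # blank line or comment such as '#batch=1'
        if line.startswith("["):
            if in_net:
                break  # the [net] section has ended
            in_net = line == "[net]"
            continue
        if in_net and "=" in line:
            key, value = line.split("=", 1)
            options[key.strip()] = value.strip()
    return options


def images_per_pass(options):
    """Images processed in one forward/backward pass (integer division)."""
    batch = int(options.get("batch", 1))
    subdivisions = int(options.get("subdivisions", 1))
    return batch // subdivisions
```

For the [net] block above, images_per_pass gives 1; with batch=64 and subdivisions=8 it gives 8, which needs correspondingly more GPU memory per pass.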
Train: darknet.exe detector train cfg/voc.data cfg/yolov2-voc.cfg -dont_show
Crashed after printing the Learning Rate (as usual).
And with the cfg file modified:
batch=1
subdivisions=1
Doing the same training again gives me the error:
And then it gets stuck. (So it may be something wrong in the train_function() in detector.c?)
And after pressing CTRL+C:
The command prompt just shows me the Learning Rate things, then quits.
Then I do the evaluation with nothing changed, in the same prompt window:
darknet.exe detector map cfg/voc.data cfg/yolov2-voc.cfg weights/yolo-voc.weights
It functions well, as before.
e:/darknet/build/darknet/x64 for the complete version
e:/darkdebug/build/darknet/x64 for the version without opencv
Both versions with:
data/voc.data specifying the image path
yolo-voc.cfg with random=0, batch=64, subdivisions=8
darknet19_448.conv.23 in the root dir (./x64)
Complete version:
Crash: darknet.exe detector train data/voc.data yolo-voc.cfg darknet19_448.conv.23 -dont_show
Crash: darknet.exe detector train data/voc.data yolo-voc.cfg darknet19_448.conv.23
No opencv version:
Success: darknet.exe detector train data/voc.data yolo-voc.cfg darknet19_448.conv.23
Observed for 100 iterations, nothing happens, for god's sake, and I got this:
Also set random=1
It seems 6G of GPU memory is enough for this random-resize training?
I notice that:
https://github.com/AlexeyAB/darknet/blob/master/src/detector.c#L146-L154
If OpenCV is defined, args.threads will always be 3, whether or not the dont_show flag is used.
I'm not familiar with the darknet source code or the threading.
I'm not sure whether args.threads is the number of threads on the CPU or the GPU?
If so, how do we use fewer threads when OpenCV is defined? It will at least be slower with fewer threads?
If this is not the core of the problem, again:
I use opencv3.2 && CUDA 10 && cudnn7.4.1.5
Maybe there is some incompatibility between these versions, or with my Nvidia 970 mobile graphics card?
I also want to know what I lose without OpenCV. What I know of is the cam demo and the real-time learning curve. Anything else? Maybe I should try opencv3.4.0 tomorrow.
Finally, thank you for being so kind and helpful! return Happy New Year~ :confetti_ball: ;
@syyao98 When did you download your Darknet code? Try to download the latest version. Also I tested Darknet on GCC and MSVS, but I didn't test it on MinGW.
It's the opencv issue! Update my last comment later!
@syyao98 When did you download your Darknet code? Try to download the latest version. Also I tested Darknet on GCC and MSVS, but I didn't test it on MinGW.
I use MinGW for convenience, for things like wget, tar, cat.
And I can run things like get_coco_dataset.sh directly on Windows.
Everything above was built with MSVS2015.
BTW, the font of your command prompt is awesome. May I ask what it is?
It seems that you installed OpenCV incorrectly, or you used different CUDA-versions for OpenCV and Darknet, or you have several versions of OpenCV, or something else. Try to use exactly this OpenCV 3.3.0 installer, will it solve your problem? https://sourceforge.net/projects/opencvlibrary/files/opencv-win/3.3.0/opencv-3.3.0-vc14.exe/download
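One way to check the "several versions of OpenCV" possibility is to list every OpenCV DLL that Windows could pick up, grouped by the version number embedded in the file name. This is a rough sketch under stated assumptions: the function name is hypothetical, the regular expression only covers names shaped like the ones mentioned in this thread (opencv_world340.dll, opencv_ffmpeg340_64.dll), and the search order used here (current directory, then PATH) is a simplification of the real Windows DLL search order.

```python
import os
import re
from collections import defaultdict

# Matches e.g. opencv_world320.dll or opencv_ffmpeg340_64.dll (release builds only).
OPENCV_DLL = re.compile(r"^opencv_(?:world|ffmpeg)(\d+)(?:_64)?\.dll$", re.IGNORECASE)


def find_opencv_dlls(directories):
    """Map version digits (e.g. '320') to the OpenCV DLL paths found."""
    found = defaultdict(list)
    for directory in directories:
        if not os.path.isdir(directory):
            continue  # PATH often contains stale or empty entries
        for name in sorted(os.listdir(directory)):
            match = OPENCV_DLL.match(name)
            if match:
                found[match.group(1)].append(os.path.join(directory, name))
    return dict(found)


if __name__ == "__main__":
    search_dirs = [os.getcwd()] + os.environ.get("PATH", "").split(os.pathsep)
    for version, paths in sorted(find_opencv_dlls(search_dirs).items()):
        print(version, paths)
```

If more than one version key shows up, the DLL loaded at run time may not be the one Darknet was compiled against.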
These lines don't cause an issue: https://github.com/AlexeyAB/darknet/blob/08f0f80b66f02b7892a05f4c2ff3dc7c1c6d128b/src/detector.c#L146-L155
https://github.com/AlexeyAB/darknet/blob/master/src/detector.c#L146-L154 If opencv is defined, args.threads will always be 3 whether or not the dont_show flag been used I'm not familiar with the darknet source code or the thread things. I'm not sure this args.threads is the num of the thread in CPU/GPU? If so, how do we use less thread when we defined opencv? It will at least be slower with less thread?
This is the number of CPU threads.
If you use original Darknet https://github.com/pjreddie/darknet on Amazon EC2 (p3.2xlarge - Tesla V100) or DGX2, then CPU-utilization will be ~100% for all cores, but GPU-utilization will be < 50%, and training will go very slow. When I added Tensor Cores support - training was not faster, due to data augmentation (on CPU) limitation. So I added data augmentation using OpenCV.
If this repo of Darknet https://github.com/AlexeyAB/darknet is compiled with OpenCV, then the data augmentation will use OpenCV optimized (AVX/multi-threading) functions that are ~4x faster, so it removes the CPU bottleneck even if you use the fastest GPU with Tensor Cores (CUDNN_HALF=1) like Tesla V100 or DGX-2. About Tensor Cores: https://github.com/AlexeyAB/darknet/issues/407
So you can build this repository with GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1 for faster training and detection:
on Windows: open \darknet.sln -> (right click on project) -> properties -> C/C++ -> Preprocessor -> Preprocessor Definitions, and add at the beginning of the line: CUDNN_HALF;
on Linux: open Makefile and set CUDNN_HALF=1: https://github.com/AlexeyAB/darknet/blob/140333977cea0ba9e384cd38fd01013a8915ef60/Makefile#L3
More about improvements: https://github.com/AlexeyAB/darknet#improvements-in-this-repository
improved performance 3.5 X times of data augmentation for training (using OpenCV SSE/AVX functions instead of hand-written functions) - removes bottleneck for training on multi-GPU or GPU Volta
It doesn't require more than 3 CPU-threads per 1 Tesla V100 if you use Intel CPU. And training goes faster with 3 threads than with 2 or 4 threads on Amazon EC2: https://github.com/AlexeyAB/darknet/issues/1380#issuecomment-412333942
Only if you use AMD Ryzen CPU, then you should use more CPU-threads, args.threads = 12 * ngpus;
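The rule of thumb above can be restated as a small helper. This is only a Python paraphrase of the numbers quoted in this comment (3 CPU threads per GPU on Intel, 12 per GPU on AMD Ryzen), not the C code in detector.c; the function name and the cpu_vendor parameter are hypothetical.

```python
def data_loader_threads(ngpus, cpu_vendor="intel"):
    """Return a CPU data-loading thread count following the rule of thumb above."""
    if ngpus < 1:
        raise ValueError("ngpus must be at least 1")
    # Values taken from the comment: 3 per GPU on Intel, 12 per GPU on AMD Ryzen.
    per_gpu = 12 if cpu_vendor.lower() == "amd" else 3
    return per_gpu * ngpus
```

In Darknet itself the value is set at compile time, so changing it means editing args.threads in detector.c and rebuilding.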
BTW, your font of your command prompt is awesome. May i ask what is that?
It is something like: Raster fonts (dot fonts)
Finnally, thank you for being so kind and helpful! return Happy New Year~ confetti_ball ;
Thanks! Happy New Year!
Hardware
Intel i7-6700HQ Nvidia 970M 6G 16G RAM
Software
Windows 10 VS2015 CUDA10 cudnn 7.4.1 opencv 3.2
Build option
Open darknet.sln -> build, and it succeeded (Release and x64).
Problem
Followed this instruction, Train a Classifier on CIFAR-10, and succeeded in training the classifier. So classifier.c works well, I guess. But when I followed the VOC training instructions, either AlexeyAB's or pjreddie's, training crashed at the very beginning.
Troubleshooting
I read some related issues and did some troubleshooting, couldn't locate the error.
cfg/voc.data
cfg/voc.data
random to 0
batch and subdivisions to different combinations
Screenshot
Followed the instruction, used the pre-trained weights, crashed:
And the free space on my C: disk dropped to 7G (from 20G). Even with no GPU usage, the training got stuck.
Using the bash script (MinGW): ./darknet.exe detector train cfg/voc.data cfg/yolov2-voc.cfg weights/darknet19_448.conv.23
It crashed and threw a segmentation fault.
My Guess
I guess there might be something wrong when loading the data (it failed at the 1st iteration).
classifier.c functions well, and a large net functions well: I modified darknet53 (last conv filters: 1000 -> 10) and used it to train on the cifar10 dataset, and it works. But when it comes to detector.c, it just crashes at the very beginning. I also noticed that the training process just consumed my C disk space and then got stuck or crashed with a segmentation fault. So there may be something wrong when fetching the data or allocating the memory (or CUDA memory?) for the data? Also, darknet.exe detector demo data/voc.data cfg/yolov2-voc.cfg weights/yolo-voc.weights and these test or demo scripts function well and give good results (using the official trained weights file).