visha-l closed this issue 7 years ago
First, make sure that everything compiles without errors when you run make to build Darknet Yolo with the flags GPU=1 CUDNN=1. Also show the output of nvidia-smi.
This version of CUDA is installed on my system:
CUDA Version 8.0.61
This version of cuDNN is installed on my system:
#define CUDNN_MAJOR 6
#define CUDNN_MINOR 0
#define CUDNN_PATCHLEVEL 21
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
I think it means cuDNN version 6.0.21
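As a quick check of that reading, the CUDNN_VERSION macro quoted above can be evaluated by hand. The sketch below only mirrors the macro arithmetic with the three values from the header excerpt; the function name is made up for illustration.

```python
# Values copied from the cudnn.h excerpt quoted above.
CUDNN_MAJOR = 6
CUDNN_MINOR = 0
CUDNN_PATCHLEVEL = 21


def cudnn_version(major, minor, patch):
    """Mirror the CUDNN_VERSION macro: major * 1000 + minor * 100 + patch."""
    return major * 1000 + minor * 100 + patch


print(cudnn_version(CUDNN_MAJOR, CUDNN_MINOR, CUDNN_PATCHLEVEL))  # prints 6021
```

So the header describes cuDNN 6.0.21, encoded as the single integer 6021.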
The nvidia-smi command gives me:
Fri Jun 2 10:16:08 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57 Driver Version: 367.57 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 0000:00:1E.0 Off | 0 |
| N/A 50C P0 56W / 149W | 0MiB / 11439MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
make gives me this:
gcc -DGPU -I/usr/local/cuda-8.0/include/ -DCUDNN -Wall -Wfatal-errors -Ofast -DGPU -DCUDNN -c ./src/gemm.c -o obj/gemm.o
In file included from ./src/gemm.c:3:0:
./src/cuda.h:10:26: fatal error: cuda_runtime.h: No such file or directory
#include "cuda_runtime.h"
^
compilation terminated.
make: *** [obj/gemm.o] Error 1
In file included from ./src/gemm.c:3:0:
./src/cuda.h:15:19: fatal error: cudnn.h: No such file or directory
#include "cudnn.h"
^
compilation terminated.
make: *** [obj/gemm.o] Error 1
I am now getting the error above.
My Makefile is:
GPU=1
CUDNN=1
OPENCV=0
DEBUG=0
ARCH= -gencode arch=compute_20,code=[sm_20,sm_21] \
-gencode arch=compute_30,code=sm_30 \
-gencode arch=compute_35,code=sm_35 \
-gencode arch=compute_50,code=[sm_50,compute_50] \
-gencode arch=compute_52,code=[sm_52,compute_52]
# This is what I use, uncomment if you know your arch and want to specify
# ARCH= -gencode arch=compute_52,code=compute_52
VPATH=./src/
EXEC=darknet
OBJDIR=./obj/
CC=gcc
NVCC=nvcc
OPTS=-Ofast
LDFLAGS= -lm -pthread
COMMON=
CFLAGS=-Wall -Wfatal-errors
ifeq ($(DEBUG), 1)
OPTS=-O0 -g
endif
CFLAGS+=$(OPTS)
ifeq ($(OPENCV), 1)
COMMON+= -DOPENCV
CFLAGS+= -DOPENCV
LDFLAGS+= `pkg-config --libs opencv`
COMMON+= `pkg-config --cflags opencv`
endif
ifeq ($(GPU), 1)
COMMON+= -DGPU -I/usr/local/cuda-7.0/include/
CFLAGS+= -DGPU
LDFLAGS+= -L/usr/local/cuda-7.0/lib64 -lcuda -lcudart -lcublas -lcurand
endif
ifeq ($(CUDNN), 1)
COMMON+= -DCUDNN
CFLAGS+= -DCUDNN
LDFLAGS+= -lcudnn
endif
OBJ=gemm.o utils.o cuda.o deconvolutional_layer.o convolutional_layer.o list.o image.o activations.o im2col.o col2im.o blas.o crop_layer.o dropout_layer.o maxpool_layer.o softmax_layer.o data.o matrix.o network.o connected_layer.o cost_layer.o parser.o option_list.o darknet.o detection_layer.o captcha.o route_layer.o writing.o box.o nightmare.o normalization_layer.o avgpool_layer.o coco.o dice.o yolo.o detector.o layer.o compare.o regressor.o classifier.o local_layer.o swag.o shortcut_layer.o activation_layer.o rnn_layer.o gru_layer.o rnn.o rnn_vid.o crnn_layer.o demo.o tag.o cifar.o go.o batchnorm_layer.o art.o region_layer.o reorg_layer.o lsd.o super.o voxel.o tree.o
ifeq ($(GPU), 1)
LDFLAGS+= -lstdc++
OBJ+=convolutional_kernels.o deconvolutional_kernels.o activation_kernels.o im2col_kernels.o col2im_kernels.o blas_kernels.o crop_layer_kernels.o dropout_layer_kernels.o maxpool_layer_kernels.o network_kernels.o avgpool_layer_kernels.o
endif
OBJS = $(addprefix $(OBJDIR), $(OBJ))
DEPS = $(wildcard src/*.h) Makefile
all: obj backup results $(EXEC)
$(EXEC): $(OBJS)
	$(CC) $(COMMON) $(CFLAGS) $^ -o $@ $(LDFLAGS)
$(OBJDIR)%.o: %.c $(DEPS)
	$(CC) $(COMMON) $(CFLAGS) -c $< -o $@
$(OBJDIR)%.o: %.cu $(DEPS)
	$(NVCC) $(ARCH) $(COMMON) --compiler-options "$(CFLAGS)" -c $< -o $@
obj:
	mkdir -p obj
backup:
	mkdir -p backup
results:
	mkdir -p results
.PHONY: clean
clean:
	rm -rf $(OBJS) $(EXEC)
Am I supposed to make some changes in the Makefile to give the path of cuDNN?
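Both compiler errors above say that a header (cuda_runtime.h, then cudnn.h) was not found on the include path, so it may help to check which directories actually contain them before editing the -I and -L paths in the Makefile. The following is a minimal sketch: the helper name and the list of candidate directories are assumptions, not paths confirmed for this machine.

```python
import os


def find_header(header, include_dirs):
    """Return the directories from include_dirs that contain the given header."""
    return [d for d in include_dirs if os.path.isfile(os.path.join(d, header))]


# Candidate include directories (assumed; adjust to the local installation).
candidates = [
    "/usr/local/cuda/include",
    "/usr/local/cuda-8.0/include",
    "/usr/local/cuda-7.0/include",
    "/usr/include",
]

for header in ("cuda_runtime.h", "cudnn.h"):
    found = find_header(header, candidates)
    print(header, "->", found if found else "not found in the listed directories")
```

If a header shows up in a directory other than the one named after -I in the Makefile, that path is the likely thing to change.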
nvcc -gencode arch=compute_20,code=[sm_20,sm_21] -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_50,code=[sm_50,compute_50] -gencode arch=compute_52,code=[sm_52,compute_52] -DGPU -I/usr/local/cuda-7.0/include/ --compiler-options "-Wall -Wfatal-errors -Ofast -DGPU" -c ./src/convolutional_kernels.cu -o obj/convolutional_kernels.o
nvcc fatal : 'sm_21]' is not in 'keyword=value' format
make: *** [obj/convolutional_kernels.o] Error 255
Why am I getting this error?
ubuntu@ip-10-0-0-226:~$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2013 NVIDIA Corporation
Built on Wed_Jul_17_18:36:13_PDT_2013
Cuda compilation tools, release 5.5, V5.5.0
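One thing that stands out in the outputs quoted in this thread is that nvcc -V reports release 5.5, while the installed toolkit was reported as CUDA Version 8.0.61. The thread does not confirm that this mismatch causes the sm_21] error, but it is cheap to detect. The sketch below parses the nvcc -V text shown above and compares it with the toolkit version; the function names are hypothetical.

```python
import re

# Output of `nvcc -V` exactly as quoted above.
NVCC_OUTPUT = """nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2013 NVIDIA Corporation
Built on Wed_Jul_17_18:36:13_PDT_2013
Cuda compilation tools, release 5.5, V5.5.0"""


def nvcc_release(text):
    """Extract (major, minor) from the 'release X.Y' part of nvcc -V output."""
    match = re.search(r"release (\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("no release string found in nvcc output")
    return int(match.group(1)), int(match.group(2))


def major_minor(version):
    """Turn a dotted version such as '8.0.61' into (major, minor)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


compiler = nvcc_release(NVCC_OUTPUT)
toolkit = major_minor("8.0.61")  # the CUDA version reported earlier in the thread
print(compiler, toolkit, "match" if compiler == toolkit else "MISMATCH")
```

A mismatch here would suggest that the nvcc found first on PATH belongs to a different CUDA installation than the one the Makefile points at.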
Please help me out.
ubuntu@ip-10-0-0-226:~/darknet$ ./darknet detector train Yolo_mark-master/x64/Release/data/obj.data Yolo_mark-master/x64/Release/yolo-obj.cfg darknet19_448.conv.23
yolo-obj
layer filters size input output
0 conv 32 3 x 3 / 1 1088 x1088 x 3 -> 1088 x1088 x 32
1 max 2 x 2 / 2 1088 x1088 x 32 -> 544 x 544 x 32
2 conv 64 3 x 3 / 1 544 x 544 x 32 -> 544 x 544 x 64
3 max 2 x 2 / 2 544 x 544 x 64 -> 272 x 272 x 64
4 conv 128 3 x 3 / 1 272 x 272 x 64 -> 272 x 272 x 128
5 conv 64 1 x 1 / 1 272 x 272 x 128 -> 272 x 272 x 64
6 conv 128 3 x 3 / 1 272 x 272 x 64 -> 272 x 272 x 128
7 max 2 x 2 / 2 272 x 272 x 128 -> 136 x 136 x 128
8 CUDA Error: out of memory
darknet: ./src/cuda.c:36: check_error: Assertion `0' failed.
Aborted (core dumped)
Show the first 10 lines of Yolo_mark-master/x64/Release/yolo-obj.cfg.
And try to set subdivisions=16 or 32 in yolo-obj.cfg: https://github.com/AlexeyAB/darknet/blob/master/cfg/yolo-voc.2.0.cfg#L3
Also show the values of these parameters from yolo-obj.cfg:
[net]
batch=1
subdivisions=8
height=1088
width=1088
channels=3
momentum=0.9
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1
filters=125 classes=20 random=1
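The pair filters=125 and classes=20 is consistent with the rule given in the linked training instructions for YOLO v2 style cfg files, where the last convolutional layer needs (classes + 5) * 5 filters. A minimal sketch of that arithmetic follows, assuming 5 anchors and 4 box coordinates plus 1 objectness score per anchor, as in yolo-voc.2.0.cfg; the function name is made up for illustration.

```python
def region_filters(classes, num_anchors=5, coords=4):
    """Filters for the last convolutional layer before a YOLO v2 region layer.

    Each anchor predicts `coords` box values, 1 objectness score and one
    score per class, so the layer needs (classes + coords + 1) * num_anchors.
    """
    return (classes + coords + 1) * num_anchors


print(region_filters(20))  # prints 125
```

If the number of classes changes, filters has to be recomputed with the same formula.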
Now it is giving this error:
ubuntu@ip-10-0-0-226:~/darknet$ ./darknet detector train Yolo_mark-master/x64/Release/data/obj.data Yolo_mark-master/x64/Release/yolo-obj.cfg darknet19_448.conv.23
./darknet: error while loading shared libraries: libcudart.so.7.0: cannot open shared object file: No such file or directory
As described here: https://github.com/AlexeyAB/darknet#how-to-train-to-detect-your-custom-objects
- change line batch to batch=64
- change line subdivisions to subdivisions=8
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
Sir, it is still giving the previous error.
ubuntu@ip-10-0-0-226:~/darknet$ ./darknet detector train Yolo_mark-master/x64/Release/data/obj.data Yolo_mark-master/x64/Release/yolo-obj.cfg darknet19_448.conv.23
yolo-obj
layer filters size input output
0 conv 32 3 x 3 / 1 1088 x1088 x 3 -> 1088 x1088 x 32
1 max 2 x 2 / 2 1088 x1088 x 32 -> 544 x 544 x 32
2 conv 64 3 x 3 / 1 544 x 544 x 32 -> 544 x 544 x 64
3 max 2 x 2 / 2 544 x 544 x 64 -> 272 x 272 x 64
4 conv 128 3 x 3 / 1 272 x 272 x 64 -> 272 x 272 x 128
5 conv 64 1 x 1 / 1 272 x 272 x 128 -> 272 x 272 x 64
6 conv 128 3 x 3 / 1 272 x 272 x 64 -> 272 x 272 x 128
7 max 2 x 2 / 2 272 x 272 x 128 -> 136 x 136 x 128
8 CUDA Error: out of memory
darknet: ./src/cuda.c:36: check_error: Assertion `0' failed.
Aborted (core dumped)
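To get a feel for why a 1088 x 1088 input runs out of memory so early, it helps to estimate the size of a single layer's output. The sketch below assumes 32-bit floats and counts only the output buffer of layer 0 (1088 x 1088 x 32, from the log above); real usage is higher because every layer also needs its own buffers, so treat the numbers as a lower bound, not a measurement.

```python
BYTES_PER_FLOAT = 4  # assumption: activations are stored as 32-bit floats


def output_bytes(width, height, channels, images=1):
    """Bytes needed to hold one layer's output for the given number of images."""
    return width * height * channels * BYTES_PER_FLOAT * images


MIB = 2 ** 20

# Layer 0 from the log above: 1088 x 1088 x 32 output.
print(output_bytes(1088, 1088, 32) / MIB)            # prints 144.5
# The same layer when 8 images are processed together (batch=64, subdivisions=8).
print(output_bytes(1088, 1088, 32, images=8) / MIB)  # prints 1156.0
```

With the card reporting 11439MiB in total, one layer alone already takes about a tenth of it in that configuration, which is why a smaller network size or more subdivisions is the usual way out.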
I changed my Yolo_mark-master/x64/Release/yolo-obj.cfg:
[net]
batch=64
subdivisions=8
height=1088
width=1088
channels=3
momentum=0.9
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1
learning_rate=0.0001
max_batches = 45000
policy=steps
steps=100,25000,35000
scales=10,.1,.1
[convolutional]
batch_normalize=1
filters=32
size=3
stride=1
pad=1
activation=leaky
[maxpool]
size=2
stride=2
[convolutional]
batch_normalize=1
filters=64
size=3
stride=1
pad=1
activation=leaky
[maxpool]
size=2
Even when I set batch to 32 it gave the error.
If I understand correctly, batch in the .cfg file is the number of images that it trains on in each iteration. I am using p2.xlarge, which has one GPU. When I start detector training with batch=32, subdivisions=8 it gives a core dump. When I decreased batch to batch=16, it ran for 579 iterations but then stopped with an error saying it cannot load the backup//yolo-obj_580.weights file. Also, when I checked a weights file such as yolo-obj_300.weights, it contained 256MB of data, but the files above 300 did not contain any data. Even yolo-obj_350.weights contains no data.
And sir, when I tested my image with this file, it gave the following output:
vishal@user756:~/darknet$ ./darknet detect Yolo_mark-master/x64/Release/yolo-obj.cfg /home/vishal/Desktop/yolo-obj_300.weights /home/vishal/CARS/Audi/audi_1.jpg -thres 0
layer filters size input output
0 conv 32 3 x 3 / 1 1088 x1088 x 3 -> 1088 x1088 x 32
1 max 2 x 2 / 2 1088 x1088 x 32 -> 544 x 544 x 32
2 conv 64 3 x 3 / 1 544 x 544 x 32 -> 544 x 544 x 64
3 max 2 x 2 / 2 544 x 544 x 64 -> 272 x 272 x 64
4 conv 128 3 x 3 / 1 272 x 272 x 64 -> 272 x 272 x 128
5 conv 64 1 x 1 / 1 272 x 272 x 128 -> 272 x 272 x 64
6 conv 128 3 x 3 / 1 272 x 272 x 64 -> 272 x 272 x 128
7 max 2 x 2 / 2 272 x 272 x 128 -> 136 x 136 x 128
8 conv 256 3 x 3 / 1 136 x 136 x 128 -> 136 x 136 x 256
9 conv 128 1 x 1 / 1 136 x 136 x 256 -> 136 x 136 x 128
10 conv 256 3 x 3 / 1 136 x 136 x 128 -> 136 x 136 x 256
11 max 2 x 2 / 2 136 x 136 x 256 -> 68 x 68 x 256
12 conv 512 3 x 3 / 1 68 x 68 x 256 -> 68 x 68 x 512
13 conv 256 1 x 1 / 1 68 x 68 x 512 -> 68 x 68 x 256
14 conv 512 3 x 3 / 1 68 x 68 x 256 -> 68 x 68 x 512
15 conv 256 1 x 1 / 1 68 x 68 x 512 -> 68 x 68 x 256
16 conv 512 3 x 3 / 1 68 x 68 x 256 -> 68 x 68 x 512
17 max 2 x 2 / 2 68 x 68 x 512 -> 34 x 34 x 512
18 conv 1024 3 x 3 / 1 34 x 34 x 512 -> 34 x 34 x1024
19 conv 512 1 x 1 / 1 34 x 34 x1024 -> 34 x 34 x 512
20 conv 1024 3 x 3 / 1 34 x 34 x 512 -> 34 x 34 x1024
21 conv 512 1 x 1 / 1 34 x 34 x1024 -> 34 x 34 x 512
22 conv 1024 3 x 3 / 1 34 x 34 x 512 -> 34 x 34 x1024
23 conv 1024 3 x 3 / 1 34 x 34 x1024 -> 34 x 34 x1024
24 conv 1024 3 x 3 / 1 34 x 34 x1024 -> 34 x 34 x1024
25 route 16
26 reorg / 2 68 x 68 x 512 -> 34 x 34 x2048
27 route 26 24
28 conv 1024 3 x 3 / 1 34 x 34 x3072 -> 34 x 34 x1024
29 conv 125 1 x 1 / 1 34 x 34 x1024 -> 34 x 34 x 125
30 detection
Loading weights from /home/vishal/Desktop/yolo-obj_300.weights...Done!
Segmentation fault (core dumped)
I am trying to detect the logo of the car, so I gathered cars of 20 different makes and took around 25 images per class. I also used yolo_mark to create a txt file corresponding to each image file.
I created an obj.names file which contains one class (the name of the car make) on each line:
chevrolet
honda
hyundai
mahindra
nissan
skoda
tata
toyota
audi
bmw
datsun
fiat
ford
jaguar
maruti-suzuki
mercedes
range-rover
renault
volkswagen
volvo
I created an obj.data file which contains:
classes= 20
train = data/train.txt
valid = data/train.txt
names = data/obj.names
backup = backup/
The beginning of my Makefile:
GPU=1
CUDNN=1
OPENCV=1
DEBUG=0
ARCH= -gencode arch=compute_20,code=[sm_20,sm_21] \
-gencode arch=compute_30,code=sm_30 \
-gencode arch=compute_35,code=sm_35 \
-gencode arch=compute_50,code=[sm_50,compute_50] \
-gencode arch=compute_52,code=[sm_52,compute_52]
# This is what I use, uncomment if you know your arch and want to specify
# ARCH= -gencode arch=compute_52,code=compute_52
VPATH=./src/
Sir, in the comment above I have shown you my .cfg file.
So please tell me how I can solve the segmentation fault, which occurs during training as well as during testing with the early weight files.
One more thing I want to know: why are all these multiple weight files being generated, and what is the significance of generating a weight file for each iteration?
Also, after that error during training it did not save any backup file, so how will I use a saved weight file for retraining?
It would be a great help, sir.
Thanks a lot for all the help you have given so far; please help me a little more.
Please help me out.
One more thing I want to ask you, sir. If a single weight file takes 256MB and is created by a single iteration, and we are supposed to run at least 1000 iterations, what will the storage requirement be for that number of weight files?
I was using 60GB of storage for this. It filled up and training stopped at the 579th iteration with an error that it can't load the yolo-obj_580.weights file and that no storage is left.
So, how much storage is required to store these weight files?
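The storage question can be answered with simple arithmetic, at least as a rough estimate. The sketch below assumes that every weight file is the same 256MB as the one checked, that one file is written per iteration as described, and that MB and GB are decimal units; none of this is verified beyond what is stated in the thread.

```python
def total_storage_mb(num_files, file_mb):
    """Total space in MB if every weight file has the same size."""
    return num_files * file_mb


def files_that_fit(disk_mb, file_mb):
    """How many complete weight files fit on a disk of the given size."""
    return disk_mb // file_mb


print(total_storage_mb(1000, 256))     # prints 256000, i.e. about 256GB
print(files_that_fit(60 * 1000, 256))  # prints 234
```

Under these assumptions 1000 files would need roughly 256GB, and a 60GB volume holds only a couple of hundred complete files, so later files being empty is what one would expect once the disk is full.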
If you use height=1088 and width=1088, then set:
batch=16
subdivisions=16
random=0
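The reason these two values matter together is that Darknet splits each batch into subdivisions and pushes only batch / subdivisions images through the GPU at a time. The sketch below applies that division to the settings tried in this thread; the function name is made up, and the integer division is meant to mirror how the cfg values are combined.

```python
def images_per_step(batch, subdivisions):
    """Images processed on the GPU at once: batch split into subdivisions."""
    return batch // subdivisions


# (batch, subdivisions) pairs mentioned in this thread.
for batch, subdivisions in [(64, 8), (32, 8), (16, 16)]:
    print(batch, subdivisions, "->", images_per_step(batch, subdivisions))
# prints:
# 64 8 -> 8
# 32 8 -> 4
# 16 16 -> 1
```

With batch=16 and subdivisions=16 only one 1088 x 1088 image is on the GPU at a time, which is the smallest memory footprint these two settings allow.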
Sir, but in my case the weight files are generated with numeric suffixes from 1 to 500 in increments of 1, so basically from 0 to 1000 it generates 1000 weight files.
What do all these terms stand for? Is there any documentation that explains the meaning of these terms in the .cfg file?
@visha-l @vg123 You should use yolo-voc.2.0.cfg as the initial cfg-file for your yolo-car.cfg, as described here: https://github.com/AlexeyAB/darknet#how-to-train-to-detect-your-custom-objects
I am making a detector as described in this link (build detector from scratch). I followed all the instructions and now I am using p2.xlarge (an EC2 instance of AWS) to provide it with a GPU for fast training, but it trains at the same speed as it did when I was not using a GPU. This p2.xlarge has GPU=1, so it should run faster, but this is not happening. Why? I have set GPU=1 in the Makefile, as instructed in the link, so tell me what else needs to be done.