AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet )
http://pjreddie.com/darknet/

status Error dark_cuda.c : cuda_push_array() : Line 458 #4657

Open cym0301 opened 4 years ago

cym0301 commented 4 years ago

Hi everyone,

I am a beginner in object detection and am currently trying out csresnext50-panet-spp. I started training with the command "darknet.exe detector train data/innoiris.data cfg/innoiris.cfg csresnext50-panet-spp.conv.112 -map" and the attached configuration file. During training, the error shown in the screenshot occurred. Is it caused by a mistake in my configuration or by a hardware issue? I am using OpenCV 4.2 with CUDA 10.2 and cuDNN 7.6.5.32, and one GTX 1080 for training. I am not using the latest version of darknet but commit 6878ecc.

innoiris.txt

image

AlexeyAB commented 4 years ago

@cym0301 Hi,

Try to use the latest commit, do you get this error?

cym0301 commented 4 years ago

image

The same error still occurred after recompiling with the latest code.

AlexeyAB commented 4 years ago

I fixed a bug: https://github.com/AlexeyAB/darknet/commit/619e39fc71a7c65f9d33aaca8ec05167796e88aa

AlexeyAB commented 4 years ago

cym0301 commented 4 years ago

Hi.

  1. I am now using 67c91e6 but still get the same error
  2. I am using Windows
  3. I compiled using MSVC with darknet.sln
  4. Where can I find those 2 files?

AlexeyAB commented 4 years ago

Where can I find those 2 files?

They will be created next to the darknet.exe file if there are errors in your training dataset.

Run this file and show screenshot. nvidia-smi.zip

Show content of file innoiris.data

Run training, press Pause, and show screenshot like this: image

cym0301 commented 4 years ago
  1. Execution of nvidia-smi.exe image
  2. Content of innoiris.data
    
    classes = 4
    train = data/train.txt
    valid = data/valid.txt
    names = data/innoiris.names
    backup = backup/

  3. Training Screenshot
![image](https://user-images.githubusercontent.com/17352505/72164983-acfb8d80-3401-11ea-9db5-5b4256124542.png)
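A minimal sketch for sanity-checking a .data file like the one above before starting a training run. It is not part of darknet: the helper names `parse_data_file` and `missing_paths` are hypothetical, and it assumes the file consists of plain `key = value` lines, with lines starting with `#` or `;` treated as comments.

```python
from pathlib import Path


def parse_data_file(text):
    """Parse 'key = value' lines from a darknet-style .data file into a dict."""
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, comment lines, and anything without '='.
        if not line or line[0] in "#;" or "=" not in line:
            continue
        key, value = line.split("=", 1)
        options[key.strip()] = value.strip()
    return options


def missing_paths(options, base_dir="."):
    """Return the path-valued keys whose target does not exist under base_dir."""
    base = Path(base_dir)
    return [
        key
        for key in ("train", "valid", "names", "backup")
        if key in options and not (base / options[key]).exists()
    ]


# The contents of innoiris.data as posted in this thread.
EXAMPLE = """classes = 4
train = data/train.txt
valid = data/valid.txt
names = data/innoiris.names
backup = backup/
"""

options = parse_data_file(EXAMPLE)
print(options["classes"], options["train"], options["valid"])
```

Calling `missing_paths(options, base_dir)` with the folder that contains darknet.exe as `base_dir` lists any entry that points at a file or folder that is not there. It only checks that the paths exist; it says nothing about whether the images or labels inside are valid.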
AlexeyAB commented 4 years ago

Try to train it with another dataset. Or share your dataset, I will try to train with it.

KimalIsaev commented 4 years ago

Hi, I have a similar problem: ERROR

I'm using an Nvidia GeForce GTX 1080Ti on Windows 10, and I compiled using CMake. No bad.list or bad_label.list file is created. I am using the latest version of darknet.

  1. Execution of nvidia-smi.exe: SMI

  2. Content of obj.data

classes= 1
train  = train.txt
valid  = test.txt
names = obj.names
backup = backup/
  3. Training Screenshot: beg

KimalIsaev commented 4 years ago

Strange thing: if I set valid same as train, like this:

classes= 1
train  = train.txt
valid  = train.txt
names = obj.names
backup = backup/

Everything works properly.

AlexeyAB commented 4 years ago

@KimalIsaev Your valid dataset is incorrect.

cym0301 commented 4 years ago

How do I send you my dataset for you to try? I've updated to the latest version and still no luck.

AlexeyAB commented 4 years ago

Try to set and train:

train = valid.txt
valid = valid.txt

Also check your valid dataset by using Yolo_mark.

cym0301 commented 4 years ago

I have tried both setting train and valid to valid.txt and setting train and valid to train.txt. It is still not working

KimalIsaev commented 4 years ago

Alexey, is there some way to find out what exactly is incorrect in the dataset?

AlexeyAB commented 4 years ago

I have tried both setting train and valid to valid.txt and setting train and valid to train.txt.

If you train with an incorrect dataset, then it should create bad.list and bad_label.list files.

KimalIsaev commented 4 years ago

Darknet doesn't create bad.list and bad_label.list files.
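When those files are not created, one option is to check the label files independently. The sketch below is not darknet's own check and does not reproduce the conditions under which darknet writes bad.list or bad_label.list. It assumes the usual YOLO label format of one object per line, `<class> <x_center> <y_center> <width> <height>`, with the four box values relative to the image size in the range (0, 1]. The function name `check_label_line` is hypothetical.

```python
def check_label_line(line, num_classes):
    """Return a list of problems found in one YOLO label line (empty if none)."""
    parts = line.split()
    if len(parts) != 5:
        return [f"expected 5 fields, got {len(parts)}"]
    try:
        class_id = int(parts[0])
        x, y, w, h = (float(p) for p in parts[1:])
    except ValueError:
        return ["could not parse fields as one integer and four floats"]

    problems = []
    if not 0 <= class_id < num_classes:
        problems.append(f"class id {class_id} outside 0..{num_classes - 1}")
    for name, value in (("x", x), ("y", y), ("w", w), ("h", h)):
        # This comparison is also False for NaN, so NaN values are reported.
        if not 0.0 < value <= 1.0:
            problems.append(f"{name}={value} outside (0, 1]")
    return problems


def check_label_text(text, num_classes):
    """Check every non-empty line; return {line_number: problems} for bad lines."""
    report = {}
    for number, line in enumerate(text.splitlines(), start=1):
        if line.strip():
            problems = check_label_line(line, num_classes)
            if problems:
                report[number] = problems
    return report


print(check_label_text("0 0.5 0.5 0.2 0.2\n1 0.5 0.5 0.2 0.2\n", num_classes=1))
```

Running `check_label_text` over the contents of each .txt label file, with `num_classes` taken from the `classes` entry of the .data file, points at the line that breaks the format. A dataset that passes this check can still trigger the CUDA error discussed here, so a clean report only rules out one class of problems.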

AlexeyAB commented 4 years ago

@cym0301 If you can successfully train with Train dataset, but can't train with Valid dataset, then send Valid dataset to alexeyab84@gmail.com (or send URL to the google-disk with your valid dataset)

KimalIsaev commented 4 years ago

After 200 steps with just Train dataset darknet again gives the same error.

KimalIsaev commented 4 years ago

Even if I set batch=1 and have just 1 image in the dataset, it gives a similar error after something like 100 iterations. dontknow

AlexeyAB commented 4 years ago

Strange thing: if I set valid same as train, like this:

classes= 1
train = train.txt
valid = train.txt
names = obj.names
backup = backup/

Everything working properly.

After 200 steps with just Train dataset darknet again gives the same error.

So do you have an issue or not?

KimalIsaev commented 4 years ago

I have the issue after 200 steps.

AlexeyAB commented 4 years ago

@KimalIsaev

MrCuiHao commented 4 years ago

I have the same trouble as theirs when training csresnext50-panet-spp.cfg on MS COCO

KimalIsaev commented 4 years ago

AlexeyAB commented 4 years ago

@KimalIsaev Download the latest code of Darknet.

Try to set there: compute_30,sm_30 or compute_61,sm_61

and recompile

dontknow
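For reference, a small sketch of how those strings relate to a GPU's compute capability. It assumes that "there" refers to the CUDA code generation setting of the Visual Studio project, which takes entries of the form `compute_XY,sm_XY`; the equivalent nvcc command-line form is `-gencode arch=compute_XY,code=sm_XY`. The GTX 1080 and GTX 1080 Ti mentioned in this thread have compute capability 6.1. The helper names are hypothetical.

```python
def vs_codegen_entry(major, minor):
    """Format a compute capability as a 'compute_XY,sm_XY' code generation entry."""
    return f"compute_{major}{minor},sm_{major}{minor}"


def nvcc_gencode_flag(major, minor):
    """Format the same compute capability as an nvcc -gencode flag."""
    return f"-gencode arch=compute_{major}{minor},code=sm_{major}{minor}"


# Compute capabilities of the GPUs mentioned in this thread.
GPUS = {
    "GTX 1080": (6, 1),
    "GTX 1080 Ti": (6, 1),
}

for name, capability in GPUS.items():
    print(name, vs_codegen_entry(*capability), nvcc_gencode_flag(*capability))
```

So for both cards in this thread the matching entry is compute_61,sm_61.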

cym0301 commented 4 years ago

I just tried the latest commit d88a9eb with my dataset. I ran about 40 iterations and no error occurred. However, the current average loss became nan.

KimalIsaev commented 4 years ago

With compute_30,sm_30 and the latest commit: itch2

With compute_35,sm_35 and the latest commit: otch

With compute_61,sm_61 and the latest commit: atch

Every time it happens on the 3rd iteration.

MrCuiHao commented 4 years ago

Hi, it's ok when training yolov3.cfg on pascal voc, while not on the csresnext50-panet-spp.cfg. Could you share your successful training experience with csresnext50-panet-spp.cfg, including hardware info, environment configuration, training process and so on? Or which commit did you last train csresnext50-panet-spp.cfg successfully with? Looking forward to your reply.

KimalIsaev commented 4 years ago

This is the cfg file: pasc.zip

It's nearly identical to csresnext50-panet-spp-original-optimal.cfg: the number of classes is changed to 20 and the learning rate is a little higher than the original. I have an Nvidia GeForce GTX 1080Ti. Training was done on 67c91e6.

AlexeyAB commented 4 years ago

@KimalIsaev Use the latest commit.

AlexeyAB commented 4 years ago

@MrCuiHao

It's ok when training yolov3.cfg on pascal voc,while not on the csresnext50-panet-spp.cfg,

What do you mean?

cym0301 commented 4 years ago

@AlexeyAB I sent you my dataset a few days ago. Please check it and let me know what I should do next.

KimalIsaev commented 4 years ago

@AlexeyAB With the latest commit and compute_35,sm_35, sometimes the old error pops up, sometimes this one: memacc

With compute_61,sm_61: 61

AlexeyAB commented 4 years ago

@KimalIsaev

AlexeyAB commented 4 years ago

@cym0301 Can you do the same? https://github.com/AlexeyAB/darknet/issues/4657#issuecomment-575261274

KimalIsaev commented 4 years ago

@AlexeyAB Recompile with Release or Debug option?

AlexeyAB commented 4 years ago

Release

KimalIsaev commented 4 years ago

@AlexeyAB 1000 steps without any error. If I remove CUDA_DEBUG, the error appears again.

MrCuiHao commented 4 years ago

@MrCuiHao

It's ok when training yolov3.cfg on pascal voc,while not on the csresnext50-panet-spp.cfg,

What do you mean?

I mean that after training for a little more than 100 steps, the error occurs as follows: image

But when I add CUDA_DEBUG at the latest commit, the error has not occurred yet. It's so strange. It also runs slower than before. image

MrCuiHao commented 4 years ago

I found a little secret: when I don't add CUDA_DEBUG at the latest commit, GPU-Util is about 70%~90%, but it is about 20% with CUDA_DEBUG. Does this have anything to do with it? image

AlexeyAB commented 4 years ago

@cym0301 @KimalIsaev


add CUDA_DEBUG at latest commit, the GPU-Util occupy about 70%~90%,but about 20% with CUDA_DEBUG

This is normal.

KimalIsaev commented 4 years ago

- CPU RAM = 32 GB
- Without CUDA_DEBUG and with the flag -benchmark_layers: 500 iterations without an error

KimalIsaev commented 4 years ago

Without CUDNN the project can't be built; an error pops up in Visual Studio.

AlexeyAB commented 4 years ago

@KimalIsaev

-Without CUDA_DEBUG and with flag -benchmark_layers 500 iterations without an error

It is very strange.

Without CUDNN project can't be build, in the visual studio error pops out.

Can you show screenshot of this error? Did you remove CUDNN;CUDNN_HALF; ?

KimalIsaev commented 4 years ago

vcerror

3>convolutional_kernels.obj : error LNK2019: unresolved external symbol cudnn_handle referenced in function backward_convolutional_layer_gpu
3>convolutional_kernels.obj : error LNK2019: unresolved external symbol cudnn_check_error_extended referenced in function backward_convolutional_layer_gpu
3>C:\project\darknet-master\Release\darknet.exe : fatal error LNK1120: 2 unresolved externals
2>scale_channels_layer.c
3>Done building project "darknet.vcxproj" -- FAILED.

Sorry, the original build output was in Russian.

KimalIsaev commented 4 years ago

Training on the same dataset with tiny-yolo and standard settings runs without an error.

KimalIsaev commented 4 years ago

-Without CUDA_DEBUG and with flag -benchmark_layers 500 iterations without an error

It is very strange.

After 1000 more steps I got an error. deb

AlexeyAB commented 4 years ago

@KimalIsaev I added a fix.

KimalIsaev commented 4 years ago

I use CMake.