status Error dark_cuda.c : cuda_push_array() : Line 458

cym0301 commented 4 years ago

Hi everyone,

I am a beginner in object detection and am currently trying out csresnext50-panet-spp. I started training with the command "darknet.exe detector train data/innoiris.data cfg/innoiris.cfg csresnext50-panet-spp.conv.112 -map" and the attached configuration file. During training, the error shown in the screenshot occurred. Is it caused by a mistake in my configuration or by a hardware issue? I am using OpenCV 4.2 with CUDA 10.2 and cuDNN 7.6.5.32, and a single GTX 1080, for training. I am not using the latest version of darknet but commit 6878ecc.

innoiris.txt

AlexeyAB commented 4 years ago

@cym0301 Hi,

Try the latest commit. Do you still get this error?

cym0301 commented 4 years ago

The same error still occurs after recompiling with the latest code.

AlexeyAB commented 4 years ago

I fixed a bug: https://github.com/AlexeyAB/darknet/commit/619e39fc71a7c65f9d33aaca8ec05167796e88aa

AlexeyAB commented 4 years ago

I have trained your model for 100 iterations and didn't get any errors.
Do you use Windows or Linux?
Did you compile Darknet using CMake, vcpkg, or the MSVS darknet.sln?
Show the contents of the files bad.list and bad_label.list.

cym0301 commented 4 years ago

Hi.

I am now using 67c91e6 but still get the same error.
I am using Windows.
I compiled using the MSVC darknet.sln.
Where can I find those 2 files?

AlexeyAB commented 4 years ago

> Where can I find those 2 files?

They will be created next to the darknet.exe file if there are errors in your training dataset.

Run this file and show a screenshot: nvidia-smi.zip

Show the contents of the file innoiris.data

Run training, press Pause, and show a screenshot like this:

cym0301 commented 4 years ago

1. Execution of nvidia-smi.exe

2. Content of innoiris.data

classes = 4
train = data/train.txt
valid = data/valid.txt
names = data/innoiris.names
backup = backup/

3. Training Screenshot
![image](https://user-images.githubusercontent.com/17352505/72164983-acfb8d80-3401-11ea-9db5-5b4256124542.png)
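
For readers following along: the innoiris.data file above is a plain list of `key = value` lines, so it is easy to sanity-check before training. Below is a minimal Python sketch, not Darknet's own parser. The function names `parse_data_file` and `missing_paths` are made up for illustration, and it assumes the paths are relative to the directory darknet.exe is started from.

```python
from pathlib import Path


def parse_data_file(text):
    """Parse 'key = value' lines into a dict, skipping blank and '#' lines."""
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key = value line
        options[key.strip()] = value.strip()
    return options


def missing_paths(options, base_dir="."):
    """Return the keys whose path does not exist under base_dir (assumed layout)."""
    missing = []
    for key in ("train", "valid", "names", "backup"):
        if key in options and not (Path(base_dir) / options[key]).exists():
            missing.append(key)
    return missing


# The .data content posted above, used as sample input.
example = """
classes = 4
train = data/train.txt
valid = data/valid.txt
names = data/innoiris.names
backup = backup/
"""
opts = parse_data_file(example)
print(opts["classes"], opts["train"])
print("missing:", missing_paths(opts))
```

Run from the folder that contains darknet.exe, an empty "missing" list only means the referenced files and folders exist; it says nothing about their contents.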

AlexeyAB commented 4 years ago

Try training it with another dataset, or share your dataset and I will try to train with it.

KimalIsaev commented 4 years ago

Hi, I have a similar problem: ERROR (screenshot). I'm using an Nvidia GeForce GTX 1080 Ti and Windows 10. I compiled using CMake. There are no bad.list and bad_label.list files. I'm using the latest version of darknet.

Execution of nvidia-smi.exe:
Content of obj.data

classes= 1
train  = train.txt
valid  = test.txt
names = obj.names
backup = backup/

Training Screenshot:

KimalIsaev commented 4 years ago

Strange thing: if I set valid same as train, like this:

classes= 1
train  = train.txt
valid  = train.txt
names = obj.names
backup = backup/

Everything works properly.

AlexeyAB commented 4 years ago

@KimalIsaev Your validation (valid) dataset is incorrect.

cym0301 commented 4 years ago

How do I send you my dataset for you to try? I've updated to the latest version and still no luck.

AlexeyAB commented 4 years ago

Try setting this and training:

train = valid.txt
valid = valid.txt

Also check your validation dataset using Yolo_mark.
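
A quick scripted check can complement Yolo_mark here. The following is a minimal Python sketch under stated assumptions: the list file (train.txt or valid.txt) has one image path per line, and each label file sits next to its image with the same name and a .txt extension. The function name `check_image_list` and the `base_dir` parameter are hypothetical and not part of Darknet.

```python
import sys
from pathlib import Path


def check_image_list(list_path, base_dir="."):
    """Return (missing_images, missing_labels) for a train/valid list file.

    Relative entries are resolved against base_dir; absolute entries are
    used as they are.
    """
    missing_images, missing_labels = [], []
    for raw in Path(list_path).read_text().splitlines():
        entry = raw.strip()
        if not entry:
            continue
        image = Path(base_dir) / entry
        if not image.is_file():
            missing_images.append(entry)
        elif not image.with_suffix(".txt").is_file():
            # Assumption: label file = image path with a .txt extension.
            missing_labels.append(entry)
    return missing_images, missing_labels


if __name__ == "__main__":
    # Usage: python check_list.py data/train.txt data/valid.txt
    for list_file in sys.argv[1:]:
        images, labels = check_image_list(list_file)
        print(list_file, "missing images:", images, "missing labels:", labels)
```

If both lists come back empty for valid.txt, the file layout is probably fine and the label contents are the next thing to inspect.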

cym0301 commented 4 years ago

I have tried both setting train and valid to valid.txt and setting train and valid to train.txt. It is still not working.

KimalIsaev commented 4 years ago

Alexey, is there some way to find out what exactly is incorrect in the dataset?

AlexeyAB commented 4 years ago

> I have tried both setting train and valid to valid.txt and setting train and valid to train.txt.

If you train with an incorrect dataset, then Darknet should create the bad.list and bad_label.list files.

KimalIsaev commented 4 years ago

Darknet doesn't create bad.list and bad_label.list files.
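
When those files are not produced, the label files can still be checked by hand. The sketch below is a minimal Python example, assuming the usual YOLO label layout of `<class_id> <x_center> <y_center> <width> <height>` per line, with coordinates relative to the image size. The function name `check_label_lines` and the exact range checks are illustrative assumptions, not the checks Darknet itself performs.

```python
def check_label_lines(lines, num_classes):
    """Return a list of (line_number, message) problems for one label file."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # blank lines / empty files: image without objects
        fields = line.split()
        if len(fields) != 5:
            problems.append((number, f"expected 5 fields, got {len(fields)}"))
            continue
        try:
            class_id = int(fields[0])
            x, y, w, h = (float(value) for value in fields[1:])
        except ValueError:
            problems.append((number, "could not parse values"))
            continue
        if not 0 <= class_id < num_classes:
            problems.append(
                (number, f"class id {class_id} outside 0..{num_classes - 1}")
            )
        if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
            problems.append((number, "box center outside the image"))
        if not (0.0 < w <= 1.0 and 0.0 < h <= 1.0):
            problems.append((number, "width/height outside (0, 1]"))
    return problems


# Made-up sample lines for a dataset with classes = 1.
sample = [
    "0 0.5 0.5 0.2 0.3",
    "1 0.5 0.5 0.2 0.3",
    "0 1.4 0.5 0.2 0.3",
    "0 0.5 0.5 0.2",
]
for number, message in check_label_lines(sample, num_classes=1):
    print(f"line {number}: {message}")
```

The `num_classes` argument should match the `classes` value in the .data file; a class id that is too large for the cfg is one of the easier mistakes to make when merging datasets.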

AlexeyAB commented 4 years ago

@cym0301 If you can successfully train with the Train dataset but can't train with the Valid dataset, then send the Valid dataset to alexeyab84@gmail.com (or send a URL to a Google Drive folder with your Valid dataset).

KimalIsaev commented 4 years ago

After 200 steps with just Train dataset darknet again gives the same error.

KimalIsaev commented 4 years ago

Even if I set batch=1 and have just 1 image in the dataset, it gives a similar error (screenshot: dontknow) after something like 100 iterations.

AlexeyAB commented 4 years ago

> Strange thing: if I set valid same as train, like this:
>
> classes= 1
> train  = train.txt
> valid  = train.txt
> names = obj.names
> backup = backup/
>
> Everything working properly.

> After 200 steps with just Train dataset darknet again gives the same error.

So do you have an issue or not?

KimalIsaev commented 4 years ago

I have the issue after 200 steps.

AlexeyAB commented 4 years ago

@KimalIsaev

Show a screenshot from: open \darknet.sln -> (right click on project) -> properties -> CUDA C/C++ -> Device
Attach your cfg file in a zip.
Run training with the -show_imgs flag and show a screenshot.
Did you check your dataset using Yolo_mark?

MrCuiHao commented 4 years ago

I have the same problem when training csresnext50-panet-spp.cfg on MS COCO.

KimalIsaev commented 4 years ago

I have trained on Pascal VOC; training goes without any issue.
CUDA C/C++ -> Device
slow_plate.zip
If I run with the -show_imgs flag, it shows about 6 images and then stops.
I used Yolo_mark to create my dataset, then used Python to create empty txt files and black-and-white copies of the labeled images. The final dataset wasn't checked. Doing it right now.

AlexeyAB commented 4 years ago

@KimalIsaev Download the latest Darknet code.

Try setting compute_30,sm_30 or compute_61,sm_61 there (properties -> CUDA C/C++ -> Device)

and recompile.
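
As a side note for readers, the `compute_XX,sm_XX` values follow the GPU's CUDA compute capability. NVIDIA's published tables list the GeForce GTX 1080 and GTX 1080 Ti as compute capability 6.1, which corresponds to `compute_61,sm_61`. The helper below is only a small Python sketch of that naming rule; `code_generation_entry` is a made-up name, and the table holds just the two GPUs mentioned in this thread.

```python
def code_generation_entry(compute_capability):
    """Turn a compute capability such as '6.1' into 'compute_61,sm_61'."""
    major, minor = compute_capability.strip().split(".")
    digits = f"{int(major)}{int(minor)}"
    return f"compute_{digits},sm_{digits}"


# Compute capabilities of the GPUs mentioned in this thread.
KNOWN_GPUS = {
    "GeForce GTX 1080": "6.1",
    "GeForce GTX 1080 Ti": "6.1",
}

for name, capability in KNOWN_GPUS.items():
    print(name, "->", code_generation_entry(capability))
```

For other cards, look up the compute capability first and pass it in as a string such as "7.5".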

dontknow

cym0301 commented 4 years ago

I just tried the latest commit d88a9eb with my dataset. I ran about 40 iterations and no error occurred. However, the current average loss became NaN.

KimalIsaev commented 4 years ago

With compute_30,sm_30 and the latest commit: itch2 (screenshot)
With compute_35,sm_35 and the latest commit: otch (screenshot)
With compute_61,sm_61 and the latest commit: atch (screenshot)
Every time on the 3rd iteration.

MrCuiHao commented 4 years ago

Hi, it's OK when training yolov3.cfg on Pascal VOC, but not with csresnext50-panet-spp.cfg. Could you share your successful training experience with csresnext50-panet-spp.cfg, including hardware info, environment configuration, training process and so on, or tell me the last commit on which you successfully trained csresnext50-panet-spp.cfg? Looking forward to your reply.

KimalIsaev commented 4 years ago

This is the cfg file: pasc.zip. It's nearly identical to csresnext50-panet-spp-original-optimal.cfg; the number of classes is changed to 20, and the learning rate is a little higher than the original. I have an Nvidia GeForce GTX 1080 Ti. Training was done on 67c91e6.

AlexeyAB commented 4 years ago

@KimalIsaev Use the latest commit.

AlexeyAB commented 4 years ago

@MrCuiHao

> It's ok when training yolov3.cfg on pascal voc, while not on the csresnext50-panet-spp.cfg,

What do you mean?

cym0301 commented 4 years ago

@AlexeyAB I sent you my dataset a few days ago. Please check it and comment on what I should do next.

KimalIsaev commented 4 years ago

@AlexeyAB With the latest commit and compute_35,sm_35, sometimes the old error pops up, and sometimes this one: memacc (screenshot). With compute_61,sm_61:

AlexeyAB commented 4 years ago

@KimalIsaev

Download the new latest commit.
Open \darknet.sln -> (right click on project) -> properties -> C/C++ -> Preprocessor -> Definitions -> Preprocessor Definitions, and add CUDA_DEBUG; at the start of this line.
Recompile Darknet.
Run training, and show screenshots of the errors. (It will run slower; this is normal.)

AlexeyAB commented 4 years ago

@cym0301 Can you do the same? https://github.com/AlexeyAB/darknet/issues/4657#issuecomment-575261274

KimalIsaev commented 4 years ago

@AlexeyAB Recompile with Release or Debug option?

AlexeyAB commented 4 years ago

Release

KimalIsaev commented 4 years ago

@AlexeyAB 1000 steps without any error. If I remove CUDA_DEBUG, the error appears again.

MrCuiHao commented 4 years ago

> @MrCuiHao
>
> It's ok when training yolov3.cfg on pascal voc, while not on the csresnext50-panet-spp.cfg,
>
> What do you mean?

I mean that after training for more than about 100 steps, the error occurs as follows (screenshot). But when I add CUDA_DEBUG on the latest commit, the error has not occurred yet. It's so strange, and it runs slower than before.

MrCuiHao commented 4 years ago

I found something interesting: when I don't add CUDA_DEBUG on the latest commit, GPU-Util is about 70%~90%, but only about 20% with CUDA_DEBUG. Does this have anything to do with it?

AlexeyAB commented 4 years ago

@cym0301 @KimalIsaev

How much CPU RAM do you have?
Can you successfully train without CUDA_DEBUG; and with the flag -benchmark_layers at the end of the training command?
Can you successfully train if you remove CUDA_DEBUG; and CUDNN; and CUDNN_HALF; from open \darknet.sln -> (right click on project) -> properties -> C/C++ -> Preprocessor -> Definitions -> Preprocessor Definitions, and recompile?
Try installing new GPU drivers and a new CUDA version.

> add CUDA_DEBUG at latest commit, the GPU-Util occupy about 70%~90%, but about 20% with CUDA_DEBUG

This is normal.

KimalIsaev commented 4 years ago

- CPU RAM = 32 GB
- Without CUDA_DEBUG and with the flag -benchmark_layers: 500 iterations without an error

KimalIsaev commented 4 years ago

Without CUDNN the project can't be built; an error pops up in Visual Studio.

AlexeyAB commented 4 years ago

@KimalIsaev

> -Without CUDA_DEBUG and with flag -benchmark_layers 500 iterations without an error

It is very strange.

> Without CUDNN project can't be build, in the visual studio error pops out.

Can you show a screenshot of this error? Did you remove CUDNN;CUDNN_HALF; ?

KimalIsaev commented 4 years ago

vcerror

3>convolutional_kernels.obj : error LNK2019: unresolved external symbol cudnn_handle referenced in function backward_convolutional_layer_gpu
3>convolutional_kernels.obj : error LNK2019: unresolved external symbol cudnn_check_error_extended referenced in function backward_convolutional_layer_gpu
3>C:\project\darknet-master\Release\darknet.exe : fatal error LNK1120: 2 unresolved externals
2>scale_channels_layer.c
3>Done building project "darknet.vcxproj" -- FAILED.

Sorry, the original output was in Russian.

KimalIsaev commented 4 years ago

Training on the same dataset with tiny-yolo and standard settings goes without an error.

KimalIsaev commented 4 years ago

> -Without CUDA_DEBUG and with flag -benchmark_layers 500 iterations without an error
>
> It is very strange.

After 1000 more steps I got an error: deb (screenshot)

AlexeyAB commented 4 years ago

@KimalIsaev I added a fix.

Try downloading the new Darknet and do the same: train without CUDA_DEBUG and with the flag -benchmark_layers for more than 1000 iterations, and show a screenshot of the error.
Do you use CMake for compiling? Or do you use the default build/darknet/darknet.sln file?

KimalIsaev commented 4 years ago

I use Cmake.

AlexeyAB / darknet

status Error dark_cuda.c : cuda_push_array() : Line 458 #4657