@cym0301 Hi,
Try to use the latest commit. Do you still get this error?
The same error still occurs after recompiling with the latest code.
Hi.
Where can I find those 2 files?
They will be created next to the darknet.exe file if there are errors in your training dataset.
Run this file and show screenshot. nvidia-smi.zip
Show the content of the innoiris.data file.
Run training, press Pause, and show a screenshot like this:
classes = 4
train = data/train.txt
valid = data/valid.txt
names = data/innoiris.names
backup = backup/
3. Training Screenshot
![image](https://user-images.githubusercontent.com/17352505/72164983-acfb8d80-3401-11ea-9db5-5b4256124542.png)
Try to train it with another dataset. Or share your dataset, I will try to train with it.
Hi, I have a similar problem: I'm using an Nvidia GeForce GTX 1080 Ti and Windows 10, and I compiled by using CMake. There is no bad.list and no bad_label.list. I'm using the latest version of darknet.
Execution of nvidia-smi.exe:
Content of obj.data
classes= 1
train = train.txt
valid = test.txt
names = obj.names
backup = backup/
Strange thing: if I set valid to the same file as train, like this:
classes= 1
train = train.txt
valid = train.txt
names = obj.names
backup = backup/
Everything works properly.
@KimalIsaev Your valid dataset is incorrect.
How do I send you my dataset for you to try? I've updated to the latest version and still no luck.
Try to set and train:
train = valid.txt
valid = valid.txt
Also check your valid dataset by using Yolo_mark.
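If you want a quick check outside of Yolo_mark, a minimal sketch along these lines can flag obviously broken label files. It is not part of darknet and assumes the usual YOLO layout: each line of valid.txt is an image path, the label file is the same path with a .txt extension, and each label line is "class cx cy w h" with values in [0,1].

```c
/* Minimal sketch (not part of darknet): walk a list file such as valid.txt
 * and flag label files that would confuse training. Assumes the usual YOLO
 * layout: one image path per line, label is the same path with a .txt
 * extension, each label line is "class cx cy w h" with values in [0,1]. */
#include <stdio.h>
#include <string.h>

void check_list(const char *list_path, int num_classes)
{
    char img[1024], lbl[1024];
    FILE *list = fopen(list_path, "r");
    if (!list) { printf("cannot open %s\n", list_path); return; }

    while (fgets(img, sizeof(img), list)) {
        img[strcspn(img, "\r\n")] = 0;          /* strip newline */
        if (!img[0]) continue;

        strcpy(lbl, img);
        char *dot = strrchr(lbl, '.');          /* image.jpg -> image.txt */
        if (dot) strcpy(dot, ".txt");

        FILE *f = fopen(lbl, "r");
        if (!f) { printf("missing label: %s\n", lbl); continue; }

        int cls; float cx, cy, w, h;
        while (fscanf(f, "%d %f %f %f %f", &cls, &cx, &cy, &w, &h) == 5) {
            if (cls < 0 || cls >= num_classes ||
                cx < 0 || cx > 1 || cy < 0 || cy > 1 ||
                w <= 0 || w > 1 || h <= 0 || h > 1)
                printf("bad box in %s: %d %f %f %f %f\n", lbl, cls, cx, cy, w, h);
        }
        fclose(f);
    }
    fclose(list);
}

int main(void)
{
    check_list("valid.txt", 1);   /* classes = 1, as in the obj.data above */
    return 0;
}
```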
I have tried both setting train and valid to valid.txt and setting train and valid to train.txt. It is still not working.
Alexey, is there some way to find out what exactly is incorrect in the dataset?
> I have tried both setting train and valid to valid.txt and setting train and valid to train.txt.

If you train with an incorrect dataset, then it should create the bad.list and bad_label.list files.
Darknet doesn't create bad.list and bad_label.list files.
@cym0301 If you can successfully train with the Train dataset, but can't train with the Valid dataset, then send the Valid dataset to alexeyab84@gmail.com (or send a URL to Google Drive with your valid dataset)
After 200 steps with just the Train dataset, darknet again gives the same error.
Even if I set batch=1 and have just 1 image in the dataset, it gives a similar error after something like 100 iterations.
> Strange thing: if I set valid to the same file as train, like this:
> classes= 1
> train = train.txt
> valid = train.txt
> names = obj.names
> backup = backup/
> Everything works properly.
>
> After 200 steps with just the Train dataset, darknet again gives the same error.

So do you have an issue or not?
I have the issue after 200 steps.
@KimalIsaev
- Show a screenshot from: open \darknet.sln -> (right click on project) -> properties -> CUDA C/C++ -> Device
- Attach your cfg file in a zip
- Run training with the -show_imgs flag and show a screenshot
Did you check your dataset by using Yolo_mark?
I have the same trouble as they do when training csresnext50-panet-spp.cfg on MS COCO.
@KimalIsaev Download the latest code of Darknet.
Try to set it there to:
compute_30,sm_30
or
compute_61,sm_61
and recompile
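If you are not sure which value matches your GPU, a small standalone check like the one below (plain CUDA runtime API, not part of darknet) prints the compute capability; a GTX 1080 / 1080 Ti should report 6.1, i.e. compute_61,sm_61.

```c
/* Small helper (not part of darknet): print the compute capability of each
 * visible GPU so you know which compute_XX,sm_XX to select before recompiling. */
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        printf("no CUDA device found\n");
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        struct cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        /* e.g. major=6, minor=1 -> compute_61,sm_61 */
        printf("GPU %d: %s, compute capability %d.%d\n",
               i, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```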
I just tried the latest commit d88a9eb with my dataset. I ran for like 40 iterations and no error occurred. However, the current average loss became nan.
With the latest commit and compute_30,sm_30:
With the latest commit and compute_35,sm_35:
With the latest commit and compute_61,sm_61:
Every time, the error appears on the 3rd iteration.
Hi, training works fine with yolov3.cfg on Pascal VOC, but not with csresnext50-panet-spp.cfg. Could you share your successful training experience with csresnext50-panet-spp.cfg, including hardware info, environment configuration, training process and so on, or point to the last commit with which you trained csresnext50-panet-spp.cfg successfully? Looking forward to your reply.
@KimalIsaev Use the latest commit.
@MrCuiHao
> It's ok when training yolov3.cfg on Pascal VOC, while not on csresnext50-panet-spp.cfg.

What do you mean?
@AlexeyAB I have sent you my dataset a few days ago. Please check and comment on what I should do next.
@AlexeyAB
With the latest commit and compute_35,sm_35, sometimes the old error pops out, sometimes this one:
With compute_61,sm_61:
@KimalIsaev
Download the new latest commit
Open \darknet.sln -> (right click on project) -> properties -> C/C++ -> Preprocessor -> Preprocessor Definitions
Add CUDA_DEBUG; at the start of this line
Recompile Darknet
Run training, and show screenshots of errors. (it will run slower - this is normal)
@cym0301 Can you do the same? https://github.com/AlexeyAB/darknet/issues/4657#issuecomment-575261274
@AlexeyAB Recompile with Release or Debug option?
Release
@AlexeyAB 1000 steps without any error. If I remove CUDA_DEBUG, the error appears again.
> @MrCuiHao
> It's ok when training yolov3.cfg on Pascal VOC, while not on csresnext50-panet-spp.cfg.
> What do you mean?

I mean that after training for more than about 100 steps, the error occurs as follows; but when I add CUDA_DEBUG with the latest commit, the error has not occurred yet. It's so strange, and it runs slower than before.
I found a little secret: when I don't add CUDA_DEBUG with the latest commit, the GPU-Util is about 70%~90%, but about 20% with CUDA_DEBUG. Does this have anything to do with it?
@cym0301 @KimalIsaev
How much CPU RAM do you have?
Can you successfully train without CUDA_DEBUG; and with the flag -benchmark_layers at the end of the training command?
Can you successfully train if you remove CUDA_DEBUG; and CUDNN; and CUDNN_HALF; from open \darknet.sln -> (right click on project) -> properties -> C/C++ -> Preprocessor -> Preprocessor Definitions, and recompile?
Try to install new GPU Drivers and new CUDA version.
> When I don't add CUDA_DEBUG, the GPU-Util is about 70%~90%, but about 20% with CUDA_DEBUG.

This is normal.
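That drop is expected if CUDA_DEBUG adds a synchronization after kernel launches, which is the usual purpose of such a switch; a rough sketch of the pattern (not darknet's exact code) shows why errors surface earlier and why the GPU sits idle more:

```c
/* Rough sketch of the usual pattern behind a CUDA_DEBUG-style switch
 * (not darknet's exact code). Kernel launches are asynchronous, so a
 * failing kernel normally surfaces much later at an unrelated call.
 * Forcing cudaDeviceSynchronize() after every launch reports the fault
 * at the launch site, but also stalls the pipeline, which is why
 * GPU utilization drops and training runs slower. */
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

void check_cuda(cudaError_t status, const char *where)
{
    if (status != cudaSuccess) {
        fprintf(stderr, "CUDA error at %s: %s\n", where, cudaGetErrorString(status));
        exit(EXIT_FAILURE);
    }
}

void after_kernel_launch(const char *where)
{
    check_cuda(cudaGetLastError(), where);      /* launch-configuration errors */
#ifdef CUDA_DEBUG
    check_cuda(cudaDeviceSynchronize(), where); /* wait for the kernel so async
                                                   faults (e.g. illegal memory
                                                   access) are reported here */
#endif
}
```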
- CPU RAM = 32 GB
- Without CUDA_DEBUG and with flag -benchmark_layers: 500 iterations without an error
Without CUDNN the project can't be built; an error pops up in Visual Studio.
@KimalIsaev
> Without CUDA_DEBUG and with flag -benchmark_layers: 500 iterations without an error

It is very strange.
> Without CUDNN the project can't be built; an error pops up in Visual Studio.

Can you show a screenshot of this error?
Did you remove CUDNN;CUDNN_HALF;?
3>convolutional_kernels.obj : error LNK2019: unresolved external symbol cudnn_handle referenced in function backward_convolutional_layer_gpu
3>convolutional_kernels.obj : error LNK2019: unresolved external symbol cudnn_check_error_extended referenced in function backward_convolutional_layer_gpu
3>C:\project\darknet-master\Release\darknet.exe : fatal error LNK1120: 2 unresolved externals
2>scale_channels_layer.c
3>Build of project "darknet.vcxproj" failed.
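For context, an LNK2019 like this usually means convolutional_kernels still references the cuDNN helpers while their definitions were compiled out, e.g. because the CUDA files and the C files get separate preprocessor definitions in the Visual Studio project. A hedged sketch of the guard mismatch (the function names are taken from the error text; the real darknet sources may be organized differently):

```c
/* Sketch only, not darknet's actual code. The defining unit compiles
 * cudnn_handle() only when CUDNN is set, while a calling unit still
 * references it -> unresolved external symbol at link time. */
#ifdef CUDNN
#include <cudnn.h>

cudnnHandle_t cudnn_handle(void)
{
    static cudnnHandle_t handle;
    static int initialized = 0;
    if (!initialized) { cudnnCreate(&handle); initialized = 1; }
    return handle;
}
#endif

void backward_convolutional_layer_gpu(void)
{
#ifdef CUDNN
    /* every call site needs the same guard; if this file still sees CUDNN
       while the defining file does not, the linker reports LNK2019 */
    cudnnHandle_t h = cudnn_handle();
    (void)h;
#endif
}
```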
Training on the same dataset with tiny-yolo and standard settings goes without an error.
> Without CUDA_DEBUG and with flag -benchmark_layers: 500 iterations without an error
> It is very strange.

After 1000 more steps I've got an error.
@KimalIsaev I added a fix.
Try to download the new Darknet and do the same: train without CUDA_DEBUG and with the flag -benchmark_layers for more than 1000 iterations, and show a screenshot of the error.
Do you use CMake for compiling? Or do you use the default build/darknet/darknet.sln file?
I use CMake.
Hi everyone,
I am a beginner in object detection, and currently I am trying out csresnext50-panet-spp. I started the training with the command "darknet.exe detector train data/innoiris.data cfg/innoiris.cfg csresnext50-panet-spp.conv.112 -map" and the configuration file attached. During training, the error shown in the screenshot occurred. May I know if it is caused by a wrong configuration on my part or by other hardware issues? (I am using OpenCV 4.2 with CUDA 10.2 and cuDNN 7.6.5.32, as well as one GTX 1080, for training.) I am not using the latest version of darknet but 6878ecc instead.
innoiris.txt