Open cym0301 opened 4 years ago
@KimalIsaev
You can also try disabling CUDNN and CUDNN_HALF in CMake, then press Generate -> Open Project -> Recompile and train. What error do you get?
@AlexeyAB
Try downloading the new Darknet and do the same: train without CUDA_DEBUG and with the -benchmark_layers flag for more than 1000 iterations, and show a screenshot of the error.
After 10000 iterations no error.
Hi, I have the same problem: a cuda_push_array() error at line 457.
I tried using -show_imgs
and received these images.
Is this normal? Also, using -show_imgs requires manual keyboard intervention.
I am using mscoco with csresnext50-panet-spp-original-optimal.cfg
command line
darknet detector train data/coco.data csresnext50-panet-spp-original-optimal.cfg darknet53.conv.74 -show_imgs
Running on Windows 10, GeForce 1080 Ti.
Compiled using CMake and VS2019, latest Darknet version.
If you can, please help to clarify. Thanks.
I would also like to ask: for MSCOCO, max_batches is 500200. Does it really need so many iterations when using darknet53.conv.74? This can take a very long time.
Hi, training yolov3.cfg on Pascal VOC works fine, but training csresnext50-panet-spp.cfg does not. Could you share your successful training experience with csresnext50-panet-spp.cfg, including hardware info, environment configuration, training process and so on? Or which commit was your last successful training run on csresnext50-panet-spp.cfg? Looking forward to your reply.
Hi @MrCuiHao, did you solve the problem for csresnext50-panet-spp.cfg?
@mwindowshz Did you try training with the -benchmark_layers flag?
@KimalIsaev Yes, I did try, but it seemed to be a slower process, and GPU load is 20-60% instead of being up in the 80-90% range. So it seemed like it would take much more time to train, and there are 500200 iterations set in the cfg file. But OK, I am trying again now. I got to 70 iterations with no crash; I will update.
I am also trying to train yolov3.cfg, which works, but my graph looks like this. Is this OK?
Thanks
@mwindowshz It's slower, but for now it's the only solution. Or find the bug and change the source code.
OK. It would be interesting to know which version of Darknet was used with csresnext50-panet-spp.cfg before the bug.
Using -benchmark_layers is very, very slow: it has been running for almost 20 hours and is only at 3500 iterations on a 1080 Ti.
And about the learning graph with yolov3.cfg that I posted above, is this normal, does anyone have experience with this? https://user-images.githubusercontent.com/17617515/76191918-1cff7780-61e9-11ea-9406-abd13c092409.png
thanks.
@mwindowshz
Try to disable CUDNN and CUDNN_HALF in CMake, then press Generate -> Open Project -> Recompile and train without -benchmark_layers. What error do you get?
And about the learning graph with yolov3.cfg that I posted above, is this normal, does anyone have experience with this? https://user-images.githubusercontent.com/17617515/76191918-1cff7780-61e9-11ea-9406-abd13c092409.png
This is normal.
Hi
Still error
CUDA status Error: file: C:\Users\protrack\source\repos\darknet\src\dark_cuda.c : cuda_push_array() : line: 457 : build time: Mar 12 2020 - 11:45:30
CUDA Error: unspecified launch failure
Also, for each yolo layer there is this message:
137 yolo
[yolo] params: iou loss: ciou (4), iou_norm: 0.07, cls_norm: 1.00, scale_x_y: 1.05
nms_kind: greedynms (1), beta = 0.600000
Unused field: 'uc_normalizer = 0.07'
Unused field: 'beta1 = 0.6'
I don't have Enable Zed Camera. Do I need to download something, or is this just another available option?
These are the settings used:
@mwindowshz
Still error
CUDA status Error: file: C:\Users\protrack\source\repos\darknet\src\dark_cuda.c : cuda_push_array() : line: 457 : build time: Mar 12 2020 - 11:45:30 CUDA Error: unspecified launch failure
Do you get this error with the -benchmark_layers flag?
What error do you get with the -benchmark_layers flag?
Also try to download the latest Darknet version and try to run:
- with the -cuda_debug_sync flag
- with the -benchmark_layers -cuda_debug_sync flags
and show all errors.
Hi, running regular training gives an error:
CUDA status Error: file: C:\Users\...\source\repos\darknet\src\dark_cuda.c : cuda_push_array() : line: 469 : build time: Mar 18 2020 - 13:03:04
CUDA Error: unspecified launch failure
CUDA Error: unspecified launch failure: No error
Assertion failed: 0, file C:\Users\...\source\repos\darknet\src\utils.c, line 325
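The error reports in this thread differ in line number (457 in the Mar 12 build, 469 in the later builds) but name the same function, so when comparing reports it may help to pull the fields out of the status line rather than compare raw text. Below is a minimal sketch of that, assuming the status line always has the layout shown above; `parse_cuda_status` and the regular expression are hypothetical helpers, not part of Darknet. Note that CUDA kernel failures are generally reported asynchronously, so the function named in the line is where the error was detected, not necessarily where it was caused.

```python
import re

# Layout assumed from the messages in this thread:
# "CUDA status Error: file: <path> : <func>() : line: <n> : build time: <stamp>"
CUDA_STATUS_RE = re.compile(
    r"CUDA status Error: file: (?P<file>.+?) : (?P<func>\w+)\(\) : "
    r"line: (?P<line>\d+) : build time: (?P<build>.+)$"
)


def parse_cuda_status(text):
    """Return a dict with file, func, line, build, or None if text does not match."""
    m = CUDA_STATUS_RE.search(text.strip())
    if m is None:
        return None
    info = m.groupdict()
    info["line"] = int(info["line"])  # numeric, so builds can be compared
    return info


if __name__ == "__main__":
    sample = (
        r"CUDA status Error: file: C:\Users\...\source\repos\darknet\src\dark_cuda.c"
        r" : cuda_push_array() : line: 469 : build time: Mar 18 2020 - 13:03:04"
    )
    print(parse_cuda_status(sample))
```

Grouping reports by `func` instead of `line` keeps reports from different builds comparable, since the line number shifts whenever dark_cuda.c changes.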
Using -benchmark_layers:
there is no error, but training is very, very slow.
Using -cuda_debug_sync:
there was no error, but it seems slower.
Also, when loading the cfg file, there are these messages on the yolo layer saying that some fields are not used. Why? Is it because this version of Darknet does not support them?
[yolo] params: iou loss: ciou (4), iou_norm: 0.07, cls_norm: 1.00, scale_x_y: 1.05
nms_kind: greedynms (1), beta = 0.600000
Unused field: 'uc_normalizer = 0.07'
Unused field: 'beta1 = 0.6'
@mwindowshz
CUDA status Error: file: C:\Users\...\source\repos\darknet\src\dark_cuda.c : cuda_push_array() : line: 469 : build time: Mar 18 2020 - 13:03:04
CUDA Error: unspecified launch failure
CUDA Error: unspecified launch failure: No error
Assertion failed: 0, file C:\Users\...\source\repos\darknet\src\utils.c, line 325
Unused field: 'uc_normalizer = 0.07'
Is only for Gaussian-yolo
Unused field: 'beta1 = 0.6'
Just isn't required at all.
Hi, I have CUDA 10.2:
D:\Training\mscoco>nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:32:27_Pacific_Daylight_Time_2019
Cuda compilation tools, release 10.2, V10.2.89
cuDNN 7.6.5
Visual Studio 2019
CUDNN 7.4.2 is not compatible with this version.
I compiled with CUDNN_HALF and CUDNN.
I tried compiling without CUDNN_HALF and without CUDNN: it does not crash, but the loss is NaN.
(next mAP calculation at 7329 iterations)
10: nan, nan avg loss, 0.000000 rate, 31.047000 seconds, 640 images
Resizing, random_coef = 1.40
One example line:
v3 (iou loss, Normalizer: (iou: 0.07, cls: 1.00) Region 137 Avg (IOU: nan, GIOU: nan), Class: nan, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000, count: 12, class_loss = 12.000000, iou_loss = 0.000000, total_loss = 12.000000
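A NaN loss like the one in the status line above is easy to miss in a long log, so it can be useful to scan the training output for the first iteration at which it appears. The following is a minimal sketch, assuming the per-iteration line keeps the layout `<iter>: <loss>, <avg> avg loss, <rate> rate, <secs> seconds, <n> images` as printed in this thread; `parse_status` and `first_nan_iteration` are hypothetical helper names, and the handling of spellings such as `-nan(ind)` is an assumption about how some platforms print NaN.

```python
import math
import re

# Layout assumed from the line quoted in this thread:
# "10: nan, nan avg loss, 0.000000 rate, 31.047000 seconds, 640 images"
STATUS_RE = re.compile(
    r"^\s*(?P<iter>\d+): (?P<loss>\S+), (?P<avg>\S+) avg loss, "
    r"(?P<rate>\S+) rate, (?P<secs>\S+) seconds, (?P<images>\d+) images"
)


def _to_float(text):
    # Assumption: any token containing "nan" (e.g. "-nan(ind)") means NaN.
    return math.nan if "nan" in text.lower() else float(text)


def parse_status(line):
    """Parse one per-iteration status line; return None for other lines."""
    m = STATUS_RE.match(line)
    if m is None:
        return None
    return {
        "iteration": int(m["iter"]),
        "loss": _to_float(m["loss"]),
        "avg_loss": _to_float(m["avg"]),
        "rate": float(m["rate"]),
        "seconds": float(m["secs"]),
        "images": int(m["images"]),
    }


def first_nan_iteration(lines):
    """Return the first iteration whose loss or average loss is NaN, else None."""
    for line in lines:
        status = parse_status(line)
        if status is None:
            continue
        if math.isnan(status["loss"]) or math.isnan(status["avg_loss"]):
            return status["iteration"]
    return None
```

Running `first_nan_iteration` over the lines of a saved training log gives the iteration to report alongside the build options (with or without CUDNN and CUDNN_HALF) that produced it.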
Removing only CUDNN_HALF resulted in this error:
CUDA status Error: file: C:\Users\....\source\repos\darknet\src\dark_cuda.c : cuda_push_array() : line: 469 : build time: Mar 23 2020 - 11:46:32
CUDA Error: unspecified launch failure
CUDA Error: unspecified launch failure: No error
Assertion failed: 0, file C:\Users\...\source\repos\darknet\src\utils.c, line 325
removing only CUDNN_HALF resulted with error
CUDA status Error: file: C:\Users\....\source\repos\darknet\src\dark_cuda.c : cuda_push_array() : line: 469 : build time: Mar 23 2020 - 11:46:32
CUDA Error: unspecified launch failure
CUDA Error: unspecified launch failure: No error
Assertion failed: 0, file C:\Users\...\source\repos\darknet\src\utils.c, line 325
Can you show the error together with the message printed before it?
Did you train with the -benchmark_layers -cuda_debug_sync flags?
Hi,
I did not understand.
Using a normal build with CUDNN and CUDNN_HALF,
and using the flags -benchmark_layers
and -cuda_debug_sync
separately,
training worked without a crash, but it was very slow, so I did not complete training.
Should the flags be used together?
should the flags be used together ?
Only for debugging, to catch the place where the error occurs. I can't reproduce your error.
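To keep the debugging runs discussed here comparable, the four variants (no flags, each flag alone, both together) can be written down once and launched from the same base command. This is only a sketch: `training_commands` is a hypothetical helper, the executable name `darknet` is a placeholder, and the flag names are taken from the comments in this thread, with the plural spelling `-benchmark_layers` assumed.

```python
def training_commands(data, cfg, weights, exe="darknet"):
    """Build argv lists for the debugging variants discussed in this thread."""
    base = [exe, "detector", "train", data, cfg, weights]
    variants = {
        "plain": [],                                         # normal training run
        "benchmark_layers": ["-benchmark_layers"],           # flag spelling assumed
        "cuda_debug_sync": ["-cuda_debug_sync"],
        "both": ["-benchmark_layers", "-cuda_debug_sync"],   # debugging only
    }
    return {name: base + flags for name, flags in variants.items()}


if __name__ == "__main__":
    cmds = training_commands(
        "data/coco.data",
        "csresnext50-panet-spp-original-optimal.cfg",
        "darknet53.conv.74",
    )
    for name, argv in cmds.items():
        print(f"{name}: {' '.join(argv)}")
```

Each list can be passed to `subprocess.run` as is. Since both flags slow training considerably according to the reports above, the combined variant is only worth running until the error location has been captured.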
Hi, I uploaded a dump file of the crash (dump). Can this help? Thanks.
Hi Alexey, I have the same issue. Is there any update on this?
Hi everyone,
I am a beginner in object detection and am currently trying out csresnext50-panet-spp. I started training with the command "darknet.exe detector train data/innoiris.data cfg/innoiris.cfg csresnext50-panet-spp.conv.112 -map" and the attached configuration file. During training, the error shown in the screenshot occurred. May I know whether it is caused by a wrong configuration or by a hardware issue? (I am using OpenCV 4.2 with CUDA 10.2 and cuDNN 7.6.5.32, and one GTX 1080 for training.) I am not using the latest version of Darknet but 6878ecc instead.
innoiris.txt