status Error dark_cuda.c : cuda_push_array() : Line 458

cym0301 commented 4 years ago

Hi everyone,

I am a beginner in object detection and am currently trying out csresnext50-panet-spp. I started training with the command "darknet.exe detector train data/innoiris.data cfg/innoiris.cfg csresnext50-panet-spp.conv.112 -map" and the attached configuration file. During training, the error shown in the screenshot occurred. Is it caused by a mistake in my configuration or by a hardware issue? I am training on one GTX1080 with OpenCV 4.2, CUDA 10.2, and cuDNN 7.6.5.32. I am not using the latest version of darknet but commit 6878ecc instead.

innoiris.txt

AlexeyAB commented 4 years ago

@KimalIsaev

You can also try disabling CUDNN and CUDNN_HALF in CMake, then press Generate -> OpenProject -> Recompile and train. What error do you get?

KimalIsaev commented 4 years ago

@AlexeyAB

Try to download new Darknet and do the same - train without CUDA_DEBUG and with flag -benchmark_layers for more than 1000 iterations, and show screenshot of error.

After 10000 iterations no error.

mwindowshz commented 4 years ago

Hi, I have the same problem: the cuda_push_array() error at line 457.

I tried using -show_imgs and got these images.

Is this normal? Also, using -show_imgs requires manual keyboard intervention.

I am using mscoco with csresnext50-panet-spp-original-optimal.cfg

Command line: darknet detector train data/coco.data csresnext50-panet-spp-original-optimal.cfg darknet53.conv.74 -show_imgs

Running on Windows 10 with a GeForce 1080 Ti.

Compiled with CMake and VS2019, latest Darknet version.

If you can, please help to clarify. Thanks.

Also, I would like to ask: for MSCOCO, max_batches is 500200. Does it really need so many iterations when starting from darknet53.conv.74? This can take a very long time.

mwindowshz commented 4 years ago

Hi, training works with yolov3.cfg on Pascal VOC, but not with csresnext50-panet-spp.cfg. Could you share your successful training experience with csresnext50-panet-spp.cfg, including hardware info, environment configuration, training process, and so on? Or which commit was your last successful training run with csresnext50-panet-spp.cfg? Looking forward to your reply.

Hi @MrCuiHao, did you solve the problem for csresnext50-panet-spp.cfg?

KimalIsaev commented 4 years ago

@mwindowshz Did you try training with the -benchmark_layers flag?

mwindowshz commented 4 years ago

@KimalIsaev Yes, I did try, but it seemed to be a slower process, and GPU load was 20-60% instead of up in the 80-90% range. So it seemed like it would take much more time to train, and there are 500200 iterations set in the cfg file. But OK, I am trying again now. It got to 70 iterations with no crash; I will update.

Also, I am trying to train yolov3.cfg. This works, but my graph looks like this. Is this OK? chart_yolov3

Thanks

KimalIsaev commented 4 years ago

@mwindowshz It's slower, but for now it's the only solution. Otherwise, find the bug and change the source code.

mwindowshz commented 4 years ago

OK. It would be interesting to know which version of darknet was used with csresnext50-panet-spp.cfg before the bug appeared.

Using -benchmark_layers is very, very slow: after almost 20 hours it is only at 3500 iterations on a 1080 Ti.
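To put that slowdown in numbers, here is a rough linear extrapolation from the figures quoted in this thread (3500 iterations in about 20 hours, max_batches of 500200). It is only a sketch: the function name is illustrative, and it assumes the iteration rate stays constant, which multi-scale resizing and periodic mAP calculation would likely violate.

```python
def estimate_total_hours(done_iterations, elapsed_hours, max_batches):
    """Linearly extrapolate total training time from the progress so far."""
    if done_iterations <= 0:
        raise ValueError("done_iterations must be positive")
    hours_per_iteration = elapsed_hours / done_iterations
    return hours_per_iteration * max_batches


# Figures taken from the comments above; constant speed is an assumption.
total = estimate_total_hours(done_iterations=3500, elapsed_hours=20.0, max_batches=500200)
print(f"{total:.0f} hours, about {total / 24:.0f} days")  # 2858 hours, about 119 days
```

At that pace a full run with -benchmark_layers would take months, which is why the flag is only practical for locating the fault.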

And about the learning graph with yolov3.cfg that I posted above, is this normal, does anyone have experience with this? https://user-images.githubusercontent.com/17617515/76191918-1cff7780-61e9-11ea-9406-abd13c092409.png

thanks.

AlexeyAB commented 4 years ago

@mwindowshz

Try disabling CUDNN and CUDNN_HALF in CMake, then press Generate -> OpenProject -> Recompile and train without -benchmark_layers. What error do you get?

And about the learning graph with yolov3.cfg that I posted above, is this normal, does anyone have experience with this? https://user-images.githubusercontent.com/17617515/76191918-1cff7780-61e9-11ea-9406-abd13c092409.png

This is normal.

mwindowshz commented 4 years ago

Hi

Still error

CUDA status Error: file: C:\Users\protrack\source\repos\darknet\src\dark_cuda.c : cuda_push_array() : line: 457 : build time: Mar 12 2020 - 11:45:30
CUDA Error: unspecified launch failure

Also, for each yolo layer there is this message:

 137 yolo
[yolo] params: iou loss: ciou (4), iou_norm: 0.07, cls_norm: 1.00, scale_x_y: 1.05
nms_kind: greedynms (1), beta = 0.600000
Unused field: 'uc_normalizer = 0.07'
Unused field: 'beta1 = 0.6'

I don't have Enable Zed Camera. Do I need to download something, or is this just another available option?

These are the settings used:

AlexeyAB commented 4 years ago

@mwindowshz

Still error

CUDA status Error: file: C:\Users\protrack\source\repos\darknet\src\dark_cuda.c : cuda_push_array() : line: 457 : build time: Mar 12 2020 - 11:45:30
CUDA Error: unspecified launch failure

Do you get this error with the -benchmark_layers flag?

What error do you get with the -benchmark_layers flag?

Also, try downloading the latest Darknet version and running:

with the -cuda_debug_sync flag
with the -benchmark_layers -cuda_debug_sync flags

and show all errors.

mwindowshz commented 4 years ago

Hi, running regular training gives an error:

CUDA status Error: file: C:\Users\...\source\repos\darknet\src\dark_cuda.c : cuda_push_array() : line: 469 : build time: Mar 18 2020 - 13:03:04

 CUDA Error: unspecified launch failure

CUDA Error: unspecified launch failure: No error
Assertion failed: 0, file C:\Users\...\source\repos\darknet\src\utils.c, line 325

Using -benchmark_layers there is no error, but training is very, very slow.

Using -cuda_debug_sync I had no error, but it seems slower.

Also, when loading the cfg file, there are these messages on the yolo layer saying that some fields are not being used. Why? Is it because this version of darknet does not support them?

[yolo] params: iou loss: ciou (4), iou_norm: 0.07, cls_norm: 1.00, scale_x_y: 1.05
nms_kind: greedynms (1), beta = 0.600000
Unused field: 'uc_normalizer = 0.07'
Unused field: 'beta1 = 0.6'

AlexeyAB commented 4 years ago

@mwindowshz

CUDA status Error: file: C:\Users\...\source\repos\darknet\src\dark_cuda.c : cuda_push_array() : line: 469 : build time: Mar 18 2020 - 13:03:04

CUDA Error: unspecified launch failure

CUDA Error: unspecified launch failure: No error
Assertion failed: 0, file C:\Users\...\source\repos\darknet\src\utils.c, line 325

Do you get this error if you use cuDNN 7.4.2?
Do you get this error if you compile Darknet without cuDNN?

Unused field: 'uc_normalizer = 0.07'

This is only used for Gaussian-yolo.

Unused field: 'beta1 = 0.6'

This just isn't required at all.

mwindowshz commented 4 years ago

Hi, I have CUDA 10.2:

D:\Training\mscoco>nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:32:27_Pacific_Daylight_Time_2019
Cuda compilation tools, release 10.2, V10.2.89

cuDNN 7.6.5

Visual Studio 2019

cuDNN 7.4.2 is not compatible with this version.

I compiled with CUDNN_HALF and CUDNN.

I tried compiling without CUDNN_HALF and without CUDNN: it does not crash, but the training loss is NaN.

 (next mAP calculation at 7329 iterations)
 10: nan, nan avg loss, 0.000000 rate, 31.047000 seconds, 640 images
Resizing, random_coef = 1.40

one line example: v3 (iou loss, Normalizer: (iou: 0.07, cls: 1.00) Region 137 Avg (IOU: nan, GIOU: nan), Class: nan, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000, count: 12, class_loss = 12.000000, iou_loss = 0.000000, total_loss = 12.000000
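A diverged run like this can be caught early by scanning the training output for NaN losses. The following is a minimal sketch, assuming the iteration summary lines keep the format shown above ("10: nan, nan avg loss, ..."); the function names and the regular expression are illustrative and are not part of Darknet.

```python
import math
import re

# Matches Darknet iteration summaries such as:
#   " 10: nan, nan avg loss, 0.000000 rate, 31.047000 seconds, 640 images"
SUMMARY_RE = re.compile(
    r"^\s*(\d+):\s*([^,\s]+),\s*([^,\s]+) avg loss,\s*([^,\s]+) rate"
)


def to_float(text):
    """Convert a logged number; any NaN spelling (e.g. 'nan', '-nan(ind)') becomes NaN."""
    if "nan" in text.lower():
        return math.nan
    return float(text)


def parse_summary(line):
    """Return (iteration, loss, avg_loss) for a summary line, or None for other lines."""
    match = SUMMARY_RE.match(line)
    if match is None:
        return None
    return int(match.group(1)), to_float(match.group(2)), to_float(match.group(3))


def first_nan_iteration(lines):
    """Return the first iteration whose loss or average loss is NaN, else None."""
    for line in lines:
        parsed = parse_summary(line)
        if parsed and (math.isnan(parsed[1]) or math.isnan(parsed[2])):
            return parsed[0]
    return None


# Lines copied from the output above.
log = [
    " (next mAP calculation at 7329 iterations)",
    " 10: nan, nan avg loss, 0.000000 rate, 31.047000 seconds, 640 images",
    "Resizing, random_coef = 1.40",
]
print(first_nan_iteration(log))  # 10
```

Feeding the saved console output through `first_nan_iteration` shows how soon the loss diverged, which helps when comparing builds with and without cuDNN.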

Removing only CUDNN_HALF resulted in this error:

CUDA status Error: file: C:\Users\....\source\repos\darknet\src\dark_cuda.c : cuda_push_array() : line: 469 : build time: Mar 23 2020 - 11:46:32

 CUDA Error: unspecified launch failure

CUDA Error: unspecified launch failure: No error
Assertion failed: 0, file C:\Users\...\source\repos\darknet\src\utils.c, line 325

AlexeyAB commented 4 years ago

removing only CUDNN_HALF resulted with error

CUDA status Error: file: C:\Users\....\source\repos\darknet\src\dark_cuda.c : cuda_push_array() : line: 469 : build time: Mar 23 2020 - 11:46:32

CUDA Error: unspecified launch failure

CUDA Error: unspecified launch failure: No error
Assertion failed: 0, file C:\Users\...\source\repos\darknet\src\utils.c, line 325

Can you show the error together with the messages printed before it? Did you train with the -benchmark_layers -cuda_debug_sync flags?

mwindowshz commented 4 years ago

Hi, I did not understand. With a normal build (CUDNN and CUDNN_HALF enabled), using the flags -benchmark_layers and -cuda_debug_sync separately worked and there was no crash, but training is very slow, so I did not complete training. Should the flags be used together?

AlexeyAB commented 4 years ago

should the flags be used together ?

Only for debugging, to locate where the error occurs. I can't reproduce your error.
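Some background on why synchronizing helps locate the fault: CUDA kernel launches are asynchronous, so a failing kernel is typically reported by a later synchronizing call. That would explain why the message names cuda_push_array() rather than the layer that actually failed. The toy model below illustrates the idea in plain Python. `ToyDevice`, `sync_after_launch`, and the kernel names are hypothetical, nothing here calls CUDA or Darknet, and it assumes that -cuda_debug_sync forces a synchronization after each GPU call, as the flag name suggests.

```python
class ToyDevice:
    """Toy model of asynchronous execution; this is not CUDA or Darknet code."""

    def __init__(self, sync_after_launch=False):
        # sync_after_launch plays the role of a debug-sync switch (assumption).
        self.sync_after_launch = sync_after_launch
        self.pending = []  # launched work that has not run yet

    def launch(self, name, work):
        """Queue work and return immediately, like an asynchronous kernel launch."""
        self.pending.append((name, work))
        if self.sync_after_launch:
            self.synchronize(call_site=name)

    def synchronize(self, call_site):
        """Run all queued work; a failure is blamed on whoever synchronized."""
        queued, self.pending = self.pending, []
        for _name, work in queued:
            try:
                work()
            except Exception as exc:
                raise RuntimeError(f"failure reported at {call_site}") from exc

    def push_array(self):
        """A blocking copy: it waits for earlier work, so it synchronizes."""
        self.synchronize(call_site="push_array")


def bad_kernel():
    raise ValueError("out-of-bounds access")


def where_reported(sync_after_launch):
    """Return the call site at which the failure of bad_kernel becomes visible."""
    device = ToyDevice(sync_after_launch)
    try:
        device.launch("good_kernel", lambda: None)
        device.launch("bad_kernel", bad_kernel)
        device.launch("later_kernel", lambda: None)
        device.push_array()
    except RuntimeError as err:
        return str(err)
    return "no failure"


print(where_reported(sync_after_launch=False))  # failure reported at push_array
print(where_reported(sync_after_launch=True))   # failure reported at bad_kernel
```

Without synchronization the failure surfaces at the copy, far from its cause. With synchronization after every launch it surfaces at the faulty step, at the cost of speed, which matches the slowdown reported in this thread.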

mwindowshz commented 4 years ago

Hi, I uploaded a dump file of the crash. Can this help? dump Thanks

VisionEp1 commented 4 years ago

Hi Alexey, I have the same issue. Is there any update on this?

AlexeyAB / darknet

status Error dark_cuda.c : cuda_push_array() : Line 458 #4657