[BUG] training YOLOv4-tiny illegal memory access

KristianBalaj commented 2 years ago

My training session crashes with the following error:

(next mAP calculation at 1053 iterations) ]2;1053/5000: loss=0.2 hours left=1.6
 1053: 0.166034, 0.199170 avg loss, 0.002610 rate, 1.101552 seconds, 67392 images, 1.565802 hours left

 calculation mAP (mean average precision)...
 Detection layer: 30 - type = 28 
 Detection layer: 37 - type = 28 
4CUDA status Error: file: ./src/network_kernels.cu : () : line: 735 : build time: Mar 20 2022 - 13:44:48 

 CUDA Error: an illegal memory access was encountered
Darknet error location: ./src/dark_cuda.c, check_error, line #69
CUDA Error: an illegal memory access was encountered: Success

I'm using the following pretrained weights: https://github.com/AlexeyAB/darknet/releases/download/darknet_yolo_v4_pre/yolov4-tiny.conv.29

I'm trying to train the following model (cfg file):

```
[net]
# Testing
#batch=1
#subdivisions=1
# Training
batch=64
subdivisions=16
width=704
height=512
channels=3
momentum=0.9
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1

learning_rate=0.00261
burn_in=1000
max_batches = 5000
policy=steps
steps=11200,12600
scales=.1,.1

#weights_reject_freq=1001
#ema_alpha=0.9998
#equidistant_point=1000
#num_sigmas_reject_badlabels=3
#badlabels_rejection_percentage=0.2

[convolutional]
batch_normalize=1
filters=32
size=3
stride=2
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=64
size=3
stride=2
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=64
size=3
stride=1
pad=1
activation=leaky

[route]
layers=-1
groups=2
group_id=1

[convolutional]
batch_normalize=1
filters=32
size=3
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=32
size=3
stride=1
pad=1
activation=leaky

[route]
layers = -1,-2

[convolutional]
batch_normalize=1
filters=64
size=1
stride=1
pad=1
activation=leaky

[route]
layers = -6,-1

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=leaky

[route]
layers=-1
groups=2
group_id=1

[convolutional]
batch_normalize=1
filters=64
size=3
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=64
size=3
stride=1
pad=1
activation=leaky

[route]
layers = -1,-2

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[route]
layers = -6,-1

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[route]
layers=-1
groups=2
group_id=1

[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=leaky

[route]
layers = -1,-2

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[route]
layers = -6,-1

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

##################################

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

[convolutional]
size=1
stride=1
pad=1
filters=36
activation=linear

[yolo]
mask = 3,4,5
anchors = 10,14, 23,27, 37,58, 81,82, 135,169, 344,319
classes=7
num=6
jitter=.3
scale_x_y = 1.05
cls_normalizer=1.0
iou_normalizer=0.07
iou_loss=ciou
ignore_thresh = .7
truth_thresh = 1
random=0
resize=1.5
nms_kind=greedynms
beta_nms=0.6
#new_coords=1
#scale_x_y = 2.0

[route]
layers = -4

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[upsample]
stride=2

[route]
layers = -1, 23

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[convolutional]
size=1
stride=1
pad=1
filters=36
activation=linear

[yolo]
mask = 1,2,3
anchors = 10,14, 23,27, 37,58, 81,82, 135,169, 344,319
classes=7
num=6
jitter=.3
scale_x_y = 1.05
cls_normalizer=1.0
iou_normalizer=0.07
iou_loss=ciou
ignore_thresh = .7
truth_thresh = 1
random=0
resize=1.5
nms_kind=greedynms
beta_nms=0.6
#new_coords=1
#scale_x_y = 2.0
```
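As a side check on the cfg above, here is a minimal sketch in C of two consistency rules that apply to custom Darknet configs. The helper names `yolo_filters` and `steps_within_max_batches` are hypothetical and not part of Darknet. The first rule is the usual `filters = (classes + 5) * masks` relation for the `[convolutional]` layer directly before each `[yolo]` layer; the second checks whether the `steps` values fall inside `max_batches`. Nothing in this thread ties either rule to the crash.

```c
/* Filters required by the [convolutional] layer directly before a [yolo]
 * layer: each of the `masks` anchors predicts 4 box coordinates,
 * 1 objectness score and one score per class. */
int yolo_filters(int classes, int masks)
{
    return (classes + 5) * masks;
}

/* Returns 1 if every learning-rate step lies within the training run,
 * 0 if at least one step is beyond max_batches and would never fire. */
int steps_within_max_batches(const int *steps, int n_steps, int max_batches)
{
    int i;
    for (i = 0; i < n_steps; ++i) {
        if (steps[i] > max_batches) {
            return 0;
        }
    }
    return 1;
}
```

For this cfg, `yolo_filters(7, 3)` gives 36, which matches `filters=36`. With `steps=11200,12600` and `max_batches = 5000`, the second check returns 0, so the learning-rate drops would never be reached during the run.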

I run the training with the following command on Kaggle:

!./darknet detector train diploma/obj_categorical.data yolov4-tiny-512x704_7_categories.cfg yolov4-tiny.conv.29 -dont_show -map

I've trained YOLOv3 and YOLOv4 previously with no problems, but YOLOv4-tiny fails like this.

I last trained YOLOv4 successfully on 10 March. I always use the latest master branch for every training run.

KristianBalaj commented 2 years ago

The problem remains even at commit b4d03f88028594c3073c41f60cd469d03fb3ee8e.

When training from scratch, it fails at iteration 1000:

(next mAP calculation at 1000 iterations) ]2;1000/5000: loss=0.4 hours left=1.2
 1000: 0.361770, 0.379232 avg loss, 0.002610 rate, 1.336520 seconds, 64000 images, 1.221786 hours left

 calculation mAP (mean average precision)...
 Detection layer: 30 - type = 28 
 Detection layer: 37 - type = 28 
4CUDA status Error: file: ./src/network_kernels.cu : () : line: 735 : build time: Mar 22 2022 - 20:33:47 

 CUDA Error: an illegal memory access was encountered
Darknet error location: ./src/dark_cuda.c, check_error, line #69
CUDA Error: an illegal memory access was encountered: Success

KristianBalaj commented 2 years ago

I resolved this some time ago, but sorry, I can't remember what the actual issue was. I think it had something to do with subdivisions.
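For context on the `subdivisions` setting: Darknet splits each batch into `batch / subdivisions` images per GPU pass, so this value controls how many images are in GPU memory at once. The sketch below works through the numbers from this issue. The function names are hypothetical, and the thread does not establish that this setting caused the crash.

```c
/* Images pushed through the GPU in a single forward/backward pass. */
int mini_batch(int batch, int subdivisions)
{
    return batch / subdivisions;
}

/* Total images consumed after a given number of training iterations. */
long images_seen(long iterations, int batch)
{
    return iterations * batch;
}
```

With `batch=64` and `subdivisions=16`, `mini_batch` returns 4. The image counters in the logs are consistent with `batch=64`: 1053 iterations correspond to 67392 images and 1000 iterations to 64000 images.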

chgoatherd commented 2 years ago

I solved it by modifying the function "void copy_weights_net(network net_train, network* net_map)":

    void copy_weights_net(network net_train, network* net_map)
    {
        int k;

        for (k = 0; k < net_train.n; ++k) {
            layer* l = &(net_train.layers[k]);
            layer tmp_layer;

            copy_cudnn_descriptors(net_train.layers[k], &tmp_layer);
            net_map->layers[k] = net_train.layers[k];
            copy_cudnn_descriptors(tmp_layer, &net_train.layers[k]);

            if (l->type == CRNN)
            {
                layer tmp_input_layer, tmp_self_layer, tmp_output_layer;

                copy_cudnn_descriptors(*net_train.layers[k].input_layer, &tmp_input_layer);
                copy_cudnn_descriptors(*net_train.layers[k].self_layer, &tmp_self_layer);
                copy_cudnn_descriptors(*net_train.layers[k].output_layer, &tmp_output_layer);
                net_map->layers[k].input_layer = net_train.layers[k].input_layer;
                net_map->layers[k].self_layer = net_train.layers[k].self_layer;
                net_map->layers[k].output_layer = net_train.layers[k].output_layer;
                //net_map->layers[k].output_gpu = net_map->layers[k].output_layer->output_gpu;  // already copied out of if()

                copy_cudnn_descriptors(tmp_input_layer, net_train.layers[k].input_layer);
                copy_cudnn_descriptors(tmp_self_layer, net_train.layers[k].self_layer);
                copy_cudnn_descriptors(tmp_output_layer, net_train.layers[k].output_layer);
            }
            else if (l->input_layer) // for AntiAliasing
            {
                layer tmp_input_layer;

                copy_cudnn_descriptors(*net_train.layers[k].input_layer, &tmp_input_layer);
                net_map->layers[k].input_layer = net_train.layers[k].input_layer;
                copy_cudnn_descriptors(tmp_input_layer, net_train.layers[k].input_layer);
            }

            net_map->layers[k].batch = 1;
            net_map->layers[k].steps = 1;
            net_map->layers[k].train = 0;
        }
    }
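The function above is built around a save / assign / restore pattern: a struct assignment copies every field, pointers included, so any handle that must survive the assignment has to be saved beforehand and written back afterwards. The following is a minimal sketch of that general pattern only. `toy_layer`, `copy_desc` and `shallow_copy_keep_desc` are hypothetical stand-ins, not Darknet types or functions, and the sketch keeps the destination's handle, which is just one possible choice. In the code posted above, the descriptors are saved from and restored to `net_train`.

```c
#include <stddef.h>

/* Hypothetical stand-in for a layer: `weights` is meant to be shared,
 * `desc` stands in for a per-network handle such as a cuDNN descriptor. */
typedef struct {
    float *weights;
    int    batch;
    void  *desc;
} toy_layer;

/* Copy only the handle from src into dst; other fields stay untouched. */
void copy_desc(toy_layer src, toy_layer *dst)
{
    dst->desc = src.desc;
}

/* Make *dst a shallow copy of src while keeping dst's own handle. */
void shallow_copy_keep_desc(toy_layer src, toy_layer *dst)
{
    toy_layer saved = {0};

    copy_desc(*dst, &saved);  /* 1. remember dst's handle */
    *dst = src;               /* 2. copies all fields; pointers now alias src */
    copy_desc(saved, dst);    /* 3. write dst's handle back */
    dst->batch = 1;           /* the evaluation copy processes one image at a time */
}
```

After the call, `dst->weights` points at the same memory as `src.weights`, so the two layers share weights, while `dst->desc` is unchanged. Which side's handles are saved and restored decides which descriptors the evaluation network ends up using.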

AlexeyAB / darknet

[BUG] training YOLOv4-tiny illegal memory access #8424