AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet)
http://pjreddie.com/darknet/

CUDA Error + AVG Loss NaN #5547

Closed VisionEp1 closed 4 years ago

VisionEp1 commented 4 years ago

Hi, after a while it's once again me with some questions/bugs.

I just compiled and installed the latest Darknet version. However, when training a dataset that I previously used successfully (or at least I am 90% sure I did), I get strange behavior with multiple problems:

  1. Training stops once in a while:

`v3 (iou loss, Normalizer: (iou: 0.07, cls: 1.00) Region 161 Avg (IOU: 0.324191, GIOU: 0.239091), Class: 0.512335, Obj: 0.209440, No Obj: 0.196831, .5R: 0.307692, .75R: 0.000000, count: 13, class_loss = 30643.501953, iou_loss = 78.072266, total_loss = 30721.574219 CUDA status Error: file: X:\Bitbucket\COMPAILE-Dev\symbol_detection_train_2020\darknet-master\src\dark_cuda.c : cuda_push_array() : line: 469 : build time: May 7 2020 - 11:33:15

CUDA Error: an illegal instruction was encountered CUDA Error: an illegal instruction was encountered: No error`

  2. avg loss starts very high (normal), decreases very slowly, and then goes to -nan before 1k iterations

I train with only 1 GPU and used multiple config files, for example yolov4_custom.cfg but also several yolov3 configs, with the same issues, which I never had before.

I also tried to minimize the dataset, but again it's hard to test with the CUDA error. I thought maybe something is incompatible (are there any unsupported OpenCV versions, or might Intel MKL be the problem?). Otherwise it is the default CUDA + cuDNN (built on Windows with make, like always). Any ideas?

And thanks in advance
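To narrow down a library mismatch like this, it helps to record the exact versions in play before rebuilding. Below is a minimal sketch of such a check; it assumes a Python environment with the cv2 bindings and nvcc/nvidia-smi on PATH, none of which are specified in this thread:

```python
# Hedged environment dump -- assumes Python + cv2 bindings + CUDA tools on PATH.
import subprocess

try:
    import cv2
    print("OpenCV:", cv2.__version__)
except ImportError:
    print("OpenCV Python bindings not installed")

# nvcc reports the CUDA toolkit version darknet would be compiled against.
print(subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout.strip())

# nvidia-smi -L lists the driver-visible GPUs (useful when several cards are installed).
print(subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True).stdout.strip())
```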

VisionEp1 commented 4 years ago

Also, just skimming the other issues, it seems this NaN issue has occurred multiple times in the last weeks. Was anything changed in the "core" of Darknet?

AlexeyAB commented 4 years ago
VisionEp1 commented 4 years ago

[screenshots]

PS: I also installed the Intel libs + another OpenCV lib (and rebuilt with CMake, of course), but same result.

AlexeyAB commented 4 years ago

[screenshots]




However, when training a dataset that I previously used successfully (or at least I am 90% sure I did), I get strange behavior with multiple problems:

Should I just try any old version?

Yes, if you previously used Darknet successfully.

VisionEp1 commented 4 years ago

Hi, sadly I was unable to install CUDA 10.0 since CMake says CUDA not found (it's in PATH, I restarted everything). I am going to reinstall CUDA 10.2 and give you the screenshots.

VisionEp1 commented 4 years ago

[screenshots] It still goes to NaN after a few iterations:

[screenshot]

Note: the file yolov4.cfg is yolov4_custom.cfg with classes etc. changed according to my dataset.

AlexeyAB commented 4 years ago

Can you attach your cfg-file?

VisionEp1 commented 4 years ago

Sure. I am also now 99% sure it's a lib issue.

I changed my dataset to 5 images per class (so training loss should be close to 0) and it still goes to NaN. (I made the resolution a bit smaller so I see results faster.)

`[net]
# Testing
#batch=1
#subdivisions=1
# Training
batch=64
subdivisions=16
width=320
height=320
channels=3
momentum=0.949
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1

learning_rate=0.001
burn_in=1000
max_batches = 500500
policy=steps
steps=400000,450000
scales=.1,.1

#cutmix=1
mosaic=1

# :104x104 54:52x52 85:26x26 104:13x13 for 416

[convolutional] batch_normalize=1 filters=32 size=3 stride=1 pad=1 activation=mish

# Downsample

[convolutional] batch_normalize=1 filters=64 size=3 stride=2 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=64 size=1 stride=1 pad=1 activation=mish

[route] layers = -2

[convolutional] batch_normalize=1 filters=64 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=32 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=64 size=3 stride=1 pad=1 activation=mish

[shortcut] from=-3 activation=linear

[convolutional] batch_normalize=1 filters=64 size=1 stride=1 pad=1 activation=mish

[route] layers = -1,-7

[convolutional] batch_normalize=1 filters=64 size=1 stride=1 pad=1 activation=mish

# Downsample

[convolutional] batch_normalize=1 filters=128 size=3 stride=2 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=64 size=1 stride=1 pad=1 activation=mish

[route] layers = -2

[convolutional] batch_normalize=1 filters=64 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=64 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=64 size=3 stride=1 pad=1 activation=mish

[shortcut] from=-3 activation=linear

[convolutional] batch_normalize=1 filters=64 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=64 size=3 stride=1 pad=1 activation=mish

[shortcut] from=-3 activation=linear

[convolutional] batch_normalize=1 filters=64 size=1 stride=1 pad=1 activation=mish

[route] layers = -1,-10

[convolutional] batch_normalize=1 filters=128 size=1 stride=1 pad=1 activation=mish

# Downsample

[convolutional] batch_normalize=1 filters=256 size=3 stride=2 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=128 size=1 stride=1 pad=1 activation=mish

[route] layers = -2

[convolutional] batch_normalize=1 filters=128 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=128 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=128 size=3 stride=1 pad=1 activation=mish

[shortcut] from=-3 activation=linear

[convolutional] batch_normalize=1 filters=128 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=128 size=3 stride=1 pad=1 activation=mish

[shortcut] from=-3 activation=linear

[convolutional] batch_normalize=1 filters=128 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=128 size=3 stride=1 pad=1 activation=mish

[shortcut] from=-3 activation=linear

[convolutional] batch_normalize=1 filters=128 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=128 size=3 stride=1 pad=1 activation=mish

[shortcut] from=-3 activation=linear

[convolutional] batch_normalize=1 filters=128 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=128 size=3 stride=1 pad=1 activation=mish

[shortcut] from=-3 activation=linear

[convolutional] batch_normalize=1 filters=128 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=128 size=3 stride=1 pad=1 activation=mish

[shortcut] from=-3 activation=linear

[convolutional] batch_normalize=1 filters=128 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=128 size=3 stride=1 pad=1 activation=mish

[shortcut] from=-3 activation=linear

[convolutional] batch_normalize=1 filters=128 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=128 size=3 stride=1 pad=1 activation=mish

[shortcut] from=-3 activation=linear

[convolutional] batch_normalize=1 filters=128 size=1 stride=1 pad=1 activation=mish

[route] layers = -1,-28

[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=mish

# Downsample

[convolutional] batch_normalize=1 filters=512 size=3 stride=2 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=mish

[route] layers = -2

[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=256 size=3 stride=1 pad=1 activation=mish

[shortcut] from=-3 activation=linear

[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=256 size=3 stride=1 pad=1 activation=mish

[shortcut] from=-3 activation=linear

[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=256 size=3 stride=1 pad=1 activation=mish

[shortcut] from=-3 activation=linear

[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=256 size=3 stride=1 pad=1 activation=mish

[shortcut] from=-3 activation=linear

[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=256 size=3 stride=1 pad=1 activation=mish

[shortcut] from=-3 activation=linear

[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=256 size=3 stride=1 pad=1 activation=mish

[shortcut] from=-3 activation=linear

[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=256 size=3 stride=1 pad=1 activation=mish

[shortcut] from=-3 activation=linear

[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=256 size=3 stride=1 pad=1 activation=mish

[shortcut] from=-3 activation=linear

[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=mish

[route] layers = -1,-28

[convolutional] batch_normalize=1 filters=512 size=1 stride=1 pad=1 activation=mish

# Downsample

[convolutional] batch_normalize=1 filters=1024 size=3 stride=2 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=512 size=1 stride=1 pad=1 activation=mish

[route] layers = -2

[convolutional] batch_normalize=1 filters=512 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=512 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=512 size=3 stride=1 pad=1 activation=mish

[shortcut] from=-3 activation=linear

[convolutional] batch_normalize=1 filters=512 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=512 size=3 stride=1 pad=1 activation=mish

[shortcut] from=-3 activation=linear

[convolutional] batch_normalize=1 filters=512 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=512 size=3 stride=1 pad=1 activation=mish

[shortcut] from=-3 activation=linear

[convolutional] batch_normalize=1 filters=512 size=1 stride=1 pad=1 activation=mish

[convolutional] batch_normalize=1 filters=512 size=3 stride=1 pad=1 activation=mish

[shortcut] from=-3 activation=linear

[convolutional] batch_normalize=1 filters=512 size=1 stride=1 pad=1 activation=mish

[route] layers = -1,-16

[convolutional] batch_normalize=1 filters=1024 size=1 stride=1 pad=1 activation=mish stopbackward=800

##########################

[convolutional] batch_normalize=1 filters=512 size=1 stride=1 pad=1 activation=leaky

[convolutional] batch_normalize=1 size=3 stride=1 pad=1 filters=1024 activation=leaky

[convolutional] batch_normalize=1 filters=512 size=1 stride=1 pad=1 activation=leaky

# SPP

[maxpool] stride=1 size=5

[route] layers=-2

[maxpool] stride=1 size=9

[route] layers=-4

[maxpool] stride=1 size=13

[route] layers=-1,-3,-5,-6

# End SPP

[convolutional] batch_normalize=1 filters=512 size=1 stride=1 pad=1 activation=leaky

[convolutional] batch_normalize=1 size=3 stride=1 pad=1 filters=1024 activation=leaky

[convolutional] batch_normalize=1 filters=512 size=1 stride=1 pad=1 activation=leaky

[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=leaky

[upsample] stride=2

[route] layers = 85

[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=leaky

[route] layers = -1, -3

[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=leaky

[convolutional] batch_normalize=1 size=3 stride=1 pad=1 filters=512 activation=leaky

[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=leaky

[convolutional] batch_normalize=1 size=3 stride=1 pad=1 filters=512 activation=leaky

[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=leaky

[convolutional] batch_normalize=1 filters=128 size=1 stride=1 pad=1 activation=leaky

[upsample] stride=2

[route] layers = 54

[convolutional] batch_normalize=1 filters=128 size=1 stride=1 pad=1 activation=leaky

[route] layers = -1, -3

[convolutional] batch_normalize=1 filters=128 size=1 stride=1 pad=1 activation=leaky

[convolutional] batch_normalize=1 size=3 stride=1 pad=1 filters=256 activation=leaky

[convolutional] batch_normalize=1 filters=128 size=1 stride=1 pad=1 activation=leaky

[convolutional] batch_normalize=1 size=3 stride=1 pad=1 filters=256 activation=leaky

[convolutional] batch_normalize=1 filters=128 size=1 stride=1 pad=1 activation=leaky

##########################

[convolutional] batch_normalize=1 size=3 stride=1 pad=1 filters=256 activation=leaky

[convolutional] size=1 stride=1 pad=1 filters=69 activation=linear

[yolo] mask = 0,1,2 anchors = 12, 16, 19, 36, 40, 28, 36, 75, 76, 55, 72, 146, 142, 110, 192, 243, 459, 401 classes=18 num=9 jitter=.3 ignore_thresh = .7 truth_thresh = 1 scale_x_y = 1.2 iou_thresh=0.213 cls_normalizer=1.0 iou_normalizer=0.07 iou_loss=ciou nms_kind=greedynms beta_nms=0.6 max_delta=5 counters_per_class = 27, 248, 66, 1588, 8, 98, 140, 168, 81, 124, 31, 144, 101, 167, 26, 56, 105, 25

[route] layers = -4

[convolutional] batch_normalize=1 size=3 stride=2 pad=1 filters=256 activation=leaky

[route] layers = -1, -16

[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=leaky

[convolutional] batch_normalize=1 size=3 stride=1 pad=1 filters=512 activation=leaky

[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=leaky

[convolutional] batch_normalize=1 size=3 stride=1 pad=1 filters=512 activation=leaky

[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=leaky

[convolutional] batch_normalize=1 size=3 stride=1 pad=1 filters=512 activation=leaky

[convolutional] size=1 stride=1 pad=1 filters=69 activation=linear

[yolo] mask = 3,4,5 anchors = 12, 16, 19, 36, 40, 28, 36, 75, 76, 55, 72, 146, 142, 110, 192, 243, 459, 401 classes=18 num=9 jitter=.3 ignore_thresh = .7 truth_thresh = 1 scale_x_y = 1.1 iou_thresh=0.213 cls_normalizer=1.0 iou_normalizer=0.07 iou_loss=ciou nms_kind=greedynms beta_nms=0.6 max_delta=5 counters_per_class = 27, 248, 66, 1588, 8, 98, 140, 168, 81, 124, 31, 144, 101, 167, 26, 56, 105, 25

[route] layers = -4

[convolutional] batch_normalize=1 size=3 stride=2 pad=1 filters=512 activation=leaky

[route] layers = -1, -37

[convolutional] batch_normalize=1 filters=512 size=1 stride=1 pad=1 activation=leaky

[convolutional] batch_normalize=1 size=3 stride=1 pad=1 filters=1024 activation=leaky

[convolutional] batch_normalize=1 filters=512 size=1 stride=1 pad=1 activation=leaky

[convolutional] batch_normalize=1 size=3 stride=1 pad=1 filters=1024 activation=leaky

[convolutional] batch_normalize=1 filters=512 size=1 stride=1 pad=1 activation=leaky

[convolutional] batch_normalize=1 size=3 stride=1 pad=1 filters=1024 activation=leaky

[convolutional] size=1 stride=1 pad=1 filters=69 activation=linear

[yolo] mask = 6,7,8 anchors = 12, 16, 19, 36, 40, 28, 36, 75, 76, 55, 72, 146, 142, 110, 192, 243, 459, 401 classes=18 num=9 jitter=.3 ignore_thresh = .7 truth_thresh = 1 random=0 scale_x_y = 1.05 iou_thresh=0.213 cls_normalizer=1.0 iou_normalizer=0.07 iou_loss=ciou nms_kind=greedynms beta_nms=0.6 max_delta=5 counters_per_class = 27, 248, 66, 1588, 8, 98, 140, 168, 81, 124, 31, 144, 101, 167, 26, 56, 105, 25`
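As a side note on the cfg above: each [yolo] head uses 3 anchor masks, so the [convolutional] layer directly before it must have filters = (classes + 5) * 3, which for classes=18 gives the filters=69 used here. A tiny sketch of that arithmetic, with values taken from the cfg above:

```python
# filters in the [convolutional] layer right before each [yolo] layer
classes = 18      # classes= in every [yolo] section above
masks = 3         # mask = 0,1,2 / 3,4,5 / 6,7,8 -> 3 anchors per head
box_terms = 5     # x, y, w, h, objectness
print((classes + box_terms) * masks)  # 69, matching filters=69 above
```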

VisionEp1 commented 4 years ago

I also tested it now with csresnext50-panet-spp-original-optimal.cfg, here: https://pastebin.com/raw/82F85v8t

Same result. So I think it is most likely either a) not a cfg-file issue, or b) I got something totally wrong in all config files.

VisionEp1 commented 4 years ago

One more thing: my train.txt contains images like this:

X:\Datasets\symbols\train_5\146.jpg

and I train from a completely different folder (X:\Bitbucket\test\symbol_detection_train_2020\train_1).

But I also tested with -show_imgs and all seems fine just in case that might change anything.

And now I removed all but 5 images in total for training; it still jumps to -nan.
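Since train.txt points at images in a different folder than the one training is started from, one cheap check is that every listed image exists and has a matching darknet-format label file next to it. A minimal sketch; the list path here is hypothetical and the class count is taken from the cfg above:

```python
# Sanity-check a darknet train.txt: every image exists, has a .txt label next to it,
# and every label row is "<class> <cx> <cy> <w> <h>" with normalized coordinates.
import os

TRAIN_LIST = r"X:\Datasets\symbols\train.txt"  # hypothetical path -- use the one from obj.data
NUM_CLASSES = 18                               # classes= from the cfg above

with open(TRAIN_LIST) as f:
    for img_path in (line.strip() for line in f if line.strip()):
        label_path = os.path.splitext(img_path)[0] + ".txt"
        assert os.path.isfile(img_path), f"missing image: {img_path}"
        assert os.path.isfile(label_path), f"missing label: {label_path}"
        with open(label_path) as labels:
            for row in labels:
                cls, cx, cy, w, h = row.split()
                assert 0 <= int(cls) < NUM_CLASSES, f"bad class id in {label_path}"
                assert all(0.0 <= float(v) <= 1.0 for v in (cx, cy, w, h)), \
                    f"coords not normalized in {label_path}"
```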

AlexeyAB commented 4 years ago

But I also tested with -show_imgs and all seems fine just in case that might change anything.

VisionEp1 commented 4 years ago

But again, I don't think it should be hyperparameters with only 5 images in the train set. I think it should learn them by heart even if they were just noise (which they are not).

Any other ideas :( ? I am very happy to test out stuff to help you, just let me know what.

(I still have trying old Darknet versions on my todo list.)

VisionEp1 commented 4 years ago

Update: NaN also happens with a low learning rate, it just took roughly 8 times more iterations.

AlexeyAB commented 4 years ago

At what number of iterations did you get NaN? Can you attach these 5 images + 5 txt labels in a zip file?

VisionEp1 commented 4 years ago

sure:

145.zip

VisionEp1 commented 4 years ago

Could you reproduce the issue?

I am going to set up a different machine today with 2 x 1080 Ti (so no CUDNN_HALF and that stuff) and a fresh Windows, to see if that works.

Any Visual Studio compiler version you recommend? Is it safe to use the 2019 version with Darknet?

AlexeyAB commented 4 years ago

I successfully trained on your dataset with your cfg-file without any changes, for 1100 iterations: [screenshot]

VisionEp1 commented 4 years ago

Thanks for the update.

I use the simple default command with no pretrained weights: .\darknet.exe detector train .\symb.data .\yolov4.cfg

I am just setting up Darknet with Win10 x64, MSVS 2017, CUDA 10.0, cuDNN 7.4.2, OpenCV 4.2.0, RTX 2080 Ti and will test on this fresh machine. If it works, the "bug" must have something to do with:

So if I can provide any more information which might be helpful to you, let me know (other than whether I can reproduce on the new PC, of course).

VisionEp1 commented 4 years ago

OK, so it works on the freshly installed PC. I will try to change stuff and make those 2 systems more similar (at least software-wise); the new system has a 1080 Ti, sorry for the typo.

I would do it in this order:

  1. CUDNN_HALF flag (but that should be ignored for the first 3k iterations, right?)
  2. CUDA + cuDNN version
  3. swap GPUs?
AlexeyAB commented 4 years ago

I use the simple default command with no pretrained weights: .\darknet.exe detector train .\symb.data .\yolov4.cfg

darknet.exe detector train data\145\obj.data data\145\yolo.cfg yolov4.conv.137 -map

Read manual: https://github.com/AlexeyAB/darknet#how-to-train-to-detect-your-custom-objects

Right? So in case it works on the fresh PC, I will try to downgrade CUDA + cuDNN first and see if that helps? I do not only want to have a working setup, I want to help you fix the issue, if that makes sense.

Yes.

swap gpus?

Try this. Maybe there is a hardware bug in the GPU. This has already happened here several times: https://github.com/AlexeyAB/darknet/issues?q=is%3Aissue+label%3A%22Hardware+bug%22+is%3Aclosed

RTX 2080 Ti Owners Complain of Defects, Nvidia Responds (Update) https://www.tomshardware.com/news/rtx-2080-ti-gpu-defects-launch,37995.html

CUDA + cuDNN version

Maybe some CUDA/cuDNN versions have a bug. This has already happened here several times: https://github.com/AlexeyAB/darknet/issues/5007#issuecomment-600160767

VisionEp1 commented 4 years ago

Finally: it IS indeed some sort of hardware issue; when I plugged out all GPUs but one, it works.

So I have to do finer testing, maybe it's the PCIe extender cables, maybe it's the GPUs, or a combination.

But anyway, it's 100% a hardware-related problem, so this can be closed. Thanks a ton for your help and especially for testing it on your own hardware.

AlexeyAB commented 4 years ago

So

  1. If you plug in 3 x GPUs and run training on 1 GPU, then you get this issue?
  2. But if you plug out 2 GPUs and keep only 1 GPU, then you don't get this issue?
  3. And do you get the same issue on another PC, with another OS, cuDNN, CUDA, OpenCV version, if you do the same as in the first 2 points?
varghesealex90 commented 4 years ago

I ran into similar issues. ./darknet detector train cfg/facemask.data cfg/yolov3-tiny-facemask.cfg yolov3-tiny.conv.15 -map threw a CUDA error at the 1000th iteration.

After removing -map, the code started working.

The system has 7 x V100 and I was trying to train on GPU 2.