AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet )
http://pjreddie.com/darknet/

Subdivisions with CUDA error #3281

Open iskandari opened 5 years ago

iskandari commented 5 years ago

Hi Alexey, thanks for maintaining this amazing repo. I am trying to detect just one custom class and have followed the instructions exactly, but training fails with an error and I can't figure out why. I am running on Ubuntu with CUDA enabled, and I compiled darknet with make. My nvcc version and training command:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

$ ./darknet detector train /home/ubuntu/darknet/AlexeyAB/darknet/build/darknet/x64/data/obj.data /home/ubuntu/darknet/AlexeyAB/darknet/build/darknet/x64/cfg/yolo-obj.cfg /home/ubuntu/darknet/AlexeyAB/darknet/build/darknet/x64/darknet53.conv.74 -dont_show

I get this far:

Total BFLOPS 65.290
 Allocate additional workspace_size = 49.84 MB
Loading weights from /home/ubuntu/darknet/AlexeyAB/darknet/build/darknet/x64/darknet53.conv.74...
 seen 64
Done!
Learning Rate: 0.001, Momentum: 0.9, Decay: 0.0005
Resizing
608 x 608
 Try to set subdivisions=64 in your cfg-file.
CUDA status Error: file: ./src/dark_cuda.c : () : line: 235 : build time: May 30 2019 - 11:14:51
CUDA Error: out of memory
CUDA Error: out of memory: File exists
darknet: ./src/utils.c:293: error: Assertion `0' failed.
Aborted (core dumped)

The following is my file system:

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev             15G     0   15G   0% /dev
tmpfs           3.0G   17M  3.0G   1% /run
/dev/xvda1       49G   41G  8.4G  83% /
tmpfs            15G     0   15G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs            15G     0   15G   0% /sys/fs/cgroup
/dev/loop1       17M   17M     0 100% /snap/amazon-ssm-agent/784
/dev/loop0       88M   88M     0 100% /snap/core/5742
/dev/loop2       89M   89M     0 100% /snap/core/6964
/dev/loop3       18M   18M     0 100% /snap/amazon-ssm-agent/1335
tmpfs           3.0G  4.0K  3.0G   1% /run/user/1000

I have set subdivisions=64 in my yolo-obj.cfg and still get the same error. How should I proceed? Thanks for your help.
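
For readers following along: darknet splits each batch into batch / subdivisions images per GPU pass, so raising subdivisions lowers peak GPU memory. A minimal sketch of that arithmetic follows. The function name is hypothetical, and batch=64 is an assumption, since the poster's batch value is not shown in this thread.

```python
def images_per_gpu_pass(batch: int, subdivisions: int) -> int:
    """Return how many images are processed on the GPU at once.

    Assumes darknet's integer division of batch by subdivisions.
    """
    if subdivisions <= 0 or subdivisions > batch:
        raise ValueError("subdivisions must be between 1 and batch")
    return batch // subdivisions


# batch=64 is assumed here; it is not stated in the original post.
print(images_per_gpu_pass(64, 16))  # 4 images per pass
print(images_per_gpu_pass(64, 64))  # 1 image per pass
```

With batch=64 and subdivisions=64 only one image is processed per pass, so subdivisions cannot reduce memory use any further. If the out-of-memory error persists at that point, the remaining levers are the network input size and the random setting in the cfg.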

AlexeyAB commented 5 years ago

@iskandari Hi,

Please show the output of the nvidia-smi command.

Also try to train with random=0 in the last [yolo]-layer.
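
A possible way to apply the random=0 suggestion without editing the file by hand is sketched below. With random=1, darknet periodically resizes the network during training (the "Resizing 608 x 608" line in the log above appears to be one such resize), and larger sizes need more GPU memory. The helper name is hypothetical and the function only rewrites cfg text passed in as a string; it is an illustration, not part of darknet.

```python
import re


def set_random_in_last_yolo(cfg_text: str, value: int = 0) -> str:
    """Set random=<value> in the last [yolo] section of a darknet cfg."""
    lines = cfg_text.splitlines()

    # Find the header line of the last [yolo] section.
    last_yolo = None
    for i, line in enumerate(lines):
        if line.strip().lower() == "[yolo]":
            last_yolo = i
    if last_yolo is None:
        raise ValueError("no [yolo] section found")

    # Scan that section until the next header or the end of the file.
    for i in range(last_yolo + 1, len(lines)):
        stripped = lines[i].strip()
        if stripped.startswith("["):
            break
        if re.match(r"random\s*=", stripped):
            lines[i] = f"random={value}"
            return "\n".join(lines) + "\n"

    # No random key in the section: add one right after the header.
    lines.insert(last_yolo + 1, f"random={value}")
    return "\n".join(lines) + "\n"
```

Usage would be to read yolo-obj.cfg, pass its contents through the function, and write the result back; earlier [yolo] sections are left untouched.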

eddex commented 5 years ago

I just encountered the same issue. I was able to fix it by changing the value of subdivisions in yolov3-tiny-obj.cfg to 64:

batch=64
subdivisions=64

> nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56       Driver Version: 418.56       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 780     Off  | 00000000:02:00.0 N/A |                  N/A |
| 40%   60C    P0    N/A /  N/A |   1114MiB /  3018MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0                    Not Supported                                       |
+-----------------------------------------------------------------------------+
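
To read the free GPU memory out of a pasted nvidia-smi table like the one above, a small parser is enough. This is a sketch that assumes the "usedMiB / totalMiB" layout shown in this output; the function name is hypothetical, and only the first GPU row is read.

```python
import re


def parse_memory_usage(smi_output: str) -> tuple[int, int, int]:
    """Return (used, total, free) in MiB from nvidia-smi table text."""
    # Matches the Memory-Usage column, e.g. "1114MiB /  3018MiB".
    match = re.search(r"(\d+)MiB\s*/\s*(\d+)MiB", smi_output)
    if match is None:
        raise ValueError("no memory usage column found")
    used, total = int(match.group(1)), int(match.group(2))
    return used, total, total - used


row = "| 40%   60C    P0    N/A /  N/A |   1114MiB /  3018MiB |     N/A      Default |"
used, total, free = parse_memory_usage(row)
print(f"{used} MiB used of {total} MiB, {free} MiB free")
```

For the GTX 780 above this gives 1904 MiB free before training starts, which is consistent with needing subdivisions=64 even for the tiny model.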