AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet )
http://pjreddie.com/darknet/

"CUDA Error: out of memory" on Win10/4x 980 Tis after 10,500 iterations #4533

Closed. ryamldess closed this issue 4 years ago.

ryamldess commented 4 years ago

Hello, I checked the issue tracker for similar reports, and the only one that seemed related was this one:

https://github.com/AlexeyAB/darknet/issues/4119

...but ultimately it did not lead me to a solution. I am running darknet on a Windows 10 machine, compiled without Python bindings but with MKL + TBB, using Cmake-GUI (v3.16.0). I'm using OpenCV 4.1.2 and CUDA 10.2. Generation and compilation are done with Visual Studio Community 2019.

Here are my specs dumped from Speccy during training, including environment vars:

Summary
        Operating System
            Windows 10 Pro 64-bit
        CPU
            Intel Xeon E5 v3 @ 2.50GHz  30 °C
            Haswell-E/EP 22nm Technology
            Intel Xeon E5 v3 @ 2.50GHz  59 °C
            Haswell-E/EP 22nm Technology
        RAM
            64.0GB Single-Channel Unknown @ 1064MHz (15-15-15-35)
        Motherboard
            ASUSTeK COMPUTER INC. Z10PE-D8 WS (SOCKET 1)    100 °C
        Graphics
            DELL UP3216Q (3840x2160@30Hz)
            6144MB NVIDIA GeForce GTX 980 Ti (Dell) 83 °C
            6144MB NVIDIA GeForce GTX 980 Ti (Dell) 83 °C
            6144MB NVIDIA GeForce GTX 980 Ti (NVIDIA)   83 °C
            6144MB NVIDIA GeForce GTX 980 Ti (HP)   75 °C
            ASPEED Technology ASPEED Graphics Family(WDDM) (ASUStek Computer Inc)
            ForceWare version: 441.41
            SLI Disabled
        Storage
            1863GB Samsung SSD 970 EVO Plus 2TB (Unknown (SSD))
            3726GB Western Digital WDC WD40EZRX-00SPEB0 (SATA ) 31 °C
            931GB PHD 3.0 Silicon-Power USB Device (USB (SATA) )    32 °C
            3726GB Western Digital WDC WD40EZRX-00SPEB0 (SATA ) 32 °C
            3726GB Western Digital WDC WD40EZRX-00SPEB0 (SATA ) 32 °C
            3726GB Western Digital WDC WD40EZRX-00SPEB0 (SATA ) 30 °C
            5589GB Seagate ST6000DX000-1H217Z (SATA )   30 °C
            5589GB Seagate ST6000DX000-1H217Z (SATA )   30 °C
            5589GB Seagate ST6000DX000-1H217Z (SATA )   31 °C
            5589GB Seagate ST6000DX000-1H217Z (SATA )   29 °C
        Optical Drives
            ASUS BW-16D1HT
        Audio
            NVIDIA High Definition Audio
Operating System
    Windows 10 Pro 64-bit
    Computer type: Desktop
    Installation Date: 8/21/2019 6:55:10 AM
...
        Environment Variables
            ...
            SystemRoot  C:\Windows
                ...
                Machine Variables
                    ComSpec C:\Windows\system32\cmd.exe
                    CUDA_PATH   C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2
                    CUDA_PATH_V10_2 C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2
                    DriverData  C:\Windows\System32\Drivers\DriverData
                    INTEL_LICENSE_FILE  C:\Program Files (x86)\Common Files\Intel\Licenses
                    NUMBER_OF_PROCESSORS    48
                    NVCUDASAMPLES10_2_ROOT  C:\ProgramData\NVIDIA Corporation\CUDA Samples\v10.2
                    NVCUDASAMPLES_ROOT  C:\ProgramData\NVIDIA Corporation\CUDA Samples\v10.2
                    NVTOOLSEXT_PATH C:\Program Files\NVIDIA Corporation\NvToolsExt\
                    OpenCV_DIR  C:\Program Files\OpenCV\opencv-4.1.2\build
                    OS  Windows_NT
                    Path    C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\bin
                    C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\libnvvp
                    C:\Windows\system32
                    C:\Windows
                    C:\Windows\System32\Wbem
                    C:\Windows\System32\WindowsPowerShell\v1.0\
                    C:\Windows\System32\OpenSSH\
                    C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common
                    C:\Program Files\CMake\bin
                    C:\Users\M\AppData\Local\Programs\Python\Python38
                    C:\Program Files\dotnet\
                    C:\Program Files\Microsoft SQL Server\130\Tools\Binn\
                    C:\Program Files\Microsoft SQL Server\Client SDK\ODBC\170\Tools\Binn\
                    C:\Program Files\NVIDIA Corporation\Nsight Compute 2019.5.0\
                    C:\Program Files\OpenCV\opencv-4.1.2\build\bin\Debug
                    C:\Program Files\OpenCV\opencv-4.1.2\build\bin\Release
                    C:\Program Files (x86)\IntelSWTools\compilers_and_libraries\windows\redist\intel64_win\tbb\vc_mt
                    C:\Program Files\Java\jdk-13.0.1\bin
                    C:\Program Files\NVIDIA Corporation\NVSMI
                    PATH_TO_NINJA   C:\Program Files (x86)\Ninja\ninja-win
                    PATH_TO_OPENCV_CONTRIB  C:\Program Files\OpenCV\opencv_contrib-4.1.2
                    PATH_TO_OPENCV_SOURCE   C:\Program Files\OpenCV\opencv-4.1.2
                    PATHEXT .COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH;.MSC
                    PROCESSOR_ARCHITECTURE  AMD64
                    PROCESSOR_IDENTIFIER    Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
                    PROCESSOR_LEVEL 6
                    PROCESSOR_REVISION  3f02
                    PSModulePath    %ProgramFiles%\WindowsPowerShell\Modules
                    C:\Windows\system32\WindowsPowerShell\v1.0\Modules
                    C:\Program Files\Intel\
                    TEMP    C:\Windows\TEMP
                    TMP C:\Windows\TEMP
                    USERNAME    SYSTEM
                    VS2019INSTALLDIR    C:\Program Files (x86)\Microsoft Visual Studio\2019\Community
                    windir  C:\Windows

Attempts to fix so far

I originally had 4 Titan Black GPUs in the machine and wanted to increase training speed on a budget, so I purchased 4 secondhand 980 Tis. After turning on the BIOS flag for 4G addressing and reinstalling the graphics drivers, everything seemed to work great. I am training a custom dataset for a project but have not captured the data yet, although I have successfully trained a single test class in about 1,300 iterations. I wanted to run a test to approximate training time for the full dataset, so I decided to train on the MS COCO dataset. Unfortunately, after about 9,000 iterations, which took about 8 hours, it stopped with the error "CUDA Error: No Error". That's when I discovered issue #4119.

I realized I needed to recompile OpenCV with CUDA because the compute capability of the Titan Black is only 3.5, while for the 980 Ti it is 5.2, so I did so. However, since then (I have run about 5 training sessions), it fails every time after about 10,500 iterations with the error "CUDA Error: Out of memory".

I tried running it with cuda-memcheck, but it simply hangs after loading the network with the message "Loaded: 0.000000 seconds". I tried compiling gdb for Windows but ran into issues with that; I may come back to it later. I settled on attaching to the darknet process via the Visual Studio debugger after compiling Debug versions of OpenCV and darknet and copying over the debug symbols. Unfortunately, since CUDA itself is not compiled with debug symbols, that didn't yield much better results; I was only able to verify a cudaStatus of 2 (cudaErrorMemoryAllocation, i.e. out of memory). The Debug build also runs about 30% slower, so it takes about 12 hours to reach this many iterations.
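For reference, the cuda-memcheck attempt was just the usual training command prefixed with the tool, roughly like this (a sketch; the data/cfg names are the ones given later in this issue):

rem Sketch: run darknet training under cuda-memcheck by prefixing the usual command
cuda-memcheck darknet.exe detector train coco.data cfg/coco-yolov3.cfg darknet53.conv.74 -gpus 0,1,2,3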

Here is what a couple of the exits looked like in the CLI:

Session 1:

10514: 4.122525, 4.392693 avg loss, 0.004000 rate, 9.641000 seconds, 2691584 images
Resizing
480 x 480
CUDA status Error: file: C:\Program Files (x86)\Darknet-53-AlexeyAB\src\dark_cuda.c : cuda_free() : line: 423 : build time: Dec 14 2019 - 21:41:09
CUDA Error: out of memory
CUDA Error: out of memory: No error
Assertion failed: 0, file C:\Program Files (x86)\Darknet-53-AlexeyAB\src\utils.c, line 295

Session 2:

 10514: 4.363719, 4.208237 avg loss, 0.004000 rate, 14.719000 seconds, 2691584 images
Resizing
448 x 448
 try to allocate additional workspace_size = 52.43 MB
 CUDA allocate done!
 try to allocate additional workspace_size = 52.43 MB
 CUDA allocate done!
 try to allocate additional workspace_size = 52.43 MB
 CUDA allocate done!
 try to allocate additional workspace_size = 52.43 MB
 CUDA allocate done!
Loaded: 3.250000 seconds
 Try to set subdivisions=64 in your cfg-file.
CUDA status Error: file: C:\Program Files (x86)\Darknet-53-AlexeyAB\src\dark_cuda.c : cuda_make_array() : line: 357 : build time: Dec 14 2019 - 21:41:09
 CUDA Error: out of memory
Try to set subdivisions=64 in your cfg-file.
CUDA status Error: file: C:\Program Files (x86)\Darknet-53-AlexeyAB\src\dark_cuda.c : cuda_make_array() : line: 357 : build time: Dec 14 2019 - 21:41:09
 Try to CUDA Error: out of memory
set subdivisions=64 in your cfg-file.
CUDA status Error: file: C:\Program Files (x86)\Darknet-53-AlexeyAB\src\dark_cuda.c : cuda_make_array() : line: 357 : build time: Dec 14 2019 - 21:41:09
 CUDA Error: out of memory
Try to set subdivisions=64 in your cfg-file.
CUDA status Error: file: C:\Program Files (x86)\Darknet-53-AlexeyAB\src\dark_cuda.c : cuda_make_array() : line: 357 : build time: Dec 14 2019 - 21:41:09
CUDA Error: out of memory
CUDA Error: out of memory: No error
Assertion failed: 0, file C:\Program Files (x86)\Darknet-53-AlexeyAB\src\utils.c, line 295
^C

Here is the top portion of the Makefile with which I compiled darknet:

GPU=1
CUDNN=1
CUDNN_HALF=0
OPENCV=1
AVX=0
OPENMP=0
LIBSO=0
ZED_CAMERA=0

# set GPU=1 and CUDNN=1 to speedup on GPU
# set CUDNN_HALF=1 to further speedup 3 x times (Mixed-precision on Tensor Cores) GPU: Volta, Xavier, Turing and higher
# set AVX=1 and OPENMP=1 to speedup on CPU (if error occurs then set AVX=0)

DEBUG=0

ARCH= -gencode arch=compute_30,code=sm_30 \
      -gencode arch=compute_35,code=sm_35 \
      -gencode arch=compute_50,code=[sm_50,compute_50] \
      -gencode arch=compute_52,code=[sm_52,compute_52] \
      -gencode arch=compute_61,code=[sm_61,compute_61]

OS := $(shell uname)

I did not alter any values outside that top section. CUDNN_HALF was 1, but I set it to 0 after reading #4119. My GPUs are definitely being utilized, as verified in MSI Afterburner, and what is strange is that when the out-of-memory error occurs, no card is using much more than 2 GB of its 6 GB of VRAM... so memory really should not be an issue. When I rebuilt OpenCV, I set CUDA_GENERATION to Auto so it would generate the correct ARCH, which it did.
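For context, the CUDA-related settings I checked in Cmake-GUI correspond roughly to cache entries like these (a sketch of an equivalent command line, not what I actually ran):

cmake -S "C:/Program Files/OpenCV/opencv-4.1.2" -B "C:/Program Files/OpenCV/opencv-4.1.2/build" ^
  -G "Visual Studio 16 2019" -A x64 ^
  -D WITH_CUDA=ON ^
  -D CUDA_GENERATION=Auto ^
  -D OPENCV_EXTRA_MODULES_PATH="C:/Program Files/OpenCV/opencv_contrib-4.1.2/modules"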

Here's the output from Cmake-GUI for configuration and generation of OpenCV with CUDA, MKL & TBB:

CUDA NVCC target flags: -gencode;arch=compute_52,code=sm_52;-D_FORCE_INLINES
Could not find OpenBLAS include. Turning OpenBLAS_FOUND off
Could not find OpenBLAS lib. Turning OpenBLAS_FOUND off
Could NOT find BLAS (missing: BLAS_LIBRARIES) 
LAPACK requires BLAS
A library with LAPACK API not found. Please specify library location.
VTK is not found. Please set -DVTK_DIR in CMake to VTK build directory, or to VTK install subdirectory with VTKConfig.cmake file
OpenCV Python: during development append to PYTHONPATH: C:/Program Files/OpenCV/opencv-4.1.2/build/python_loader
Caffe:   NO
Protobuf:   NO
Glog:   NO
freetype2:   NO
harfbuzz:    NO
Module opencv_ovis disabled because OGRE3D was not found
No preference for use of exported gflags CMake configuration set, and no hints for include/library directories provided. Defaulting to preferring an installed/exported gflags CMake configuration if available.
Failed to find installed gflags CMake configuration, searching for gflags build directories exported with CMake.
Failed to find gflags - Failed to find an installed/exported CMake configuration for gflags, will perform search for installed gflags components.
Failed to find gflags - Could not find gflags include directory, set GFLAGS_INCLUDE_DIR to directory containing gflags/gflags.h
Failed to find glog - Could not find glog include directory, set GLOG_INCLUDE_DIR to directory containing glog/logging.h
Module opencv_sfm disabled because the following dependencies are not found: Eigen Glog/Gflags
Tesseract:   NO
Processing WORLD modules...
    module opencv_cudev...
    module opencv_core...
    module opencv_cudaarithm...
    module opencv_flann...
    module opencv_imgproc...
    module opencv_ml...
    module opencv_phase_unwrapping...
    module opencv_plot...
    module opencv_quality...
    module opencv_reg...
    module opencv_surface_matching...
    module opencv_cudafilters...
    module opencv_cudaimgproc...
    module opencv_cudawarping...
    module opencv_dnn...
Registering hook 'INIT_MODULE_SOURCES_opencv_dnn': C:/Program Files/OpenCV/opencv-4.1.2/modules/dnn/cmake/hooks/INIT_MODULE_SOURCES_opencv_dnn.cmake
opencv_dnn: filter out cuda4dnn source code
    module opencv_features2d...
    module opencv_fuzzy...
    module opencv_hfs...
    module opencv_imgcodecs...
    module opencv_line_descriptor...
    module opencv_photo...
    module opencv_saliency...
    module opencv_text...
    module opencv_videoio...
    module opencv_xphoto...
    module opencv_calib3d...
    module opencv_cudacodec...
    module opencv_cudafeatures2d...
    module opencv_cudastereo...
    module opencv_datasets...
    module opencv_dnn_superres...
    module opencv_highgui...
    module opencv_objdetect...
    module opencv_rgbd...
    module opencv_shape...
    module opencv_structured_light...
    module opencv_video...
    module opencv_xfeatures2d...
    module opencv_ximgproc...
    module opencv_xobjdetect...
    module opencv_aruco...
    module opencv_bgsegm...
    module opencv_bioinspired...
    module opencv_ccalib...
    module opencv_cudabgsegm...
    module opencv_cudalegacy...
    module opencv_cudaobjdetect...
    module opencv_dnn_objdetect...
    module opencv_dpm...
    module opencv_face...
    module opencv_optflow...
    module opencv_stitching...
    module opencv_tracking...
    module opencv_cudaoptflow...
    module opencv_stereo...
    module opencv_superres...
    module opencv_videostab...
Processing WORLD modules... DONE

General configuration for OpenCV 4.1.2 =====================================
  Version control:               unknown

  Extra modules:
    Location (extra):            C:/Program Files/OpenCV/opencv_contrib-4.1.2/modules
    Version control (extra):     unknown

  Platform:
    Timestamp:                   2019-12-05T02:06:47Z
    Host:                        Windows 10.0.18362 AMD64
    CMake:                       3.16.0
    CMake generator:             Visual Studio 16 2019
    CMake build tool:            C:/Program Files (x86)/Microsoft Visual Studio/2019/Community/MSBuild/Current/Bin/MSBuild.exe
    MSVC:                        1924

  CPU/HW features:
    Baseline:                    SSE SSE2 SSE3
      requested:                 SSE3
    Dispatched code generation:  SSE4_1 SSE4_2 FP16 AVX AVX2 AVX512_SKX
      requested:                 SSE4_1 SSE4_2 AVX FP16 AVX2 AVX512_SKX
      SSE4_1 (15 files):         + SSSE3 SSE4_1
      SSE4_2 (2 files):          + SSSE3 SSE4_1 POPCNT SSE4_2
      FP16 (1 files):            + SSSE3 SSE4_1 POPCNT SSE4_2 FP16 AVX
      AVX (5 files):             + SSSE3 SSE4_1 POPCNT SSE4_2 AVX
      AVX2 (28 files):           + SSSE3 SSE4_1 POPCNT SSE4_2 FP16 FMA3 AVX AVX2
      AVX512_SKX (6 files):      + SSSE3 SSE4_1 POPCNT SSE4_2 FP16 FMA3 AVX AVX2 AVX_512F AVX512_COMMON AVX512_SKX

  C/C++:
    Built as dynamic libs?:      YES
    C++ Compiler:                C:/Program Files (x86)/Microsoft Visual Studio/2019/Community/VC/Tools/MSVC/14.24.28314/bin/Hostx64/x64/cl.exe  (ver 19.24.28314.0)
    C++ flags (Release):         /DWIN32 /D_WINDOWS /W4 /GR  /D _CRT_SECURE_NO_DEPRECATE /D _CRT_NONSTDC_NO_DEPRECATE /D _SCL_SECURE_NO_WARNINGS /Gy /bigobj /Oi  /fp:fast     /EHa /wd4127 /wd4251 /wd4324 /wd4275 /wd4512 /wd4589 /MP48   /MD /O2 /Ob2 /DNDEBUG 
    C++ flags (Debug):           /DWIN32 /D_WINDOWS /W4 /GR  /D _CRT_SECURE_NO_DEPRECATE /D _CRT_NONSTDC_NO_DEPRECATE /D _SCL_SECURE_NO_WARNINGS /Gy /bigobj /Oi  /fp:fast     /EHa /wd4127 /wd4251 /wd4324 /wd4275 /wd4512 /wd4589 /MP48   /MDd /Zi /Ob0 /Od /RTC1 
    C Compiler:                  C:/Program Files (x86)/Microsoft Visual Studio/2019/Community/VC/Tools/MSVC/14.24.28314/bin/Hostx64/x64/cl.exe
    C flags (Release):           /DWIN32 /D_WINDOWS /W3  /D _CRT_SECURE_NO_DEPRECATE /D _CRT_NONSTDC_NO_DEPRECATE /D _SCL_SECURE_NO_WARNINGS /Gy /bigobj /Oi  /fp:fast       /MP48    /MD /O2 /Ob2 /DNDEBUG 
    C flags (Debug):             /DWIN32 /D_WINDOWS /W3  /D _CRT_SECURE_NO_DEPRECATE /D _CRT_NONSTDC_NO_DEPRECATE /D _SCL_SECURE_NO_WARNINGS /Gy /bigobj /Oi  /fp:fast       /MP48  /MDd /Zi /Ob0 /Od /RTC1 
    Linker flags (Release):      /machine:x64  /INCREMENTAL:NO 
    Linker flags (Debug):        /machine:x64  /debug /INCREMENTAL 
    ccache:                      NO
    Precompiled headers:         NO
    Extra dependencies:          opengl32 glu32 cudart_static.lib nppc.lib nppial.lib nppicc.lib nppicom.lib nppidei.lib nppif.lib nppig.lib nppim.lib nppist.lib nppisu.lib nppitc.lib npps.lib cublas.lib cufft.lib -LIBPATH:C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v10.2/lib/x64
    3rdparty dependencies:

  OpenCV modules:
    To be built:                 aruco bgsegm bioinspired calib3d ccalib core cudaarithm cudabgsegm cudacodec cudafeatures2d cudafilters cudaimgproc cudalegacy cudaobjdetect cudaoptflow cudastereo cudawarping cudev datasets dnn dnn_objdetect dnn_superres dpm face features2d flann fuzzy hfs highgui img_hash imgcodecs imgproc line_descriptor ml objdetect optflow phase_unwrapping photo plot quality reg rgbd saliency shape stereo stitching structured_light superres surface_matching text tracking ts video videoio videostab world xfeatures2d ximgproc xobjdetect xphoto
    Disabled:                    gapi
    Disabled by dependency:      -
    Unavailable:                 cnn_3dobj cvv freetype hdf java js matlab ovis python2 python3 sfm viz
    Applications:                tests perf_tests examples apps
    Documentation:               NO
    Non-free algorithms:         YES

  Windows RT support:            NO

  GUI: 
    Win32 UI:                    YES
    OpenGL support:              YES (opengl32 glu32)
    VTK support:                 NO

  Media I/O: 
    ZLib:                        build (ver 1.2.11)
    JPEG:                        build-libjpeg-turbo (ver 2.0.2-62)
    WEBP:                        build (ver encoder: 0x020e)
    PNG:                         build (ver 1.6.37)
    TIFF:                        build (ver 42 - 4.0.10)
    JPEG 2000:                   build (ver 1.900.1)
    OpenEXR:                     build (ver 2.3.0)
    HDR:                         YES
    SUNRASTER:                   YES
    PXM:                         YES
    PFM:                         YES

  Video I/O:
    DC1394:                      NO
    FFMPEG:                      YES (prebuilt binaries)
      avcodec:                   YES (58.54.100)
      avformat:                  YES (58.29.100)
      avutil:                    YES (56.31.100)
      swscale:                   YES (5.5.100)
      avresample:                YES (4.0.0)
    GStreamer:                   NO
    DirectShow:                  YES
    Media Foundation:            YES
      DXVA:                      YES
    Intel Media SDK:             NO

  Parallel framework:            Concurrency

  Trace:                         YES (with Intel ITT)

  Other third-party libraries:
    Intel IPP:                   2019.0.0 Gold [2019.0.0]
           at:                   C:/Program Files/OpenCV/opencv-4.1.2/build/3rdparty/ippicv/ippicv_win/icv
    Intel IPP IW:                sources (2019.0.0)
              at:                C:/Program Files/OpenCV/opencv-4.1.2/build/3rdparty/ippicv/ippicv_win/iw
    Lapack:                      NO
    Eigen:                       NO
    Custom HAL:                  NO
    Protobuf:                    build (3.5.1)

  NVIDIA CUDA:                   YES (ver 10.2, CUFFT CUBLAS FAST_MATH)
    NVIDIA GPU arch:             52
    NVIDIA PTX archs:

  cuDNN:                         NO

  OpenCL:                        YES (NVD3D11)
    Include path:                C:/Program Files/OpenCV/opencv-4.1.2/3rdparty/include/opencl/1.2
    Link libraries:              Dynamic load

  Python (for build):            C:/Users/M/AppData/Local/Programs/Python/Python38/python.exe

  Java:                          
    ant:                         NO
    JNI:                         C:/Program Files/Java/jdk-13.0.1/include C:/Program Files/Java/jdk-13.0.1/include/win32 C:/Program Files/Java/jdk-13.0.1/include
    Java wrappers:               NO
    Java tests:                  NO

  Install to:                    C:/Program Files/OpenCV/opencv-4.1.2/build/install
-----------------------------------------------------------------

Configuring done
Generating done

...all projects built successfully in the VS solution.

Here's the Cmake-GUI output for darknet:

Autodetected CUDA architecture(s):  5.2 5.2 5.2 5.2
Building with CUDA flags: -gencode;arch=compute_52,code=sm_52
Your setup does not supports half precision (it requires CC >= 7.5)
PThreads_windows_DLL_DIR: C:/Program Files (x86)/Darknet-53-AlexeyAB/3rdparty/pthreads/include/../bin
CMAKE_CUDA_FLAGS: -gencode arch=compute_52,code=sm_52 -Wno-deprecated-declarations -Xcompiler="/wd4013,/wd4018,/wd4028,/wd4047,/wd4068,/wd4090,/wd4101,/wd4113,/wd4133,/wd4190,/wd4244,/wd4267,/wd4305,/wd4477,/wd4996,/wd4819,/fp:fast,/DGPU,/DCUDNN,/DOPENCV" -D_WINDOWS -Xcompiler="/W3 /GR /EHsc"
Configuring done
Generating done

Again, all projects build successfully for both Debug and Release in the VS solution.

Here is the output of nvidia-smi:

Sat Dec 14 21:47:29 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 441.41       Driver Version: 441.41       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 980 Ti WDDM  | 00000000:03:00.0  On |                  N/A |
| 22%   43C    P8    17W / 250W |    935MiB /  6144MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 980 Ti WDDM  | 00000000:04:00.0 Off |                  N/A |
| 22%   39C    P8    14W / 250W |     46MiB /  6144MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 980 Ti WDDM  | 00000000:81:00.0 Off |                  N/A |
| 22%   38C    P8    15W / 250W |     46MiB /  6144MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 980 Ti WDDM  | 00000000:82:00.0 Off |                  N/A |
| 22%   33C    P8    15W / 250W |     46MiB /  6144MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0       988    C+G   ...o\2019\Community\Common7\IDE\devenv.exe N/A      |
|    0      4000    C+G   ...hell.Experiences.TextInput.InputApp.exe N/A      |
|    0      5960    C+G   ...5n1h2txyewy\StartMenuExperienceHost.exe N/A      |
|    0      8512    C+G   ...dows.Cortana_cw5n1h2txyewy\SearchUI.exe N/A      |
|    0      8860    C+G   ...t_cw5n1h2txyewy\ShellExperienceHost.exe N/A      |
|    0      9924    C+G   Insufficient Permissions                   N/A      |
|    0     11508    C+G   ...ogram Files\Mozilla Firefox\firefox.exe N/A      |
|    0     14752    C+G   ...ogram Files\Mozilla Firefox\firefox.exe N/A      |
|    0     18232    C+G   ...ogram Files\Mozilla Firefox\firefox.exe N/A      |
|    0     19740    C+G   ...R.x86\ServiceHub.ThreadedWaitDialog.exe N/A      |
+-----------------------------------------------------------------------------+

...and here is the output of nvcc --version:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:32:27_Pacific_Daylight_Time_2019
Cuda compilation tools, release 10.2, V10.2.89

There are bad.list and bad_label.list files present. bad.list has 625 entries, bad_label.list is much shorter:

[data\obj\coco\000000078274.txt "Wrong annotation: class_id = 62. But class_id should be [from 0 to 0]" 
data\obj\coco\000000122667.txt "Wrong annotation: class_id = 38. But class_id should be [from 0 to 0]" 
data\obj\coco\000000507551.txt "Wrong annotation: class_id = 35. But class_id should be [from 0 to 0]" 
data\obj\coco\000000116064.txt "Wrong annotation: class_id = 57. But class_id should be [from 0 to 0]" 
data\obj\coco\000000466734.txt "Wrong annotation: class_id = 67. But class_id should be [from 0 to 0]" 
data\obj\coco\000000220889.txt "Wrong annotation: class_id = 50. But class_id should be [from 0 to 0]" 
data\obj\coco\000000281134.txt "Wrong annotation: class_id = 14. But class_id should be [from 0 to 0]" 
data\obj\coco\000000147753.txt "Wrong annotation: class_id = 44. But class_id should be [from 0 to 0]" 
data\obj\coco\000000006562.txt "Wrong annotation: class_id = 34. But class_id should be [from 0 to 0]" 
data\obj\coco\000000539173.txt "Wrong annotation: class_id = 22. But class_id should be [from 0 to 0]" 
data\obj\coco\000000269116.txt "Wrong annotation: class_id = 41. But class_id should be [from 0 to 0]" 
data\obj\coco\000000242744.txt "Wrong annotation: class_id = 24. But class_id should be [from 0 to 0]" 
data\obj\coco\000000529886.txt "Wrong annotation: class_id = 54. But class_id should be [from 0 to 0]" 
data\obj\coco\000000311822.txt "Wrong annotation: class_id = 56. But class_id should be [from 0 to 0]" 
data\obj\coco\000000091394.txt "Wrong annotation: class_id = 35. But class_id should be [from 0 to 0]" 
data\obj\coco\000000389490.txt "Wrong annotation: class_id = 35. But class_id should be [from 0 to 0]" 
data\obj\coco\000000381854.txt "Wrong annotation: class_id = 15. But class_id should be [from 0 to 0]" 
data\obj\coco\000000548014.txt "Wrong annotation: class_id = 27. But class_id should be [from 0 to 0]" 
data\obj\coco\000000505979.txt "Wrong annotation: class_id = 16. But class_id should be [from 0 to 0]" 
data\obj\coco\000000353991.txt "Wrong annotation: class_id = 38. But class_id should be [from 0 to 0]" 
data\obj\coco\000000025193.txt "Wrong annotation: class_id = 66. But class_id should be [from 0 to 0]" 
data\obj\coco\000000411703.txt "Wrong annotation: class_id = 51. But class_id should be [from 0 to 0]" 
data\obj\coco\000000555263.txt "Wrong annotation: class_id = 63. But class_id should be [from 0 to 0]" 
data\obj\coco\000000412599.txt "Wrong annotation: class_id = 61. But class_id should be [from 0 to 0]" 
data\obj\coco\000000427118.txt "Wrong annotation: class_id = 30. But class_id should be [from 0 to 0]" 
data\obj\coco\000000237720.txt "Wrong annotation: class_id = 77. But class_id should be [from 0 to 0]" 
data\obj\coco\000000028019.txt "Wrong annotation: class_id = 22. But class_id should be [from 0 to 0]" ]

I notice these files are mentioned at the end of #4119 so I will look at that issue again and try to determine next steps - I guess removing those files and restarting?

I'm not sure what to try if that doesn't solve the issue. I'm crossing my fingers at this point that I will be able to train my own dataset in under 10,500 iterations, but it's quite likely I'll need more, and I have a January 11th deadline for having the network trained and inference working. The VS object dumps don't reveal anything special, and the call stacks for the lines of code mentioned in the errors don't lead to any insights either:

cuda_free():

darknet.exe!cuda_free(float * x_gpu) Line 423
    at C:\Program Files (x86)\Darknet-53-AlexeyAB\src\dark_cuda.c(423)
darknet.exe!resize_network(network * net, int w, int h) Line 492
    at C:\Program Files (x86)\Darknet-53-AlexeyAB\src\network.c(492)
darknet.exe!train_detector(char * datacfg, char * cfgfile, char * weightfile, int * gpus, int ngpus, int clear, int dont_show, int calc_map, int mjpeg_port, int show_imgs) Line 193
    at C:\Program Files (x86)\Darknet-53-AlexeyAB\src\detector.c(193)
darknet.exe!run_detector(int argc, char * * argv) Line 1514
    at C:\Program Files (x86)\Darknet-53-AlexeyAB\src\detector.c(1514)
darknet.exe!main(int argc, char * * argv) Line 474
    at C:\Program Files (x86)\Darknet-53-AlexeyAB\src\darknet.c(474)
[External Code]

check_error():

darknet.exe!check_error(cudaError status) Line 63
    at C:\Program Files (x86)\Darknet-53-AlexeyAB\src\dark_cuda.c(63)
darknet.exe!check_error_extended(cudaError status, const char * file, int line, const char * date_time) Line 93
    at C:\Program Files (x86)\Darknet-53-AlexeyAB\src\dark_cuda.c(93)
darknet.exe!cuda_free(float * x_gpu) Line 427
    at C:\Program Files (x86)\Darknet-53-AlexeyAB\src\dark_cuda.c(427)
darknet.exe!resize_yolo_layer(layer * l, int w, int h) Line 109
    at C:\Program Files (x86)\Darknet-53-AlexeyAB\src\yolo_layer.c(109)
darknet.exe!resize_network(network * net, int w, int h) Line 531
    at C:\Program Files (x86)\Darknet-53-AlexeyAB\src\network.c(531)
darknet.exe!train_detector(char * datacfg, char * cfgfile, char * weightfile, int * gpus, int ngpus, int clear, int dont_show, int calc_map, int mjpeg_port, int show_imgs) Line 193
    at C:\Program Files (x86)\Darknet-53-AlexeyAB\src\detector.c(193)
darknet.exe!run_detector(int argc, char * * argv) Line 1514
    at C:\Program Files (x86)\Darknet-53-AlexeyAB\src\detector.c(1514)
darknet.exe!main(int argc, char * * argv) Line 474
    at C:\Program Files (x86)\Darknet-53-AlexeyAB\src\darknet.c(474)
[External Code]

Links including a few screenshots:

chart-1_1 chart-2_1 chart-3_1 chart-4_2 error-1_3 cuda-memcheck-1

nvidia-smi.txt nvcc --version.txt application-exit_1.txt application-exit_2.txt check_error-stack_3.txt cuda_free_stack.txt

ryamldess commented 4 years ago

So I wrote a batch script to remove all of the listed images and their associated labels from the dataset and attempted to re-train, being careful to edit the training list as well. Unfortunately, this now results in execution halting every time a few seconds after the network is initialized, while adding new files to bad.list. It seems to be slowly corrupting the entire dataset. I'm going to start over, re-downloading the entire dataset directly to the computer and unzipping it directly in the destination folder. I'm using a Java project called cocotoyolo.jar to generate the labels: https://bitbucket.org/yymoto/coco-to-yolo/src/master/
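A minimal sketch of that kind of cleanup script (hypothetical, not my actual script; it assumes bad.list holds one image path per line and that each label is the matching .txt next to the image):

@echo off
rem Delete every image listed in bad.list plus its matching YOLO .txt label.
for /f "usebackq delims=" %%f in ("bad.list") do (
    if exist "%%f" del "%%f"
    if exist "%%~dpnf.txt" del "%%~dpnf.txt"
)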

ryamldess commented 4 years ago

In the meantime I'm attempting to train COCO on another machine with a pair of Titan Blacks. Unfortunately, two Titan Blacks are much slower than four 980 Tis, so it will take about 20 hours to get to the same number of iterations.

AlexeyAB commented 4 years ago

@ryamldess Hi,

After turning on the BIOS flag for 4G addressing and reinstalling graphics drivers, everything seemed to work great.

  1. What do you mean? What is the flag for 4G addressing?

[data\obj\coco\000000078274.txt "Wrong annotation: class_id = 62. But class_id should be [from 0 to 0]"

  2. It means that you didn't set classes= in each of the [yolo] layers, as described there: https://github.com/AlexeyAB/darknet#how-to-train-to-detect-your-custom-objects (see the cfg sketch after this list)

try to allocate additional workspace_size = 52.43 MB CUDA allocate done! Loaded: 3.250000 seconds Try to set subdivisions=64 in your cfg-file. CUDA status Error: file: C:\Program Files (x86)\Darknet-53-AlexeyAB\src\dark_cuda.c : cuda_make_array() : line: 357 : build time: Dec 14 2019 - 21:41:09 CUDA Error: out of memory

  3. Did you try to set subdivisions=64 in your cfg-file, as described in this error message?
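For reference, the pairing described in that readme section is: set classes= in each of the three [yolo] layers, and set filters=(classes + 5) * 3 in the [convolutional] layer immediately before each of them. A minimal cfg sketch for the default 80 COCO classes:

# last convolutional layer before each [yolo]:
# filters = (classes + 5) * 3 = (80 + 5) * 3 = 255
[convolutional]
size=1
stride=1
pad=1
filters=255
activation=linear

[yolo]
classes=80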

ryamldess commented 4 years ago
  1. What do you mean? What is the flag for 4G addressing?

4G = addressing above the 4 GB address space (the "Above 4G Decoding" BIOS option), which is required for GPUs that use 64-bit addressing. Maxwell has a 64-bit architecture, so I had to change this setting; otherwise the machine boots directly into the BIOS with a "PCIe out of resources" error. The Titan Blacks apparently use 32-bit memory addressing, so it hadn't been an issue before the upgrade.

  2. It means that you didn't set classes= in each of the [yolo] layers, as described there: https://github.com/AlexeyAB/darknet#how-to-train-to-detect-your-custom-objects

I'm using a copy of the default yolov3.cfg, which I have renamed coco-yolov3.cfg, in which all classes= values are set to 80. I changed batch and subdivisions to 64. I'll post it when I'm back on that machine; I'm responding from a different computer at the moment. While my eventual goal is to train my own dataset (which will probably have about 54 classes), to reiterate: I am currently training the MS COCO dataset as a test, to root out issues like this one and the general workflow in advance, and to gauge roughly how long training my own dataset will take. So right now I'm training COCO and using the default 80 classes.

Did you try to set subdivisions=64 in your cfg-file as described in this error message?

Yes, as stated above, I'm using 64 for both batch and subdivisions.

  • You shouldn't compile OpenCV with CUDA, because OpenCV is used only for data augmentation and Darknet uses only CPU functions from OpenCV. Just use the standard OpenCV for Windows, which you can download from: https://opencv.org/releases/

Doesn't this mean I will train on the CPU? I built OpenCV without CUDA in an earlier attempt and GPU acceleration didn't work - training was extraordinarily slow on the CPU alone. It took 3 days to train a single custom class; I don't have that kind of time. I have lots of GPUs, so I'm going to use them.

So Cmake-GUI doesn't use the Makefile? I didn't realize that. I used Cmake-GUI and VS Community 2019 as stated in the initial post.

  • Did you compile Darknet by using default /build/darknet/darknet.sln or by using Cmake + MSVS2019?

I used the solution generated by Cmake-GUI and VS 2019.

I tried compiling with cuDNN, but Cmake can't find cuDNN no matter what I try and disables it during configuration. I copied the appropriate files, added the cuDNN folders to my PATH, and defined CUDA_PATH; during configuration Cmake-GUI finds cuDNN and prints its version, but then says it can't find it.
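(For reference, the usual manual cuDNN install copies the files into the CUDA toolkit tree, roughly like the sketch below; the cuda\ prefix is the folder the cuDNN archive unpacks to, and the DLL name depends on the cuDNN version.)

copy cuda\bin\cudnn64_7.dll "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\bin"
copy cuda\include\cudnn.h "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\include"
copy cuda\lib\x64\cudnn.lib "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\lib\x64"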

  • Can you attach your cfg-file in zip?

Yes; will do shortly.

  • What command do you use for training?

I'll post that with my config when I'm back at that other machine in a few minutes.

AlexeyAB commented 4 years ago

@ryamldess

It means that you didn't set classes= in each of the [yolo] layers, as described there: https://github.com/AlexeyAB/darknet#how-to-train-to-detect-your-custom-objects

I'm using a copy of the default yolov3.cfg which I have renamed coco-yolov3.cfg

But this error message says that you use classes=1


Doesn't this mean I will train against the CPU? I built OpenCV without CUDA in an earlier attempt and GPU acceleration didn't work - training was extraordinarily slow just on the CPU. It took 3 days to train a single custom class; I don't have that kind of time. I have lots of GPUS, so I'm going to use them.

No.

Your training is slow because you didn't compile Darknet with GPU, cuDNN, and OpenCV, or because you are using a slow HDD.


So Cmake-GUI doesn't use the Makefile? I didn't realize that. I used Cmake-GUI and VS Community 2019 as stated in the initial post.

Cmake doesn't use your Makefile at all. Also, Cmake generates its own Makefile only on Linux; on Windows it generates darknet.sln instead of a Makefile.
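For example, a typical CMake flow on Windows looks roughly like this (a sketch, run from the darknet source directory; the build folder name is just an example):

cmake -S . -B build -G "Visual Studio 16 2019" -A x64
rem CMake writes the Visual Studio solution into build\ ; open it in VS, or build from the command line:
cmake --build build --config Release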


but during configuration Cmake-GUI finds cuDNN and outputs the version, but then says it can't find it.

Can you show a screenshot?

image

ryamldess commented 4 years ago

GitHub doesn't support attaching .cfg files, so I'm just going to paste it here as a code block below. The command I use is:

darknet detector train coco.data cfg/coco-yolov3.cfg darknet53.conv.74 -gpus 0,1,2,3

Config file:

[net]
# Testing
batch=64
subdivisions=64
# Training
# batch=64
# subdivisions=16
width=416
height=416
channels=3
momentum=0.9
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1

learning_rate=0.001
burn_in=1000
#max_batches = 500200
max_batches = 64000
policy=steps
steps=400000,450000
scales=.1,.1

[convolutional]
batch_normalize=1
filters=32
size=3
stride=1
pad=1
activation=leaky

# Downsample

[convolutional]
batch_normalize=1
filters=64
size=3
stride=2
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=32
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=64
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

# Downsample

[convolutional]
batch_normalize=1
filters=128
size=3
stride=2
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=64
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=64
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

# Downsample

[convolutional]
batch_normalize=1
filters=256
size=3
stride=2
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

# Downsample

[convolutional]
batch_normalize=1
filters=512
size=3
stride=2
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

# Downsample

[convolutional]
batch_normalize=1
filters=1024
size=3
stride=2
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=1024
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=512
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=1024
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=512
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=1024
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=512
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=1024
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

######################

[convolutional]
batch_normalize=1
filters=512
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=1024
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=1024
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=1024
activation=leaky

[convolutional]
size=1
stride=1
pad=1
filters=255
activation=linear

[yolo]
mask = 6,7,8
anchors = 10,13,  16,30,  33,23,  30,61,  62,45,  59,119,  116,90,  156,198,  373,326
classes=80
num=9
jitter=.3
ignore_thresh = .7
truth_thresh = 1
random=1

[route]
layers = -4

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[upsample]
stride=2

[route]
layers = -1, 61

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=512
activation=leaky

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=512
activation=leaky

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=512
activation=leaky

[convolutional]
size=1
stride=1
pad=1
filters=255
activation=linear

[yolo]
mask = 3,4,5
anchors = 10,13,  16,30,  33,23,  30,61,  62,45,  59,119,  116,90,  156,198,  373,326
classes=80
num=9
jitter=.3
ignore_thresh = .7
truth_thresh = 1
random=1

[route]
layers = -4

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[upsample]
stride=2

[route]
layers = -1, 36

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=256
activation=leaky

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=256
activation=leaky

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=256
activation=leaky

[convolutional]
size=1
stride=1
pad=1
filters=255
activation=linear

[yolo]
mask = 0,1,2
anchors = 10,13,  16,30,  33,23,  30,61,  62,45,  59,119,  116,90,  156,198,  373,326
classes=80
num=9
jitter=.3
ignore_thresh = .7
truth_thresh = 1
random=1
ryamldess commented 4 years ago
cmake-gui-darknet-1
ryamldess commented 4 years ago

Here are my CUDA settings for building OpenCV in Cmake-GUI; so you're saying these should all be unchecked?

cmake-gui-opencv-cuda-1

When I do that, GPU utilization on all GPUs is a few percent, and CPU utilization is 100%.

AlexeyAB commented 4 years ago

When I do that, GPU utilization on all GPUs is a few percent, and CPU utilization is 100%.

You are doing something wrong.


I tried compiling with cuDNN, but Cmake can't find cuDNN no matter what I try and disables it during configuration. I copied the appropriate files, added the cuDNN folders to my PATH and defined CUDA_PATH, but during configuration Cmake-GUI finds cuDNN and outputs the version, but then says it can't find it.

You must compile Darknet with cuDNN. Also, show a screenshot of where it says it can't find it.


Here are my CUDA settings for building OpenCV in Cmake-GUI; so you're saying these should all be unchecked?

Download default OpenCV from the link that I gave.


github doesn't support .cfg files, so I'm just going to paste it here as a code block below. The command I use is:

You can put the file in a zip archive and attach the zip: https://github.com/AlexeyAB/darknet/issues/4533#issuecomment-566742394

Can you attach your cfg-file in zip?


Set the Grouped view and show a screenshot in the form I showed here: https://github.com/AlexeyAB/darknet/issues/4533#issuecomment-566756370

cmake-gui-darknet-1
ryamldess commented 4 years ago

So I just finished setting up this machine (the one with the 980 Tis) to train on the 2014 dataset, which is a little more than half the size of the 2017 dataset. I've also switched back to a Release build of darknet, since the Debug version didn't tell me anything the CLI didn't, so it should reach the same number of iterations in about 6 hours.

As I mentioned, I'm training the 2017 dataset on another machine with a pair of Titan Blacks; and I am copying the 2017 dataset to a third machine as well, also with 2x Titan Blacks to do the same. Both of those machines should reach ~10,500 iterations in a little under a day.

Meanwhile, you've seriously confused me regarding OpenCV, so I guess I need to go back to school on that: every other blog post or article I've read online says to compile it against CUDA in order to enable multi-GPU acceleration, and that's the only thing that has worked for me. In addition, I can't get Cmake-GUI to recognize and enable cuDNN, even though it can find it and identify a valid version, so it's possible that's the issue. If the 4x 980 Ti machine trains successfully, that will mean there is some issue with my copy of the 2017 dataset; if the other machines train successfully, that will mean there is some issue with the 4x 980 Ti machine. If they all fail, then I will look at the OpenCV compilation again. They are all using the same build of OpenCV, which I compiled earlier on the 4x 980 Ti machine (when it still had the 4x Titan Blacks) and deployed to the various other machines; however, I have since recompiled OpenCV on the 980 Ti machine itself while attempting to fix this issue.

Training a single custom class of my own succeeded on 6 different machines running a variety of cards (4x Titan Blacks, 4x 980 Tis, 2x Titan Blacks, 2x 1060s, 1x 980, 1x 970, and 1x 780) in about 1,300 iterations; it is only on the 2017 COCO dataset, past 10,500 iterations, that this issue occurs.

Another test I can try is to train against VOC.

A 4th machine, which is my main workstation, has a pair of 1060s in it. I was never able to get OpenCV to compile on that machine, which still had Windows 7 on it (all of the other machines have Windows 10). I finally upgraded the 2x 1060 machine to Windows 10 today, so I may attempt a from-scratch compilation of OpenCV on it. I only have one Linux box, running CentOS 7, which I use as a local source code repo; unfortunately, being a server, it has a $30 GPU in it. However, I own a copy of VMWare Workstation, so I could also try to get everything working in a VM running some flavor of Linux if it comes to that, but I don't think that should be necessary. Before I do any of that, though, I'm using the 2x 1060 machine to capture the data for my custom classes, which I need to get done soon, so it may be a couple of days before I can attempt the OpenCV recompile on that machine.

AlexeyAB commented 4 years ago

because every other blog post or article I've read online says to compile it against CUDA in order to enable multi-GPU acceleration

Can you give a link & quote?

AlexeyAB commented 4 years ago

However, I own a copy of VMWare Workstation

You should use VMWare ESXi if you want to use a GPU in a virtual machine.

I was never able to get OpenCV to compile on that machine, which still had Windows 7 on it (all of the other machines have Windows 10).

You shouldn't compile OpenCV. Download already compiled OpenCV from the link that I gave.

ryamldess commented 4 years ago

https://jamesbowley.co.uk/build-opencv-4-0-0-with-cuda-10-0-and-intel-mkl-tbb-in-windows/

It's right in the title:

"Accelerating OpenCV 4 – build with CUDA 10.0, Intel MKL + TBB and python bindings in Windows"

ryamldess commented 4 years ago

https://www.pyimagesearch.com/2016/07/11/compiling-opencv-with-cuda-support/

Again, in the title:

"Compiling OpenCV with CUDA support"

ryamldess commented 4 years ago

https://pterneas.com/2018/11/02/opencv-cuda/

"HOW TO BUILD OPENCV FOR WINDOWS WITH CUDA"

These are all results of a Google search for 'building opencv with gpu support'.

ryamldess commented 4 years ago

Anyway, there might also be an issue with the wonky location where I'm trying to compile things. I have OpenCV under Program Files, which is probably a bad idea. I think there might be a write-permission problem with cmake/FindCUDNN.cmake, as it wouldn't save even when opened with admin permissions. I went ahead and compiled on the 2x 1060 machine anyway, since I'm not ready to capture data yet and have an errand to run, and it seems to be compiling now since the Win10 upgrade. I'll run a training test on the 2x 1060 machine as well, and if it fails I'll move OpenCV to a more conventional location and try again.

Googling around, it seems a lot of people on a variety of OSes have issues getting OpenCV to find cuDNN.

ryamldess commented 4 years ago

If you have an alternative guide to building OpenCV that you recommend, I'd love to have a link; that first one from James Bowley was the first one that worked for me and allowed me to get GPU acceleration working.

AlexeyAB commented 4 years ago

@ryamldess There is nothing about Darknet in these links. Darknet uses its own CUDA functions or the cuDNN library instead of OpenCV for neural network acceleration. I implemented only data augmentation by using OpenCV, on the CPU.

Use the link which I gave earlier: https://opencv.org/releases/ Just download and unpack it; you shouldn't compile it: https://sourceforge.net/projects/opencvlibrary/files/4.1.2/opencv-4.1.2-vc14_vc15.exe/download

You should compile Darknet with cuDNN, not OpenCV with CUDA/cuDNN.

ryamldess commented 4 years ago

Ah, okay, I see my mistake now. I downloaded the pre-compiled version of OpenCV to which you directed me, changed my environment vars, and recompiled darknet. It's now running on this machine (the 2x 1060 workstation) with GPU acceleration. I'll post back an update later; the other tests will probably fail, but I'm leaving them running on the other workstations just to see how they turn out. I read most of your readme on GitHub, but somehow didn't grok from it that CUDA and cuDNN support are built into darknet. The actual steps are much simpler than I imagined; I thought I had to compile OpenCV with CUDA and cuDNN and then compile darknet.
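In concrete terms, the environment-variable change amounts to something like this (a sketch, assuming the package is unpacked to C:\opencv; the actual path on my machines differs):

rem Point CMake's find_package(OpenCV) at the prebuilt package (machine-wide; run from an admin prompt)
setx OpenCV_DIR "C:\opencv\build" /M
rem Also add C:\opencv\build\x64\vc15\bin to PATH so opencv_world412.dll is found at runtime.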

Out of curiosity, is that version of OpenCV you linked me to built with MKL and TBB? I noticed a big performance improvement when I added those in.

ryamldess commented 4 years ago

So just for confirmation: the 4x 980 Ti machine again failed with a CUDA out-of-memory error around iteration 9,900, as expected. I'll work on that machine some time tomorrow, as it is the fastest machine I have available for training, and try again. The 2x 1060 machine is training with the correct versions of OpenCV and darknet; I'll let that run overnight, which should get it well past 10,000 iterations.

ryamldess commented 4 years ago

Looking back at the readme, the requirements section doesn't specify whether to build OpenCV with CUDA or cuDNN; it also never mentions that they are built into darknet itself. In retrospect, it's clear from the Cmake variables that this is the case, but you might want to add that information just to make it explicit. Your documentation is far superior to what is available on the original pjreddie repo you forked from or on the darknet site, but this particular detail is vague, and an additional sentence or two could save people like me, who are jumping into the deep end, a lot of time. In my case, I have an understanding of CNNs from classes I've taken, but the exercises were all in Python in a Jupyter notebook and didn't use YOLO, OpenCV, or CUDA, of which I had no prior knowledge.

Just a suggestion; thanks for putting the fork together. I might put together a small quick start guide of my own for Windows users.

ryamldess commented 4 years ago

Okay, so interestingly, one of the 2x Titan Black machines, using the 2017 COCO training set with my incorrectly compiled version of OpenCV, has gone past the usual fail point of ~10,000 iterations on the faster machine and is still slowly trudging along - I started it at 9:04 AM PST yesterday and it's at about 14,500 iterations now.

This would seem to indicate that there could possibly be another root cause at play on the 4x 980 Ti machine, but I won't know until I deploy the correct version of OpenCV you directed me to download to that machine and try again. It kind of makes sense that this wouldn't really be an issue though, since darknet as you said uses OpenCV only minimally and is compiled with CUDA and cuDNN; so any CUDA or cuDNN references in OpenCV would likely just be ignored. That said, perhaps there might be duplicate calls in some cases - I'm not familiar enough with either codebase to know for sure.

Here's the current state of play:

4x 980 Ti workstation: 2014 COCO, wrongly compiled OpenCV, failed again at ~10,000 iterations

2x Titan Black workstation 1: 2017 COCO, wrongly compiled OpenCV, still running, ~14,500 iterations so far

2x 1060 workstation: 2017 COCO, correct OpenCV, running, currently at ~9,500 iterations

2x Titan Black workstation 2: files transferring, not running

Next I'll deploy the correct OpenCV binary to the 4x 980 Ti machine and try again; if it fails, then the OpenCV binary is not the likely root cause.

ryamldess commented 4 years ago

Okay, I've started training on the 4x 980 Ti machine with the new OpenCV binary, so new update is:

4x 980 Ti workstation: 2014 COCO, correct OpenCV, started training

2x Titan Black workstation 1: 2017 COCO, wrongly compiled OpenCV, still running, ~15,500 iterations so far

2x 1060 workstation: 2017 COCO, correct OpenCV, passed the ~10,500 iteration threshold and still running

2x Titan Black workstation 2: files transferring, not running

I'll post again in 8 hours or so when the 4x 980 Ti box passes the 10,500 iteration threshold or if any box fails.

AlexeyAB commented 4 years ago

If something fails, then show the last 10 lines of the files bad.list and bad_label.list, and the error message.

ryamldess commented 4 years ago

So good news and bad news. The good news is that everything seems to be running fine on my other workstations with either version of OpenCV - either the one I mistakenly compiled with CUDA and cuDNN or the correct official download.

The bad news is that even with the correct binary, training failed again on the 4x 980 Ti workstation (my fastest machine for training), which means something else is the culprit. I've attached a screenshot below with the last 40 lines or so of output from the CLI.

The next things I will try are removing the other version of OpenCV on this machine, just to disambiguate, and recompiling darknet on it, although my darknet build configuration hasn't really changed. The error is once again emanating from line 423 of dark_cuda.c in cuda_free(), probably with a status of 2, but that doesn't really tell us anything about the cause. It's coming from CUDA itself, which is not compiled with debug symbols, so there's no way to get a useful stack trace out of it.
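
For anyone else hitting this, the check in question is essentially the standard CUDA status-check pattern. The snippet below is only a sketch of the shape of it (not the actual dark_cuda.c source), but it shows why the message is so uninformative - a status of 2 is cudaErrorMemoryAllocation, which the runtime describes simply as "out of memory" via cudaGetErrorString():

```c
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Sketch of a typical CUDA status check, NOT the actual dark_cuda.c source.
 * A cudaError_t value of 2 is cudaErrorMemoryAllocation, which the runtime
 * describes as "out of memory". */
static void check_error(cudaError_t status, const char *where)
{
    if (status != cudaSuccess) {
        fprintf(stderr, "CUDA Error: %s (status %d, in %s)\n",
                cudaGetErrorString(status), (int)status, where);
        exit(EXIT_FAILURE);
    }
}

static void cuda_free_sketch(float *x_gpu)
{
    /* cudaFree() can also return errors deferred from earlier asynchronous
     * kernel launches, so the call that reports "out of memory" is not
     * necessarily the call that caused it. */
    check_error(cudaFree(x_gpu), "cuda_free_sketch");
}
```

Because cudaFree() can surface errors left over from earlier asynchronous launches, the line it is reported from isn't necessarily where the memory actually ran out.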

What is even more mystifying is that this time there is no bad.list file generated - for whatever reason the 2014 dataset never seems to have issues.

I will leave the 2x 1060 machine running for a while; it's somewhere around 15,000 iterations. One of the 2x Titan machines has gotten to about 21,000 iterations; I think it's probably fine, so I'm going to shut it down and focus back on the problem workstation that I would like to get working.

It shouldn't be an issue, but just to throw it out there: as I mentioned, I bought these cards secondhand through eBay - two of them are Dell sub-variants, one is reference and one is an HP sub-variant. They all have identical specs and look identical as well, so I don't think it matters, but I'm noting it in case there's any remote possibility it's relevant. That does give me an idea, though: I'll try training on just the first two cards, which are the two Dell sub-variants. Ultimately I need all 4 cards working, but if that causes different behavior it will tell us something at least.

error_4
ryamldess commented 4 years ago

For reference, here is a screenshot of the 2x Titan Black machine that doesn't seem to experience the issue while training the 2017 COCO dataset, even with the incorrectly compiled version of OpenCV. I stopped training around 21,000 iterations:

no_error-2x_titan_black
ryamldess commented 4 years ago

Here is a screenshot of the 2x 1060 machine, training 2017 COCO with the correct version of OpenCV, and also seemingly fine. I stopped training around 17,000 iterations:

no_error-2x_1060
AlexeyAB commented 4 years ago

There is no correctly or incorrectly compiled OpenCV. This is not related to OpenCV.

ryamldess commented 4 years ago

So I re-compiled darknet last night and noticed that, although I had switched to the correct version of OpenCV, CMake was still pointing to the old location. Unfortunately, after re-compiling and running again overnight, the error is still occurring:

error_5

> This is not related to OpenCV.

I couldn't agree more, but I had to verify that in order to eliminate it from the scope of troubleshooting.

I will now try to run training with various configurations of GPUs. I'm going to start with the two Dell subvariants, just on a hunch. They definitely have different BIOS versions, though I'm not sure whether, or to what degree, that would be an issue.

Here's a Speccy dump of my cards. You can see they all have different BIOS versions, including the two Dell subvendor cards; in fact, the cards closest to each other in BIOS version are the 2nd Dell card and the reference card. (Ignore the amount of memory - it's a known bug in Speccy; actual memory on all cards is 6144 MB, as verified in GPU-Z.)

        NVIDIA GeForce GTX 980 Ti
            Manufacturer    NVIDIA
            Model   GeForce GTX 980 Ti
            Device ID   10DE-17C8
            Revision    A2
            Subvendor   Dell (1028)
            Current Performance Level   Level 0
            Current GPU Clock   999 MHz
            Current Memory Clock    3304 MHz
            Current Shader Clock    3304 MHz
            Voltage 0.993 V
            Current Performance Level   Level 0
            Current GPU Clock   1101 MHz
            Current Memory Clock    3304 MHz
            Current Shader Clock    3304 MHz
            Voltage 0.993 V
            Technology  28 nm
            Bus Interface   PCI Express x16
            Temperature 83 °C
            SLI Disabled
            Driver version  26.21.14.4141
            BIOS Version    84.00.4b.00.06
            Physical Memory 2047 MB
            Virtual Memory  2048 MB
                Count of performance levels : 1
                        Level 1 - "Perf Level 0"
                            GPU Clock   1012 MHz
                            Shader Clock    3304 MHz
                Count of performance levels : 1
                        Level 1 - "Perf Level 0"
                            GPU Clock   1139 MHz
                            Shader Clock    3304 MHz
        NVIDIA GeForce GTX 980 Ti
            Manufacturer    NVIDIA
            Model   GeForce GTX 980 Ti
            Device ID   10DE-17C8
            Revision    A2
            Subvendor   Dell (1028)
            Current Performance Level   Level 0
            Current GPU Clock   999 MHz
            Current Memory Clock    3304 MHz
            Current Shader Clock    3304 MHz
            Voltage 0.993 V
            Current Performance Level   Level 0
            Current GPU Clock   1101 MHz
            Current Memory Clock    3304 MHz
            Current Shader Clock    3304 MHz
            Voltage 0.993 V
            Technology  28 nm
            Bus Interface   PCI Express x16
            Temperature 83 °C
            SLI Disabled
            Driver version  26.21.14.4141
            BIOS Version    84.00.32.00.02
            Physical Memory 2047 MB
            Virtual Memory  2048 MB
                Count of performance levels : 1
                        Level 1 - "Perf Level 0"
                            GPU Clock   1012 MHz
                            Shader Clock    3304 MHz
                Count of performance levels : 1
                        Level 1 - "Perf Level 0"
                            GPU Clock   1139 MHz
                            Shader Clock    3304 MHz
        ASPEED Technology ASPEED Graphics Family(WDDM)
            Manufacturer    ASPEED Technology
            Model   ASPEED Graphics Family(WDDM)
            Device ID   1A03-2000
            Revision    31
            Subvendor   ASUStek Computer Inc (1043)
            Current Performance Level   Level 0
            Voltage 0.993 V
            Driver version  9.0.10.106
                Count of performance levels : 1
                    Level 1 - "Perf Level 0"
        NVIDIA GeForce GTX 980 Ti
            Manufacturer    NVIDIA
            Model   GeForce GTX 980 Ti
            Device ID   10DE-17C8
            Revision    A2
            Subvendor   NVIDIA (10DE)
            Current Performance Level   Level 0
            Current GPU Clock   999 MHz
            Current Memory Clock    3304 MHz
            Current Shader Clock    3304 MHz
            Voltage 0.993 V
            Current Performance Level   Level 0
            Current GPU Clock   1101 MHz
            Current Memory Clock    3304 MHz
            Current Shader Clock    3304 MHz
            Voltage 0.993 V
            Technology  28 nm
            Bus Interface   PCI Express x16
            Temperature 83 °C
            SLI Disabled
            Driver version  26.21.14.4141
            BIOS Version    84.00.32.00.01
            Physical Memory 2047 MB
            Virtual Memory  2048 MB
                Count of performance levels : 1
                        Level 1 - "Perf Level 0"
                            GPU Clock   1012 MHz
                            Shader Clock    3304 MHz
                Count of performance levels : 1
                        Level 1 - "Perf Level 0"
                            GPU Clock   1139 MHz
                            Shader Clock    3304 MHz
        NVIDIA GeForce GTX 980 Ti
            Manufacturer    NVIDIA
            Model   GeForce GTX 980 Ti
            Device ID   10DE-17C8
            Revision    A2
            Subvendor   HP (103C)
            Current Performance Level   Level 0
            Current GPU Clock   999 MHz
            Current Memory Clock    3304 MHz
            Current Shader Clock    3304 MHz
            Voltage 0.993 V
            Current Performance Level   Level 0
            Current GPU Clock   1101 MHz
            Current Memory Clock    3304 MHz
            Current Shader Clock    3304 MHz
            Voltage 0.993 V
            Technology  28 nm
            Bus Interface   PCI Express x16
            Temperature 75 °C
            SLI Disabled
            Driver version  26.21.14.4141
            BIOS Version    84.00.4b.00.05
            Physical Memory 2047 MB
            Virtual Memory  2048 MB
                Count of performance levels : 1
                        Level 1 - "Perf Level 0"
                            GPU Clock   1012 MHz
                            Shader Clock    3304 MHz
                Count of performance levels : 1
                        Level 1 - "Perf Level 0"
                            GPU Clock   1139 MHz
                            Shader Clock    3304 MHz
AlexeyAB commented 4 years ago

Which parameters and libraries did you compile Darknet with?

What versions of CUDA, cuDNN, OpenCV and the NVIDIA driver do you use?

ryamldess commented 4 years ago

As posted above, original darknet parameters were as follows:

cmake-gui-darknet-1

Most recent parameters were:

cmake-gui-darknet-2

CUDA: 10.2; cuDNN: 10.2/7.6.5.32; OpenCV: 4.1.2; Driver: 441.41 (441.66 was released a few days ago, but all of my machines have 441.41, and this is the only machine experiencing the issue)

All of my systems are using the above versions. I have retired the 2x 1060 machine and the first 2x Titan Black machine from testing, as they seem fine; my other 2x Titan Black machine finally finished transferring files and is training against COCO 2017, and the 980 Ti machine is training with its first two cards (the Dell subvendor variants) on the 2014 dataset. I've downloaded NVFlash to update the firmware on all of the cards once I either fail or succeed at isolating the issue to a particular card.

ryamldess commented 4 years ago

Okay, so the first two cards seem to work fine; I stopped training at ~16,000 iterations:

success_6

I'm now going to try with 3 cards and see if I can either isolate it to a single card or determine whether it occurs any time I use 4 cards. It's possible that CUDA is having issues with the 4G addressing (i.e., 64-bit) that I mentioned earlier, which doesn't really make sense, as I downloaded the 64-bit versions of all software; and, as I mentioned, these cards won't, and shouldn't, work without that setting enabled. Anything beyond Maxwell probably has to use 4G addressing anyway. The motherboard is an X99-architecture server board designed to hold 4 GPUs, so it should have enough PCIe bandwidth. Power wise it's plugged into a 3000W UPS on a 30A/3600W outlet on its own circuit, so there should be plenty of juice.

After I determine whether it's a specific 3rd/4th card or any 4 cards, I'll flash everything with the same BIOS and try again with all 4.

ryamldess commented 4 years ago

For reference, here is a screenshot of the second 2x Titan Black machine, which I'm stopping after ~11,000 iterations, as it seems fine:

2x_titan_black-success

This is just for follow-through - training runs fine on 3 out of 4 machines, with architectures both above (1060) and below (Titan Black) the 980 Ti, which I think means we can rule out other hardware differences. Also, one of those 3 other machines is X99 and the other two are X79, and all have different processors.

ryamldess commented 4 years ago

Running with cards 0,1,2 is fine; I stopped it after 11,000 iterations:

success_7

I'll now try training with 0,1,3 overnight... it will be interesting to see if the 4th card is the culprit or if it just fails in general with all 4 cards running. If it's the former, then maybe there is some sort of firmware issue; not sure why it would fail at such a specific point though. If it's the latter, maybe it's a firmware compatibility issue across the cards, or possibly there is a PCIe bandwidth problem on this mobo, but that wouldn't make sense as it is specifically designed to be able to run 4-way SLI. Either way, at least I will have a direction to explore.
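
For anyone following along, selecting which cards darknet uses is just the -gpus flag on the train command. The data/cfg/weights names below are placeholders rather than my exact files, so treat this as the general form only:

```
rem General form only - data/cfg/weights names here are placeholders, not my actual files
darknet.exe detector train data\obj.data cfg\my_model.cfg backup\my_model_last.weights -gpus 0,1,2

rem Same run, but skipping card 2 and using card 3 instead
darknet.exe detector train data\obj.data cfg\my_model.cfg backup\my_model_last.weights -gpus 0,1,3
```

As far as I can tell that's the whole extent of the per-card selection darknet exposes, which is why I'm cycling through subsets rather than doing anything fancier.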

ryamldess commented 4 years ago

Well, this is a bit of a bummer, but my 0,1,3 test seems to confirm that any 4 cards will throw the CUDA out of memory error, since these 3 cards run fine past 10,500 iterations:

success_8

What is particularly strange is that memory usage as reported by MSI Afterburner never exceeds a couple of GB. I wonder if Afterburner is prone to a defect similar to Speccy's that makes it under-report VRAM. I have a similar issue on one of the Titan Black machines with a 4960X processor - I can only run 2 of the 3 GPUs on that machine, because if I run 3 the third card experiences a buffer overrun that flows over onto its main drive, which is a PCIe SSD in the last slot, resulting in the odd effect that the C: drive ramps up to 100% usage and the system becomes unresponsive.
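
To take Afterburner and Speccy out of the equation, I may build a tiny standalone CUDA tool that just asks the runtime directly. A minimal sketch (my own, nothing to do with darknet) would be:

```c
#include <stdio.h>
#include <cuda_runtime.h>

/* Standalone check of per-device VRAM as seen by the CUDA runtime itself,
 * independent of Afterburner/Speccy. Build with: nvcc meminfo.cu -o meminfo */
int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        size_t free_bytes = 0, total_bytes = 0;
        cudaSetDevice(i);
        cudaMemGetInfo(&free_bytes, &total_bytes);
        printf("GPU %d: %.0f MiB free of %.0f MiB\n", i,
               free_bytes / (1024.0 * 1024.0), total_bytes / (1024.0 * 1024.0));
    }
    return 0;
}
```

Running that periodically while training with 4 cards should show whether the 6 GB cards really are close to full as the failure point approaches.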

The machine that is the basis of this issue, on the other hand, has no problem training on 3 cards, and had no issue with 4 Titan Blacks. This is a bit of a bummer, as 4x 980 Ti's are about 20% faster than 4x Titan Blacks - but 3x 980 Ti's are about 10 minutes slower per 1,000 iterations than 4x Titan Blacks.

I think it's safe to say this is a CUDA bug or possibly a hardware issue at this point, but I'm going to keep this open for a little bit in case anyone reading has any suggestions and to post my last few steps to address it in case it helps anyone with similar issues.

The only things I can think to try next are: 1) updating firmware on all cards and updating drivers to the slightly newer one, 441.66; and 2) potentially scaling back the image size in my config. Even though software reports that the memory is fine, real life says it is not, so I need to find ways to reduce memory usage and see if that helps.
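
For (2), what I have in mind is the [net] section of the .cfg - roughly the knobs below, with example values rather than my actual settings; as I understand the readme, subdivisions and the input resolution are the main levers on VRAM:

```
[net]
# batch stays the same; higher subdivisions = smaller chunks per GPU = less VRAM per step
batch=64
subdivisions=32
# smaller network input resolution cuts activation memory roughly quadratically
width=416
height=416
```

Setting random=0 in the [yolo] layers should also cap peak memory, if I'm reading the readme correctly, since multi-scale training periodically resizes the network upward.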

AlexeyAB commented 4 years ago

> because if I run 3 the third card experiences a buffer overrun that flows over onto its main drive, which is a PCIe SSD in the last slot, resulting in the odd effect that the C: drive ramps up to 100% usage and the system becomes unresponsive.

It seems that there is some bug in hardware (PCIe-controller on CPU or GPU). Or some bug in drivers.

> Power wise it's plugged into a 3000W UPS on a 30A/3600W outlet on its own circuit, so there should be plenty of juice.

How many watts is your power supply, and which one do you use?

> potentially scaling back the image size in my config. Even though software reports that the memory is fine, real life says it is not, so I need to find ways to reduce memory usage and see if that helps.

The size of which images do you want to scale back?

ryamldess commented 4 years ago

> It seems that there is some bug in hardware (PCIe-controller on CPU or GPU). Or some bug in drivers.

Probably; that's on one of the machines currently running Titan Blacks though.

> How many watts is your power supply, and which one do you use?

The dual-Xeon/4x 980 Ti workstation having the issue was originally built on an X79 board, back before 1300, 1500 and 1600W PSUs were available, so it has two Corsair 1200 AXi's. The GPUs are on one PSU, and all of the other components, including the mobo, are on the other PSU (it now has an X99-architecture mobo, CPUs and RAM). The UPS shows that the whole system is drawing about 1200W at full load during training with 4 GPUs, so I think power shouldn't be an issue.

ryamldess commented 4 years ago

I've updated the mobo BIOS and GPU drivers; I downloaded new firmware for the mobo as well, but it has no documentation and I couldn't find a way to flash it.

I looked into flashing the GPUs to update them all to the same BIOS, but this seems ill-advised. For starters, they have 3 different subvendors - two cards are Dell, one is Nvidia and one is HP - so I would need 3 different BIOS ROMs, and it is literally impossible to put them all on the same BIOS. Moreover, GPU ROM hacking seems to have gone dark around 2012 for some reason; all of the software and forum posts about it date no later than that. I downloaded NiBiTor just to view the BIOS files, and even the latest version is so out of date that it only recognizes cards up to the GTX 800 series. GPU-Z and NVFlash both seem to still work as advertised. No manufacturer seems to offer BIOS ROMs either - they are only available from 3rd parties, which makes me suspicious. It seems like you would need to start by writing a new ROM editor from scratch, as no good ones appear to exist for cards released after 2012. I could devote time to it, but it's a huge rabbit hole I don't think is worth my time at the moment: any 4 cards cause the error and any 3 work fine, so it doesn't appear to be a BIOS issue, or that any of the cards are fake, or anything like that. It would be a Herculean effort for minimal or zero return.

So I'll try training again with the latest drivers and BIOS.

After that, the only thing left is to look into power - first verifying that it is wired the way I described in the comments above. I also have the machine hooked up somewhat strangely at the moment, so that is worth exploring. I have an 1875W outdoor extension cable running to the UPS, because the UPS is in a different room on a different floor; the machine is usually powered from a 1500VA/900W unit, but that is obviously insufficient under these loads (I switched to the 3000VA/3000W unit when the smaller UPS started beeping very loudly and switched to battery as soon as darknet started up with 4 cards). That cord is fine - 1875W is more than enough - but it terminates in a Y-splitter cable that I believe is also rated for 1875W, with each branch of the Y going to one of the two PSUs. What I don't know is whether the splitter can only handle half of 1875W per branch, or whether each branch can handle whatever the other branch is not pulling, up to 1875W. If it's the former, then hypothetically each PSU can only get 937.5W. That should still be enough, based on what the UPS says the system is pulling minus what it pulls at idle - but these are all things to investigate. I don't think power is the issue, but it's one of the few things left to try.

While this machine worked fine in my smaller test with my own custom class at 1,300 iterations with 4 Titan Blacks, I never ran the MS COCO test on this machine with 4 Titan Blacks. It's possible that this machine can only train with 3 cards, owing to the same, as yet unidentified, root cause. It's possible that upgrading to a single 1600W PSU would resolve the issue. I may or may not do that (I kind of want to, just on principle - I've never liked the two-PSU configuration of this and one other machine), but it's just as likely I will simply train on 3 GPUs for now.

AlexeyAB commented 4 years ago

I think this is an issue with power; it is less likely to be a bug in hardware (the PCIe controller on the CPU or GPU) or a bug in drivers.

ryamldess commented 4 years ago

I actually have noticed something odd in MSI Afterburner. Take a look at these screenshots:

train_9_gpu-use

GPU usage and FB usage. Notice how different GPU4 looks from the other GPUs?

train_9_gpu-use-bus

BUS usage on GPU4 is also more active than on the other GPUs.

train_9_gpu-power-speed

Meanwhile, fan power and tach are significantly lower than on the other GPUs.

train_9_gpu-voltage

GPU voltage on GPU4 is all over the place, while it is relatively placid on the other GPUs.

However, this doesn't make sense to me, as a training run with GPUs 0,1,3 seemed fine; unless it's just that GPU4 is in the last slot, so it's the one that suffers the effects when 4 GPUs are running.

ryamldess commented 4 years ago

I've just confirmed that GPU4 behaves the same way even in a 0,1,3 training run - what I'm not sure about is whether that means anything, or if it's just the normal difference between the BIOSes on the Dell and HP subvendor variants of the card. I also notice that the core clock on GPU4 seems to be pegged at 1190 MHz, whereas on GPUs 1 & 2 it fluctuates from 1,000 to ~1,100 MHz, and on GPU 3 it seems to be pegged at 999 MHz. It's possible someone modded the card; the question is why it is only an issue when all 4 cards are running. Perhaps it is overclocked and drawing more power than it ought to. The other question is why, if that's the issue, it always fails around the same number of iterations, and why the failure has to do with memory.

ryamldess commented 4 years ago

Well, I tried flashing GPU 4 with the reference Nvidia BIOS, the same BIOS card 3 has... it did not go well. It flashed successfully, but the card was then no longer recognized by the system on reboot, and darknet failed with the message "CUDA error: invalid device ordinal". I re-flashed to the original ROM I had exported using GPU-Z, which for some reason disabled all of my other GPUs, so I had to restart in safe mode to re-enable them. Everything is back the way it was and running as it was again.

It looks like I'm stuck with the original HP BIOS on that GPU. It seems to have a different PCI interface than the other cards, as reported by nvflash64.
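
For anyone retracing the flashing steps, this is roughly what I did with nvflash64. The flag spellings below are from memory and differ between nvflash releases, so verify them against the tool's built-in help before using any of this:

```
rem Back up the existing ROM before touching anything (I also exported it with GPU-Z)
nvflash64 --list
nvflash64 --index=3 --save original_gpu4.rom

rem Flash a different ROM; -6 overrides the PCI subsystem ID mismatch check
nvflash64 --index=3 -6 reference_980ti.rom

rem Restore the backup if the card stops being recognized
nvflash64 --index=3 -6 original_gpu4.rom
```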

I'm attempting another 4-GPU run to see if the other changes helped anything. If not, I'll look at twiddling a few things with the power setup, and if that doesn't work I will probably just give up and train with 3 cards for now.

ryamldess commented 4 years ago

I ran a separate 1875W cable from the UPS so that each PSU now has a dedicated 1875W cable. That didn't seem to have any effect. Both before and after the cable change, the power draw fluctuates between 750 and 1,000W.

I inspected the PSU cabling as well, and found that I had split the GPU power between both PSUs - I'm using two PCIe cables per GPU as the 1200 AXi didn't come with enough PCIe Y-cables. In principle, I don't think this is a horrible idea; it was a while ago when I wired these PSUs, but my thinking must have been that since the GPUs draw the most power, why not distribute the load between the PSUs?

However, I do see a slight but noticeable difference in some of the MSI Afterburner graphs now. Remember how spiky and different the usage/FB/BUS graphs were from the other GPUs? They're still different like that, but now look more similar to the other GPUs. Memory and core clock are still far more active, though; fan power and tach are also much lower, which I'm sure is just owing to a different fan curve in the HP BIOS on this card. Blech. HP.

It's not different enough to give me a lot of confidence in a solution, but after this I think I've done everything I can possibly do to get four of these cards running on this machine with darknet.

Since they are so cheap, I will probably purchase one more used Nvidia reference 980 Ti on eBay to hedge my bets. My living room gaming PC has a 970, so if 4 cards still won't work in this rig, the living room PC will get an upgrade to 2x 980 Ti's; otherwise it'll get either one 980 Ti or two of the old Titan Blacks, any of which would be an upgrade from a 970. Or I could go back to 4 Titan Blacks in the Xeon training rig, since 4x Titan Blacks are slightly faster than 3x 980 Ti's - which would be a huge bummer, as 4x 980 Ti's are about 20% faster than 4x TB's. In that case my other machines would get a sort of poor man's X-mas upgrade, as my main workstation just has a pair of 1060's, and the 980 Ti performs better than those as well (the 1060 also doesn't support SLI).

Fingers crossed that the power twiddling or the reference GPU does the trick. I'd rather that be the case and upgrade my other machines to RTX in the future; but at least the 980 Ti's are sufficiently better than my current crop of old cards that they can serve a purpose even if I can't get this resolved. The cabling weirdness has also made me resolve to upgrade this machine, and maybe one other that has two AXi's as well, to a single, slightly larger PSU. Both machines were originally built around 2014, when 1200W was the largest PSU capacity available on the market. One 1200W unit is hypothetically enough, but I like to have a little more overhead for aging (and if I were to load both CPUs simultaneously while the GPUs are fully loaded, it would get pretty close to 1200W); calculators also recommend slightly more capacity for these machines, around 1400W. They both use a little PCB device to wire the PSUs in serial, and I've had trivial but annoying issues with it for years. On this machine, for instance, as soon as you power on both PSU switches, the machine automatically turns on, which shouldn't happen. Not a huge deal, but not ideal.

I'll post back in 8 hours or so with results of this run.

ryamldess commented 4 years ago

So that didn't work; I'm just going to have to train with 3 cards for now.

ryamldess commented 4 years ago

Well, I still have not resolved this; I'm still waiting for some additional cheap eBay GPUs to see if one of them will rectify the situation. Another issue could hypothetically be PCIe lanes, but this board and its processors should have the bandwidth.

In any event, I've trained my custom model on 3 cards with about 4,435 images, so I thought I'd post the results just to have a little good news on this thread as we ring out the new year:

chart

It trained fairly quickly, in about 12 hours, with outstanding results in terms of average loss. In inference it is performing the required task - identifying unique pages of a collage journal in real time - about 70% correctly. We probably need to capture more data, but given that each spread of the journal only has about 75 images at the moment, we think that's actually pretty outstanding. We're going to try running it in the gallery space on a little Skull Canyon NUC, so I had to recompile darknet without GPU support, as it just has integrated graphics. It runs, although it's hot (cooler when I put it on a laptop cooling stand) and slow, at about 0.9 FPS. That actually works great for me, though, because I was trying to figure out how to slow down sampling from the JSON stream. I figure that will be the fastest way to get the labels from inference: I'll just poll localhost:8070 every so often (I was planning on 1 Hz anyway), read the last 20 lines or so from the page, throw away the HTML tags, if any, as well as any incomplete JSON objects at the beginning of the sample, and deserialize the few objects left... then do a bunch of conditional logic to keep track of labels to know when they change, average confidences, etc. I could also fork your fork and write a socket or something that just pumps out labels and confidence numbers, but time is short for this project, so I'll probably stick with the current approach.

ryamldess commented 4 years ago

Success! One of the other used cards arrived and I installed it yesterday, and I trained to about 22,000 iterations before stopping. The card in question was listed as a PNY card, but interestingly it has the same subvendor as the card it replaced, which is HP, and it looks just like any stock 980 Ti. It seems to have a similar BIOS because, like the other 'HP' card, it has lower fan curves, so it runs slightly hotter than the other cards, and its usage is a little spikier. So it would appear that the 4th card, i.e. the previous HP card, has some sort of issue. It's still strange to me that the symptom is so specific: not only does it occur at specifically 10,500 iterations every time, but it only happens if 4 cards are being used. If 3 cards are being used, the problem card behaves normally - for instance, training with GPUs 0,1,3 worked fine. The only reason I can isolate it to that card is that 4-GPU training works once it has been replaced. So unfortunately I probably won't ever have a satisfactory root cause.

If I have a chance, I might download the BIOS from the working card and try flashing it onto the malfunctioning card since they are the same card and have the same subvendor. They even have the same BIOS version, which is 84.00.4b.00.05; but possibly the BIOS was modified in some way by the previous owner.

chart

jackneil commented 3 years ago

In case other people stumble across this later ... this happens frequently to me if I'm using the computer that is training to browse the internet with Chrome. Chrome uses hardware acceleration and at times may consume GPU resources, leaving the GPU without enough memory to handle training. Turning off 'Use Hardware Acceleration' in Chrome will force it to use the CPU and prevent this from happening intermittently.