PaddlePaddle / Paddle

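PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (core framework of 『飞桨』 PaddlePaddle: high-performance single-machine and distributed training, and cross-platform deployment, for deep learning and machine learning)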
http://www.paddlepaddle.org/
Apache License 2.0

OSError: (External) CUDA error(719), unspecified launch failure. #57411

Closed. Grula closed this issue 1 year ago.

Grula commented 1 year ago

Describe the Bug

I was following the Text Detection tutorial to run text detection training on the ICDAR2015 dataset: $ python3 tools/train.py -c configs/det/det_r50_vd_db.yml. Everything is the same as shown in the example, except for a small adjustment in the yml file to the number of images per batch. After a while (5-10 minutes) the terminal would freeze and the following error would appear:


[2023/09/15 19:33:12] ppocr INFO: epoch: [3/1200], global_step: 750, lr: 0.001000, loss: 2.015131, loss_shrink_maps: 1.127646, loss_threshold_maps: 0.613376, loss_binary_maps: 0.226934, loss_cbn: 0.000000, avg_reader_cost: 0.00010 s, avg_batch_cost: 0.18020 s, avg_samples: 4.0, ips: 22.19765 samples/s, eta: 15:13:26
[2023/09/15 19:33:12] ppocr INFO: save model in ./output/det_r50_vd/latest
[2023/09/15 19:33:15] ppocr INFO: epoch: [4/1200], global_step: 760, lr: 0.001000, loss: 1.859309, loss_shrink_maps: 1.059979, loss_threshold_maps: 0.603668, loss_binary_maps: 0.213640, loss_cbn: 0.000000, avg_reader_cost: 0.06058 s, avg_batch_cost: 0.24114 s, avg_samples: 4.0, ips: 16.58761 samples/s, eta: 15:17:13
[2023/09/15 19:33:18] ppocr INFO: epoch: [4/1200], global_step: 770, lr: 0.001000, loss: 2.244609, loss_shrink_maps: 1.362141, loss_threshold_maps: 0.616605, loss_binary_maps: 0.272549, loss_cbn: 0.000000, avg_reader_cost: 0.00011 s, avg_batch_cost: 0.18096 s, avg_samples: 4.0, ips: 22.10377 samples/s, eta: 15:16:59
Traceback (most recent call last):
  File "/home/platypus/projects/PaddleOCR/tools/train.py", line 227, in <module>
    main(config, device, logger, vdl_writer)
  File "/home/platypus/projects/PaddleOCR/tools/train.py", line 198, in main
    program.train(config, train_dataloader, valid_dataloader, device, model,
  File "/home/platypus/projects/PaddleOCR/tools/program.py", line 349, in train
    stats = {
  File "/home/platypus/projects/PaddleOCR/tools/program.py", line 350, in <dictcomp>
    k: float(v) if v.shape == [] else v.numpy().mean()
  File "/home/platypus/miniconda3/envs/paddle/lib/python3.10/site-packages/paddle/fluid/dygraph/math_op_patch.py", line 117, in _float_
    return float(np.array(var).flatten()[0])
  File "/home/platypus/miniconda3/envs/paddle/lib/python3.10/site-packages/paddle/fluid/dygraph/tensor_patch_methods.py", line 696, in __array__
    array = self.numpy(False)
OSError: (External) CUDA error(719), unspecified launch failure. 
  [Hint: 'cudaErrorLaunchFailure'. An exception occurred on the device while executing a kernel. Common causes include dereferencing an invalid device pointerand accessing out of bounds shared memory. Less common cases can be system specific - more information about these cases canbe found in the system specific user guide. This leaves the process in an inconsistent state and any further CUDA work willreturn the same error. To continue using CUDA, the process must be terminated and relaunched.] (at ../paddle/phi/backends/gpu/cuda/cuda_info.cc:267)
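
CUDA kernel launches are asynchronous, so the Python frame in the traceback above (the `float(v)` conversion in `program.py`) is most likely just the first point where the host synchronized with the device, not necessarily the operation that faulted. One common way to narrow this down is to rerun with the CUDA environment variable `CUDA_LAUNCH_BLOCKING=1`, which makes launches synchronous so the error is reported closer to the failing kernel. A minimal sketch, assuming the run is started from the PaddleOCR repository root; the helper names `blocking_env` and `rerun_training` are hypothetical and not part of Paddle or PaddleOCR:

```python
import os
import subprocess
import sys


def blocking_env(base=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base is None else base)
    # Must be set before the process creates its CUDA context.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def rerun_training(config="configs/det/det_r50_vd_db.yml"):
    """Relaunch the training command from the report in a fresh process."""
    cmd = [sys.executable, "tools/train.py", "-c", config]
    # A fresh process is required: after error 719 the old one cannot use CUDA.
    return subprocess.run(cmd, env=blocking_env(), check=False).returncode
```

Training will run more slowly in this mode, so it is only meant for reproducing the failure, not for regular runs.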

OS information :

NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.6 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

GPU Information:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060        On  | 00000000:05:00.0 Off |                  N/A |
| 48%   65C    P2             138W / 170W |   4999MiB / 12288MiB |     99%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1386      G   /usr/lib/xorg/Xorg                           25MiB |
|    0   N/A  N/A      1430      G   /usr/bin/gnome-shell                          7MiB |
|    0   N/A  N/A    120257      C   python3                                    4956MiB |
+---------------------------------------------------------------------------------------+

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0

cuDNN Version: 8.5.

From paddle:

W0915 19:37:09.499838 120257 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 12.2, Runtime API Version: 11.7
W0915 19:37:09.502058 120257 gpu_resources.cc:149] device: 0, cuDNN Version: 8.5.

CPU information

Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
Address sizes:                      48 bits physical, 48 bits virtual
CPU(s):                             12
On-line CPU(s) list:                0-11
Thread(s) per core:                 2
Core(s) per socket:                 6
Socket(s):                          1
NUMA node(s):                       1
Vendor ID:                          AuthenticAMD
CPU family:                         25
Model:                              33
Model name:                         AMD Ryzen 5 5600 6-Core Processor
Stepping:                           2
Frequency boost:                    enabled
CPU MHz:                            2616.856
CPU max MHz:                        3500.0000
CPU min MHz:                        2200.0000
BogoMIPS:                           6987.15
Virtualization:                     AMD-V
L1d cache:                          192 KiB
L1i cache:                          192 KiB
L2 cache:                           3 MiB
L3 cache:                           32 MiB
NUMA node0 CPU(s):                  0-11
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl 
                                    nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_
                                    legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb st
                                    ibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm
                                    _mbm_local clzero irperf xsaveerptr wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgi
                                    f umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca

Installed Paddle libraries (as stated in the documentation):

$ python3 -m pip install paddlepaddle-gpu==2.5.1.post117 -f https://www.paddlepaddle.org.cn/whl/linux/cudnnin/stable.html

python & conda versions:

Python 3.10.13
conda 23.7.4

Cloned and run from branch: origin/release/2.4 (79fc4c5befac276de7208d0dd8f8ac0bc2d5e2b1)

Additional Supplementary Information

Side note: I noticed the server freezing for a couple of seconds before the error was displayed. The following messages also appear in the output of $ dmesg:

[74341.482755] OOM killer enabled.
[74341.482755] Restarting tasks ... done.
[74341.484843] PM: suspend exit
[74341.486153] NVRM: Xid (PCI:0000:05:00): 31, pid=120257, name=python3, Ch 00000024, intr 00000000. MMU Fault: ENGINE HOST3 HUBCLIENT_ESC faulted @ 0x2_00224000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
[74341.488630] rfkill: input handler enabled
[74341.687025] rfkill: input handler disabled
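
The `NVRM: Xid` line is the most useful part of this dmesg output. NVIDIA's Xid documentation lists Xid 31 as a GPU memory page fault, which can come from an application accessing an illegal address but also from a driver or hardware problem. When collecting evidence across several crashes, it can help to pull the fields out of each line. A minimal sketch, assuming the line layout matches the one shown above (other driver versions may format the `pid` and `name` fields differently); `parse_xid` is a hypothetical helper:

```python
import re

# Pattern modeled on the single Xid line in this report; treat it as an assumption.
XID_LINE = re.compile(
    r"NVRM: Xid \((?P<bus>[^)]+)\): (?P<code>\d+), "
    r"pid=(?P<pid>\d+), name=(?P<name>[^,]+), (?P<detail>.*)"
)


def parse_xid(line):
    """Extract bus id, Xid code, pid, process name and detail text, or None."""
    match = XID_LINE.search(line)
    if match is None:
        return None
    return {
        "bus": match["bus"],
        "code": int(match["code"]),
        "pid": int(match["pid"]),
        "name": match["name"],
        "detail": match["detail"].strip(),
    }
```

Applied to the line above, this yields code 31 for pid 120257 (`python3`), the same pid as the training process in the nvidia-smi process table.
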
dyh commented 9 months ago

I hit the same error when running OCR inference from Python code. Has anyone solved this?

` [{'res': 'B', 'score': 0.9999068975448608, 'left_top_xy': [21.0, 13.0], 'right_top_xy': [43.0, 18.0], 'right_bottom_xy': [37.0, 49.0], 'left_bottom_xy': [15.0, 44.0], 'bbox_center_xy': (29, 31)}]
[2023/12/21 16:34:59] ppocr DEBUG: dt_boxes num : 12, elapsed : 0.027253150939941406
[2023/12/21 16:34:59] ppocr DEBUG: cls num : 12, elapsed : 0.024739980697631836
(External) CUDA error(719), unspecified launch failure. [Hint: 'cudaErrorLaunchFailure'. An exception occurred on the device while executing a kernel. Common causes include dereferencing an invalid device pointerand accessing out of bounds shared memory. Less common cases can be system specific - more information about these cases canbe found in the system specific user guide. This leaves the process in an inconsistent state and any further CUDA work willreturn the same error. To continue using CUDA, the process must be terminated and relaunched.] (at ../paddle/phi/backends/gpu/cuda/cuda_info.cc:272)
[operator < reshape2 > error] (External) CUDA error(719), unspecified launch failure. [Hint: 'cudaErrorLaunchFailure'. An exception occurred on the device while executing a kernel. Common causes include dereferencing an invalid device pointerand accessing out of bounds shared memory. Less common cases can be system specific - more information about these cases canbe found in the system specific user guide. This leaves the process in an inconsistent state and any further CUDA work willreturn the same error. To continue using CUDA, the process must be terminated and relaunched.] (at ../paddle/phi/backends/gpu/cuda/cuda_info.cc:265)
(External) CUDA error(719), unspecified launch failure. [Hint: 'cudaErrorLaunchFailure'. An exception occurred on the device while executing a kernel. Common causes include dereferencing an invalid device pointerand accessing out of bounds shared memory. Less common cases can be system specific - more information about these cases canbe found in the system specific user guide. This leaves the process in an inconsistent state and any further CUDA work willreturn the same error. To continue using CUDA, the process must be terminated and relaunched.] (at ../paddle/phi/backends/gpu/cuda/cuda_info.cc:265)
(External) CUDA error(719), unspecified launch failure. [Hint: 'cudaErrorLaunchFailure'. An exception occurred on the device while executing a kernel. Common causes include dereferencing an invalid device pointerand accessing out of bounds shared memory. Less common cases can be system specific - more information about these cases canbe found in the system specific user guide. This leaves the process in an inconsistent state and any further CUDA work willreturn the same error. To continue using CUDA, the process must be terminated and relaunched.] (at ../paddle/phi/backends/gpu/cuda/cuda_info.cc:265) `

Grula commented 9 months ago

@dyh In my case the cause turned out to be a bug in the NVIDIA driver. After some digging on the NVIDIA forums, I found that the driver version with the bug fixed is: NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8
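
To compare the driver on a given machine with the version mentioned in this comment, the version can be read from the nvidia-smi header line. A minimal sketch that only parses text already captured from `nvidia-smi`; `driver_version` is a hypothetical helper, and whether a particular version contains the fix is something this thread reports, not something the code can verify:

```python
import re


def driver_version(smi_text):
    """Return the driver version from nvidia-smi header text as a tuple of ints."""
    match = re.search(r"Driver Version:\s*(\d+(?:\.\d+)*)", smi_text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


# Header line from the original report, and the version reported to work here.
reported = driver_version(
    "| NVIDIA-SMI 535.104.05   Driver Version: 535.104.05   CUDA Version: 12.2 |"
)
working = driver_version(
    "NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8"
)
print(reported, working, reported == working)
```

Comparing tuples rather than strings avoids surprises from zero-padded components such as `05`.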

dyh commented 9 months ago

@Grula Thank you for your reply! We updated our Paddle version to PaddlePaddle 2.5.0, and it works fine.

dcdethan commented 5 months ago

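I ran into this problem too. In my case, the OCR text box coordinates went outside the image's width and height range. This was caused by rotation; once I made the rotation consistent, it worked.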
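
For the case in the last comment, where text box coordinates fall outside the image, one defensive option is to check each box and clip it before it is used to crop or index the image. A minimal sketch in plain Python, assuming boxes are lists of `[x, y]` corner points like the `left_top_xy` values in the inference log earlier in this thread; `box_in_bounds` and `clip_box` are hypothetical helpers, not PaddleOCR functions:

```python
def box_in_bounds(points, width, height):
    """True if every corner lies inside a width x height image."""
    return all(0 <= x <= width - 1 and 0 <= y <= height - 1 for x, y in points)


def clip_box(points, width, height):
    """Clamp each corner to the valid pixel range of the image."""
    return [
        [min(max(x, 0), width - 1), min(max(y, 0), height - 1)]
        for x, y in points
    ]


# Example: a rotated box that spills over the edges of a 40 x 100 image.
box = [[-3, 5], [50, 5], [50, 120], [-3, 120]]
if not box_in_bounds(box, 40, 100):
    box = clip_box(box, 40, 100)
```

Clipping only hides the symptom; if the boxes are out of range because the image and the boxes were rotated differently, fixing the rotation as described above is the actual remedy.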