PaddlePaddle / Paddle

PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the 『飞桨』(PaddlePaddle) core framework: high-performance single-machine and distributed training and cross-platform deployment for deep learning & machine learning)
http://www.paddlepaddle.org/
Apache License 2.0

Error while training a model: W0727 13:09:26.365279 5320 sampler.cpp:139] bvar is busy at sampling for 2 seconds! #55740

Closed: liangshu-code closed this issue 1 month ago

liangshu-code commented 1 year ago

Issue Description

Training ran fine at first, but around the second epoch the machine started to freeze, and then the log printed "bvar is busy at sampling for 2 seconds!" followed by "已杀死" ("Killed").

```
W0727 11:59:08.600850  5305 device_context.cc:404] Please NOTE: device: 0, GPU Compute Capability: 6.1, Driver API Version: 12.2, Runtime API Version: 10.2
W0727 11:59:08.946987  5305 device_context.cc:422] device: 0, cuDNN Version: 8.4.
/home/test/.local/lib/python3.8/site-packages/paddle/fluid/reader.py:139: DeprecationWarning: `np.object` is a deprecated alias for the builtin `object`. To silence this warning, use `object` by itself. Doing this will not modify any behavior and is safe. Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  if arr.dtype == np.object:
Train [2023-07-27 11:59:47.066181] epoch: [1/50], batch: [0/8832], learning rate: 0.00050000, train loss: 361.576355, eta: 125 days, 15:34:18
Train [2023-07-27 12:00:09.235985] epoch: [1/50], batch: [100/8832], learning rate: 0.00050000, train loss: 52.604530, eta: 1 day, 4:47:25
Train [2023-07-27 12:00:34.371266] epoch: [1/50], batch: [200/8832], learning rate: 0.00050000, train loss: 55.890755, eta: 1 day, 6:20:33
...
Train [2023-07-27 12:23:07.509179] epoch: [1/50], batch: [4100/8832], learning rate: 0.00050000, train loss: 89.490135, eta: 2 days, 1:56:30
Train [2023-07-27 12:23:49.142086] epoch: [1/50], batch: [4200/8832], learning rate: 0.00050000, train loss: 89.504593, eta: 2 days, 0:57:51
Train [2023-07-27 12:24:30.959731] epoch: [1/50], batch: [4300/8832], learning rate: 0.00050000, train loss: nan, eta: 2 days, 1:53:56
Train [2023-07-27 12:25:13.079878] epoch: [1/50], batch: [4400/8832], learning rate: 0.00050000, train loss: nan, eta: 2 days, 2:08:04
Train [2023-07-27 12:25:55.459804] epoch: [1/50], batch: [4500/8832], learning rate: 0.00050000, train loss: nan, eta: 2 days, 2:28:44
...
Train [2023-07-27 13:04:30.293408] epoch: [1/50], batch: [8700/8832], learning rate: 0.00050000, train loss: nan, eta: 4 days, 2:34:18
Train [2023-07-27 13:06:01.480059] epoch: [1/50], batch: [8800/8832], learning rate: 0.00050000, train loss: nan, eta: 4 days, 19:35:28
W0727 13:09:26.365279  5320 sampler.cpp:139] bvar is busy at sampling for 2 seconds!
W0727 13:10:33.379230  5320 sampler.cpp:139] bvar is busy at sampling for 2 seconds!
W0727 13:12:03.142860  5320 sampler.cpp:139] bvar is busy at sampling for 2 seconds!
W0727 13:14:12.113915  5320 sampler.cpp:139] bvar is busy at sampling for 2 seconds!
W0727 13:15:03.425264  5320 sampler.cpp:139] bvar is busy at sampling for 2 seconds!
W0727 13:17:22.002276  5320 sampler.cpp:139] bvar is busy at sampling for 2 seconds!
W0727 13:20:31.183709  5320 sampler.cpp:139] bvar is busy at sampling for 2 seconds!
W0727 13:22:06.674037  5320 sampler.cpp:139] bvar is busy at sampling for 2 seconds!
```
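Note that the log shows the run going wrong well before the bvar warnings appear: the train loss climbs steadily and turns nan at batch 4300, and every batch after that is wasted work. As a minimal sketch (the model, data, and hyperparameters below are placeholders, not taken from this issue's training script), a training loop can clip gradients to curb divergence and abort as soon as the loss turns nan:

```python
import math

import paddle

# Minimal sketch, not the script from this issue: model, data, and
# hyperparameters are placeholders. Clip gradients to curb divergence,
# and stop as soon as the loss turns nan instead of training through
# thousands of nan batches as the log above does.
model = paddle.nn.Linear(10, 1)
clip = paddle.nn.ClipGradByGlobalNorm(clip_norm=1.0)
opt = paddle.optimizer.Adam(learning_rate=5e-4,
                            parameters=model.parameters(),
                            grad_clip=clip)

for step in range(100):
    x = paddle.randn([32, 10])
    y = paddle.randn([32, 1])
    loss = paddle.nn.functional.mse_loss(model(x), y)
    if math.isnan(float(loss)):
        raise RuntimeError(
            f"train loss became nan at step {step}; "
            "lower the learning rate or inspect the input batch")
    loss.backward()
    opt.step()
    opt.clear_grad()
```

Whether clipping, a lower learning rate, or a data check is the right fix here depends on the actual training script, which the issue does not include.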

Version & Environment Information

Paddle version: 2.1.3
Paddle With CUDA: True

OS: Ubuntu 20.04
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: N/A
CMake version: N/A
Libc version: glibc 2.31
Python version: 3.8.0

CUDA version: 11.7.64 (Build cuda_11.7.r11.7/compiler.31294372_0)
cuDNN version: N/A
Nvidia driver version: 535.54.03
Nvidia driver list: GPU 0: Quadro P5000

zhwesky2010 commented 1 year ago

@liangshu-code Hi, I looked through other related issues; the common experience there is that this is usually a problem with the operating system itself, most likely the machine running out of memory: https://github.com/PaddlePaddle/PaddleNLP/issues/2653
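If memory exhaustion is the suspect, one way to confirm it is to log available host RAM from inside the training loop and, after a run that ends with "Killed", check the kernel log (e.g. `dmesg | grep -i oom` or `journalctl -k`) for OOM-killer messages. A minimal Linux-only sketch (it reads /proc/meminfo directly; this is not a Paddle API):

```python
# Minimal Linux-only sketch: poll /proc/meminfo so an impending OOM kill
# (the "已杀死" / "Killed" message) shows up in the training log first.
def available_ram_gib() -> float:
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) / 2**20  # value is in kB
    raise RuntimeError("MemAvailable not found in /proc/meminfo")

# Example: call this every N batches and print it alongside the loss.
if __name__ == "__main__":
    print(f"available RAM: {available_ram_gib():.1f} GiB")
```

If available RAM falls steadily batch after batch, the usual culprits are tensors kept alive across batches (e.g. accumulating losses without detaching) or too many DataLoader worker processes.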

paddle-bot[bot] commented 1 month ago

Since you haven't replied for more than a year, we have closed this issue/PR. If the problem is not solved or there is a follow-up, please reopen it at any time and we will continue to follow up.