Closed LLsmile closed 2 years ago
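What exactly do you mean by GPU 0 being in an abnormal state?
Several threads are running on GPU 0 and the program simply hangs. Training works on one GPU, but with multiple GPUs it gets stuck in the state shown above.
The odd part is that it hangs with 6 GPUs, while 4 GPUs can train.
This looks like a hang. Could you share your training environment so we can track it down?
In the meantime, you can train with 4 GPUs.
Training environment:
cuda: 11.6
cudnn: cudnn-linux-x86_64-8.3.2.44_cuda11.5-archive.tar.xz
nccl: I have no sudo access on the server, so I cloned it from git yesterday and built it from source
All other Python packages were installed following the PaddleDetection instructions, mainly pyyaml + paddle + paddledetection, and the PaddleDetection test program passes. Single-GPU training dies after 120 iterations, and multi-GPU training does not start at all (including the 4-GPU run mentioned above).
Latest training command and error: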
python -m paddle.distributed.launch --gpus 0,1,2,3 tools/train.py -c configs/detr/detr_r50_1x_coco.yml --eval
----------- Configuration Arguments -----------
backend: auto
elastic_server: None
force: False
gpus: 0,1,2,3
heter_devices:
heter_worker_num: None
heter_workers:
host: None
http_port: None
ips: 127.0.0.1
job_id: None
log_dir: log
np: None
nproc_per_node: None
run_mode: None
scale: 0
server_num: None
servers:
training_script: tools/train.py
training_script_args: ['-c', 'configs/detr/detr_r50_1x_coco.yml', '--eval']
worker_num: None
workers:
------------------------------------------------
WARNING 2022-03-12 09:48:41,703 launch.py:422] Not found distinct arguments and compiled with cuda or xpu. Default use collective mode
launch train in GPU mode!
INFO 2022-03-12 09:48:41,704 launch_utils.py:525] Local start 4 processes. First process distributed environment info (Only For Debug):
+=======================================================================================+
| Distributed Envs Value |
+---------------------------------------------------------------------------------------+
| PADDLE_TRAINER_ID 0 |
| PADDLE_CURRENT_ENDPOINT 127.0.0.1:43025 |
| PADDLE_TRAINERS_NUM 4 |
| PADDLE_TRAINER_ENDPOINTS ... 0.1:56729,127.0.0.1:42147,127.0.0.1:57509|
| PADDLE_RANK_IN_NODE 0 |
| PADDLE_LOCAL_DEVICE_IDS 0 |
| PADDLE_WORLD_DEVICE_IDS 0,1,2,3 |
| FLAGS_selected_gpus 0 |
| FLAGS_selected_accelerators 0 |
+=======================================================================================+
INFO 2022-03-12 09:48:41,704 launch_utils.py:530] details abouts PADDLE_TRAINER_ENDPOINTS can be found in log/endpoints.log, and detail running logs maybe found in log/workerlog.0
launch proc_id:2324247 idx:0
launch proc_id:2324252 idx:1
launch proc_id:2324257 idx:2
launch proc_id:2324262 idx:3
/home/xyz/anaconda3/envs/paddle/lib/python3.8/site-packages/paddle/tensor/creation.py:130: DeprecationWarning: `np.object` is a deprecated alias for the builtin `object`. To silence this warning, use `object` by itself. Doing this will not modify any behavior and is safe.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
if data.dtype == np.object:
server not ready, wait 3 sec to retry...
not ready endpoints:['127.0.0.1:56729', '127.0.0.1:42147', '127.0.0.1:57509']
INFO 2022-03-12 09:48:44,738 launch_utils.py:320] terminate process group gid:2324247
INFO 2022-03-12 09:48:48,742 launch_utils.py:341] terminate all the procs
ERROR 2022-03-12 09:48:48,742 launch_utils.py:602] ABORT!!! Out of all 4 trainers, the trainer process with rank=[1, 2, 3] was aborted. Please check its log.
INFO 2022-03-12 09:48:52,746 launch_utils.py:341] terminate all the procs
INFO 2022-03-12 09:48:52,746 launch.py:311] Local processes completed.
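The log above lists endpoints that were "not ready" shortly before ranks 1, 2, and 3 were aborted. As a rough diagnostic, the following Python sketch checks whether anything is listening on a given `host:port`. It is not part of Paddle; the helper names `endpoint_ready` and `not_ready` are hypothetical, and a plain TCP connect is only an assumption about what "ready" means here. The actual abort reason still has to come from the worker logs the launcher points to.

```python
import socket


def endpoint_ready(endpoint, timeout=1.0):
    """Return True if a TCP connection to 'host:port' succeeds within timeout."""
    host, port = endpoint.rsplit(":", 1)
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


def not_ready(endpoints, timeout=1.0):
    """Return the subset of endpoints that do not accept a connection."""
    return [ep for ep in endpoints if not endpoint_ready(ep, timeout)]
```

For example, `not_ready(['127.0.0.1:56729', '127.0.0.1:42147', '127.0.0.1:57509'])` returns the endpoints that refuse a connection at the moment of the call.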
OOM log from single-GPU training after 120 iterations:
WARNING 2022-03-12 09:32:47,422 launch.py:422] Not found distinct arguments and compiled with cuda or xpu. Default use collective mode
launch train in GPU mode!
INFO 2022-03-12 09:32:47,422 launch_utils.py:525] Local start 1 processes. First process distributed environment info (Only For Debug):
+=======================================================================================+
| Distributed Envs Value |
+---------------------------------------------------------------------------------------+
| PADDLE_TRAINER_ID 0 |
| PADDLE_CURRENT_ENDPOINT 127.0.0.1:47835 |
| PADDLE_TRAINERS_NUM 1 |
| PADDLE_TRAINER_ENDPOINTS 127.0.0.1:47835 |
| PADDLE_RANK_IN_NODE 0 |
| PADDLE_LOCAL_DEVICE_IDS 0 |
| PADDLE_WORLD_DEVICE_IDS 0 |
| FLAGS_selected_gpus 0 |
| FLAGS_selected_accelerators 0 |
+=======================================================================================+
INFO 2022-03-12 09:32:47,422 launch_utils.py:530] details abouts PADDLE_TRAINER_ENDPOINTS can be found in log/endpoints.log, and detail running logs maybe found in log/workerlog.0
launch proc_id:2321426 idx:0
/home/xyz/anaconda3/envs/paddle/lib/python3.8/site-packages/paddle/tensor/creation.py:130: DeprecationWarning: `np.object` is a deprecated alias for the builtin `object`. To silence this warning, use `object` by itself. Doing this will not modify any behavior and is safe.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
if data.dtype == np.object:
loading annotations into memory...
Done (t=0.15s)
creating index...
index created!
W0312 09:32:50.000304 2321426 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 11.4, Runtime API Version: 11.1
W0312 09:32:50.002843 2321426 device_context.cc:465] device: 0, cuDNN Version: 8.2.
[03/12 09:32:52] ppdet.utils.checkpoint INFO: Finish loading model weights: /home/xyz/.cache/paddle/weights/ResNet50_vb_normal_pretrained.pdparams
[03/12 09:32:53] ppdet.engine INFO: Epoch: [0] [ 0/1680] learning_rate: 0.000100 loss_class: 0.582332 loss_bbox: 4.075541 loss_giou: 2.120999 loss_class_aux: 3.271304 loss_bbox_aux: 20.612900 loss_giou_aux: 10.542332 loss: 41.205406 eta: 8 days, 16:13:16 batch_cost: 0.8924 data_cost: 0.0001 ips: 2.2412 images/s
[03/12 09:32:58] ppdet.engine INFO: Epoch: [0] [ 20/1680] learning_rate: 0.000100 loss_class: 0.638008 loss_bbox: 2.514911 loss_giou: 2.325258 loss_class_aux: 3.177923 loss_bbox_aux: 12.664510 loss_giou_aux: 11.671243 loss: 33.304543 eta: 2 days, 21:31:16 batch_cost: 0.2682 data_cost: 0.0001 ips: 7.4562 images/s
[03/12 09:33:14] ppdet.engine INFO: Epoch: [0] [ 40/1680] learning_rate: 0.000100 loss_class: 0.639883 loss_bbox: 2.334148 loss_giou: 2.305621 loss_class_aux: 3.143493 loss_bbox_aux: 11.770285 loss_giou_aux: 11.707081 loss: 31.909885 eta: 5 days, 3:09:09 batch_cost: 0.7692 data_cost: 0.0001 ips: 2.6002 images/s
[03/12 09:33:19] ppdet.engine INFO: Epoch: [0] [ 60/1680] learning_rate: 0.000100 loss_class: 0.646600 loss_bbox: 1.935449 loss_giou: 2.062631 loss_class_aux: 2.831934 loss_bbox_aux: 9.859499 loss_giou_aux: 10.329891 loss: 27.717958 eta: 4 days, 6:43:06 batch_cost: 0.2607 data_cost: 0.0001 ips: 7.6706 images/s
[03/12 09:33:24] ppdet.engine INFO: Epoch: [0] [ 80/1680] learning_rate: 0.000100 loss_class: 0.599850 loss_bbox: 1.185453 loss_giou: 1.307388 loss_class_aux: 2.824816 loss_bbox_aux: 6.330995 loss_giou_aux: 7.080967 loss: 18.742527 eta: 3 days, 21:13:12 batch_cost: 0.2754 data_cost: 0.0001 ips: 7.2617 images/s
[03/12 09:33:30] ppdet.engine INFO: Epoch: [0] [ 100/1680] learning_rate: 0.000100 loss_class: 0.642262 loss_bbox: 0.906488 loss_giou: 0.888945 loss_class_aux: 3.464796 loss_bbox_aux: 4.355124 loss_giou_aux: 4.804581 loss: 15.335629 eta: 3 days, 15:12:05 batch_cost: 0.2693 data_cost: 0.0001 ips: 7.4260 images/s
[03/12 09:33:35] ppdet.engine INFO: Epoch: [0] [ 120/1680] learning_rate: 0.000100 loss_class: 0.665200 loss_bbox: 0.827738 loss_giou: 0.843559 loss_class_aux: 3.410576 loss_bbox_aux: 4.296345 loss_giou_aux: 4.450683 loss: 14.734106 eta: 3 days, 11:14:01 batch_cost: 0.2709 data_cost: 0.0001 ips: 7.3821 images/s
Traceback (most recent call last):
File "tools/train.py", line 177, in
Out of memory error on GPU 0. Cannot allocate 43.066650MB memory on GPU 0, 5.743042GB memory has been allocated and available memory is only 51.187500MB.
Please check whether there is any other process using GPU 0.
If no, please decrease the batch size of your model.
(at /paddle/paddle/fluid/memory/allocation/cuda_allocator.cc:79) . (at /paddle/paddle/fluid/imperative/tracer.cc:221)
INFO 2022-03-12 09:33:57,476 launch_utils.py:341] terminate all the procs
ERROR 2022-03-12 09:33:57,476 launch_utils.py:602] ABORT!!! Out of all 1 trainers, the trainer process with rank=[0] was aborted. Please check its log.
INFO 2022-03-12 09:34:01,480 launch_utils.py:341] terminate all the procs
INFO 2022-03-12 09:34:01,481 launch.py:311] Local processes completed.
After an out-of-memory error, kill the leftover processes so that none of them stays alive and keeps holding GPU memory:
ps aux | grep detr | awk '{print $2}' | xargs kill -9
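The same cleanup can be scripted. The following is a minimal Python sketch, not part of PaddleDetection. It assumes `ps aux`-style output with 11 columns where the PID is the second column and the command is the last; the helper names `pids_matching` and `kill_all` are hypothetical. Unlike the shell pipeline, it matches the keyword only against the command column and skips the `grep` process itself.

```python
import os
import signal


def pids_matching(ps_output, keyword):
    """Return PIDs from `ps aux`-style text whose command contains keyword."""
    pids = []
    for line in ps_output.splitlines():
        # ps aux prints 11 columns; the last one is the full command line.
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # header or malformed line
        command = fields[10]
        if keyword in command and not command.startswith("grep"):
            pids.append(int(fields[1]))
    return pids


def kill_all(pids):
    """Send SIGKILL to every PID, the equivalent of `kill -9`."""
    for pid in pids:
        os.kill(pid, signal.SIGKILL)
```

Reviewing the list returned by `pids_matching` before calling `kill_all` avoids killing an unrelated process whose command line happens to contain the keyword.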
For the training environment, please also tell us your PaddlePaddle and PaddleDetection versions.
PaddlePaddle version:
paddle version PaddlePaddle 2.2.2.post112, compiled with with_avx: ON with_gpu: ON with_mkl: ON with_mkldnn: ON with_python: ON
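PaddleDetection is v2.3.0.
Can you still train normally with 4 GPUs now?
I just tried it. Training works, but GPU 0 still has several extra threads attached.
What do you mean by several extra threads?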
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 3824253 C ...a3/envs/paddle/bin/python 6057MiB |
| 0 N/A N/A 3824258 C ...a3/envs/paddle/bin/python 393MiB |
| 0 N/A N/A 3824263 C ...a3/envs/paddle/bin/python 393MiB |
| 0 N/A N/A 3824268 C ...a3/envs/paddle/bin/python 393MiB |
| 1 N/A N/A 3824258 C ...a3/envs/paddle/bin/python 7723MiB |
| 2 N/A N/A 3824263 C ...a3/envs/paddle/bin/python 6463MiB |
| 3 N/A N/A 3824268 C ...a3/envs/paddle/bin/python 5815MiB |
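In the table above, PIDs 3824258, 3824263, and 3824268 each appear twice: once on their own GPU and once on GPU 0 with 393MiB. A small Python sketch can pick such processes out of the `nvidia-smi` Processes table automatically. It is an illustration only: the regular expression assumes rows shaped like the ones shown here (process names without spaces, memory in MiB), and the helper names `memory_by_pid` and `pids_on_multiple_gpus` are hypothetical.

```python
import re
from collections import defaultdict

# Matches process rows such as:
# | 0 N/A N/A 3824253 C ...a3/envs/paddle/bin/python 6057MiB |
ROW = re.compile(
    r"\|\s*(\d+)\s+\S+\s+\S+\s+(\d+)\s+\S+\s+\S+\s+(\d+)MiB\s*\|"
)


def memory_by_pid(smi_text):
    """Map each PID to {gpu_id: MiB} from the Processes table text."""
    usage = defaultdict(dict)
    for gpu, pid, mib in ROW.findall(smi_text):
        usage[int(pid)][int(gpu)] = int(mib)
    return dict(usage)


def pids_on_multiple_gpus(smi_text):
    """Return, in ascending order, the PIDs holding memory on more than one GPU."""
    usage = memory_by_pid(smi_text)
    return sorted(pid for pid, gpus in usage.items() if len(gpus) > 1)
```

The text can come from a saved copy of the `nvidia-smi` output; the pattern works whether the rows are on separate lines or run together on one line.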
Also, in single-GPU training, even when I specify gpu9, a small thread is still started on GPU 0 by default, probably also using around 393M.
Are you launching with the distributed launcher? If so, you can avoid this by setting CUDA_VISIBLE_DEVICES to the GPUs you want to use.
The GPU status above is the result of this command:
python -m paddle.distributed.launch --gpus 0,1,2,3 tools/train.py -c configs/deformable_detr/deformable_detr_r50_1x_coco.yml --eval
The small extra threads in distributed mode are most likely introduced by distributed.launch.
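A more precise description of the problem: no matter which GPU I use, an extra thread always shows up on GPU 0. This happens with a single GPU and with multiple GPUs.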
So is normal training not supposed to look like this?
See the reply above; try setting device visibility and check whether that helps.
So the extra small thread is expected. If you are not using GPU 0 and an extra small thread still appears on it, try restricting visibility with CUDA_VISIBLE_DEVICES.
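One way to apply the CUDA_VISIBLE_DEVICES suggestion from a script is sketched below in Python. It only builds the environment and the command line. The helper name `build_launch` is hypothetical, and the sketch assumes that `paddle.distributed.launch` uses all visible devices when `--gpus` is omitted, which should be verified against your Paddle version.

```python
import os
import sys


def build_launch(gpu_ids, config, base_env=None):
    """Return (env, argv) for a training run restricted to gpu_ids."""
    env = dict(os.environ if base_env is None else base_env)
    # Only the listed physical GPUs are visible to the child processes,
    # so none of them can create a context on any other GPU.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpu_ids)
    # --gpus is deliberately omitted (assumption: the launcher then
    # falls back to every visible device).
    argv = [
        sys.executable, "-m", "paddle.distributed.launch",
        "tools/train.py", "-c", config, "--eval",
    ]
    return env, argv
```

To start training on GPU 9 only, call `env, argv = build_launch([9], "configs/detr/detr_r50_1x_coco.yml")` and pass the result to `subprocess.run(argv, env=env, check=True)`.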
If this counts as normal, then there is no problem for now.
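When training the model, GPU 0 is in an abnormal state and training is extremely slow. I am not sure whether NCCL was installed incorrectly.
Training command:
python -m paddle.distributed.launch --gpus 0,1,2,3,4,5 tools/train.py -c configs/deformable_detr/deformable_detr_r50_1x_coco.yml --eval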
Log output:
----------- Configuration Arguments -----------
backend: auto
elastic_server: None
force: False
gpus: 0,1,2,3,4,5
heter_devices:
heter_worker_num: None
heter_workers:
host: None
http_port: None
ips: 127.0.0.1
job_id: None
log_dir: log
np: None
nproc_per_node: None
run_mode: None
scale: 0
server_num: None
servers:
training_script: tools/train.py
training_script_args: ['-c', 'configs/deformable_detr/deformable_detr_r50_1x_coco.yml', '--eval']
worker_num: None
workers:
WARNING 2022-03-11 10:08:07,354 launch.py:422] Not found distinct arguments and compiled with cuda or xpu. Default use collective mode
launch train in GPU mode!
INFO 2022-03-11 10:08:07,355 launch_utils.py:525] Local start 6 processes. First process distributed environment info (Only For Debug):
+=======================================================================================+
| Distributed Envs Value |
+---------------------------------------------------------------------------------------+
| PADDLE_TRAINER_ID 0 |
| PADDLE_CURRENT_ENDPOINT 127.0.0.1:52035 |
| PADDLE_TRAINERS_NUM 6 |
| PADDLE_TRAINER_ENDPOINTS ... 0.1:37911,127.0.0.1:56441,127.0.0.1:60701|
| PADDLE_RANK_IN_NODE 0 |
| PADDLE_LOCAL_DEVICE_IDS 0 |
| PADDLE_WORLD_DEVICE_IDS 0,1,2,3,4,5 |
| FLAGS_selected_gpus 0 |
| FLAGS_selected_accelerators 0 |
+=======================================================================================+
INFO 2022-03-11 10:08:07,355 launch_utils.py:530] details abouts PADDLE_TRAINER_ENDPOINTS can be found in log/endpoints.log, and detail running logs maybe found in log/workerlog.0
launch proc_id:3741729 idx:0
launch proc_id:3741734 idx:1
launch proc_id:3741740 idx:2
launch proc_id:3741745 idx:3
launch proc_id:3741750 idx:4
launch proc_id:3741755 idx:5
/home/liaolin/anaconda3/envs/paddle/lib/python3.8/site-packages/paddle/tensor/creation.py:130: DeprecationWarning: `np.object` is a deprecated alias for the builtin `object`. To silence this warning, use `object` by itself. Doing this will not modify any behavior and is safe.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
if data.dtype == np.object:
server not ready, wait 3 sec to retry...
not ready endpoints:['127.0.0.1:57999', '127.0.0.1:48623', '127.0.0.1:37911', '127.0.0.1:56441', '127.0.0.1:60701']
I0311 10:08:15.285431 3741729 nccl_context.cc:74] init nccl context nranks: 6 local rank: 0 gpu id: 0 ring id: 0
W0311 10:08:17.462484 3741729 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 11.6, Runtime API Version: 11.2
W0311 10:08:17.477650 3741729 device_context.cc:465] device: 0, cuDNN Version: 8.3.
loading annotations into memory...
Done (t=1.14s)
creating index...
index created!
[03/11 10:08:20] ppdet.data.source.coco WARNING: Found an invalid bbox in annotations: im_id: 5081, area: 15734.0 x1: 309, y1: 197, x2: 433.2765957446809, y2: 197.
[03/11 10:08:20] ppdet.data.source.coco WARNING: Found an invalid bbox in annotations: im_id: 5854, area: 22190.0 x1: 277, y1: 195, x2: 421, y2: 195.
[03/11 10:08:21] ppdet.data.source.coco WARNING: Found an invalid bbox in annotations: im_id: 18903, area: 21387.5 x1: 409, y1: 164, x2: 530, y2: 164.
[03/11 10:08:21] ppdet.data.source.coco WARNING: Found an invalid bbox in annotations: im_id: 19997, area: 22563.0 x1: 25, y1: 212, x2: 162, y2: 212.
[03/11 10:08:21] ppdet.data.source.coco WARNING: Found an invalid bbox in annotations: im_id: 20096, area: 41865.0 x1: 100, y1: 265, x2: 337, y2: 265.
[03/11 10:08:21] ppdet.data.source.coco WARNING: Found an invalid bbox in annotations: im_id: 23656, area: 23394.5 x1: 215, y1: 217, x2: 354, y2: 217.
[03/11 10:08:24] ppdet.utils.checkpoint INFO: Finish loading model weights: /home/liaolin/.cache/paddle/weights/ResNet50_vb_normal_pretrained.pdparams
[03/11 10:08:27] ppdet.engine INFO: Epoch: [0] [ 0/4377] learning_rate: 0.000200 loss_class: 0.940893 loss_bbox: 1.618034 loss_giou: 0.855736 loss_class_aux: 4.874199 loss_bbox_aux: 7.980593 loss_giou_aux: 4.278678 loss: 20.548134 eta: 8 days, 13:32:59 batch_cost: 3.3812 data_cost: 0.0002 ips: 0.2958 images/s
nvidia-smi output:
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 3741729 C ...a3/envs/paddle/bin/python 7221MiB |
| 0 N/A N/A 3741734 C ...a3/envs/paddle/bin/python 393MiB |
| 0 N/A N/A 3741740 C ...a3/envs/paddle/bin/python 393MiB |
| 0 N/A N/A 3741745 C ...a3/envs/paddle/bin/python 393MiB |
| 0 N/A N/A 3741750 C ...a3/envs/paddle/bin/python 393MiB |
| 0 N/A N/A 3741755 C ...a3/envs/paddle/bin/python 393MiB |
| 1 N/A N/A 3741734 C ...a3/envs/paddle/bin/python 9865MiB |
| 2 N/A N/A 3741740 C ...a3/envs/paddle/bin/python 7171MiB |
| 3 N/A N/A 3741745 C ...a3/envs/paddle/bin/python 8559MiB |
| 4 N/A N/A 3741750 C ...a3/envs/paddle/bin/python 8561MiB |
| 5 N/A N/A 3741755 C ...a3/envs/paddle/bin/python 9211MiB