Abnormal GPU memory state - Githubissues

LLsmile commented 2 years ago

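When I train the model, gpu0 is in an abnormal state and training is extremely slow. I am not sure whether NCCL was installed incorrectly. Training command: python -m paddle.distributed.launch --gpus 0,1,2,3,4,5 tools/train.py -c configs/deformable_detr/deformable_detr_r50_1x_coco.yml --eval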

Log output:

-----------  Configuration Arguments -----------
backend: auto
elastic_server: None
force: False
gpus: 0,1,2,3,4,5
heter_devices: 
heter_worker_num: None
heter_workers: 
host: None
http_port: None
ips: 127.0.0.1
job_id: None
log_dir: log
np: None
nproc_per_node: None
run_mode: None
scale: 0
server_num: None
servers: 
training_script: tools/train.py
training_script_args: ['-c', 'configs/deformable_detr/deformable_detr_r50_1x_coco.yml', '--eval']
worker_num: None
workers: 
------------------------------------------------
WARNING 2022-03-11 10:08:07,354 launch.py:422] Not found distinct arguments and compiled with cuda or xpu. Default use collective mode
launch train in GPU mode!
INFO 2022-03-11 10:08:07,355 launch_utils.py:525] Local start 6 processes. First process distributed environment info (Only For Debug): 
    +=======================================================================================+
    |                        Distributed Envs                      Value                    |
    +---------------------------------------------------------------------------------------+
    |                       PADDLE_TRAINER_ID                        0                      |
    |                 PADDLE_CURRENT_ENDPOINT                 127.0.0.1:52035               |
    |                     PADDLE_TRAINERS_NUM                        6                      |
    |                PADDLE_TRAINER_ENDPOINTS  ... 0.1:37911,127.0.0.1:56441,127.0.0.1:60701|
    |                     PADDLE_RANK_IN_NODE                        0                      |
    |                 PADDLE_LOCAL_DEVICE_IDS                        0                      |
    |                 PADDLE_WORLD_DEVICE_IDS                   0,1,2,3,4,5                 |
    |                     FLAGS_selected_gpus                        0                      |
    |             FLAGS_selected_accelerators                        0                      |
    +=======================================================================================+

INFO 2022-03-11 10:08:07,355 launch_utils.py:530] details abouts PADDLE_TRAINER_ENDPOINTS can be found in log/endpoints.log, and detail running logs maybe found in log/workerlog.0
launch proc_id:3741729 idx:0
launch proc_id:3741734 idx:1
launch proc_id:3741740 idx:2
launch proc_id:3741745 idx:3
launch proc_id:3741750 idx:4
launch proc_id:3741755 idx:5
/home/liaolin/anaconda3/envs/paddle/lib/python3.8/site-packages/paddle/tensor/creation.py:130: DeprecationWarning: `np.object` is a deprecated alias for the builtin `object`. To silence this warning, use `object` by itself. Doing this will not modify any behavior and is safe. 
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  if data.dtype == np.object:
server not ready, wait 3 sec to retry...
not ready endpoints:['127.0.0.1:57999', '127.0.0.1:48623', '127.0.0.1:37911', '127.0.0.1:56441', '127.0.0.1:60701']
I0311 10:08:15.285431 3741729 nccl_context.cc:74] init nccl context nranks: 6 local rank: 0 gpu id: 0 ring id: 0
W0311 10:08:17.462484 3741729 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 11.6, Runtime API Version: 11.2
W0311 10:08:17.477650 3741729 device_context.cc:465] device: 0, cuDNN Version: 8.3.
loading annotations into memory...
Done (t=1.14s)
creating index...
index created!
[03/11 10:08:20] ppdet.data.source.coco WARNING: Found an invalid bbox in annotations: im_id: 5081, area: 15734.0 x1: 309, y1: 197, x2: 433.2765957446809, y2: 197.
[03/11 10:08:20] ppdet.data.source.coco WARNING: Found an invalid bbox in annotations: im_id: 5854, area: 22190.0 x1: 277, y1: 195, x2: 421, y2: 195.
[03/11 10:08:21] ppdet.data.source.coco WARNING: Found an invalid bbox in annotations: im_id: 18903, area: 21387.5 x1: 409, y1: 164, x2: 530, y2: 164.
[03/11 10:08:21] ppdet.data.source.coco WARNING: Found an invalid bbox in annotations: im_id: 19997, area: 22563.0 x1: 25, y1: 212, x2: 162, y2: 212.
[03/11 10:08:21] ppdet.data.source.coco WARNING: Found an invalid bbox in annotations: im_id: 20096, area: 41865.0 x1: 100, y1: 265, x2: 337, y2: 265.
[03/11 10:08:21] ppdet.data.source.coco WARNING: Found an invalid bbox in annotations: im_id: 23656, area: 23394.5 x1: 215, y1: 217, x2: 354, y2: 217.
[03/11 10:08:24] ppdet.utils.checkpoint INFO: Finish loading model weights: /home/liaolin/.cache/paddle/weights/ResNet50_vb_normal_pretrained.pdparams
[03/11 10:08:27] ppdet.engine INFO: Epoch: [0] [ 0/4377] learning_rate: 0.000200 loss_class: 0.940893 loss_bbox: 1.618034 loss_giou: 0.855736 loss_class_aux: 4.874199 loss_bbox_aux: 7.980593 loss_giou_aux: 4.278678 loss: 20.548134 eta: 8 days, 13:32:59 batch_cost: 3.3812 data_cost: 0.0002 ips: 0.2958 images/s

nvidia-smi output:

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   3741729      C   ...a3/envs/paddle/bin/python     7221MiB |
|    0   N/A  N/A   3741734      C   ...a3/envs/paddle/bin/python      393MiB |
|    0   N/A  N/A   3741740      C   ...a3/envs/paddle/bin/python      393MiB |
|    0   N/A  N/A   3741745      C   ...a3/envs/paddle/bin/python      393MiB |
|    0   N/A  N/A   3741750      C   ...a3/envs/paddle/bin/python      393MiB |
|    0   N/A  N/A   3741755      C   ...a3/envs/paddle/bin/python      393MiB |
|    1   N/A  N/A   3741734      C   ...a3/envs/paddle/bin/python     9865MiB |
|    2   N/A  N/A   3741740      C   ...a3/envs/paddle/bin/python     7171MiB |
|    3   N/A  N/A   3741745      C   ...a3/envs/paddle/bin/python     8559MiB |
|    4   N/A  N/A   3741750      C   ...a3/envs/paddle/bin/python     8561MiB |
|    5   N/A  N/A   3741755      C   ...a3/envs/paddle/bin/python     9211MiB |
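The pattern in the nvidia-smi listing (each worker PID showing up once on its own card and once more on GPU 0) can be checked mechanically. The following is only a rough sketch: the function names `gpus_per_pid` and `multi_gpu_pids` are made up for illustration, and the regular expression assumes the row layout shown in this thread, with process names that contain no spaces.

```python
import re
from collections import defaultdict

# Matches one row of the nvidia-smi "Processes" table as pasted in this thread:
# | GPU  GI  CI  PID  Type  Process name  <N>MiB |
ROW = re.compile(
    r"^\|\s*(\d+)\s+\S+\s+\S+\s+(\d+)\s+\S+\s+(\S+)\s+(\d+)MiB\s*\|?\s*$"
)


def gpus_per_pid(table_text):
    """Map each PID to a {gpu_id: memory_in_MiB} dict."""
    usage = defaultdict(dict)
    for line in table_text.splitlines():
        match = ROW.match(line.strip())
        if match:
            gpu, pid, _name, mem = match.groups()
            usage[int(pid)][int(gpu)] = int(mem)
    return dict(usage)


def multi_gpu_pids(table_text):
    """Keep only the PIDs that hold memory on more than one GPU."""
    return {
        pid: gpus
        for pid, gpus in gpus_per_pid(table_text).items()
        if len(gpus) > 1
    }


# Three rows copied from the listing above.
sample = """
|    0   N/A  N/A   3741729      C   ...a3/envs/paddle/bin/python     7221MiB |
|    0   N/A  N/A   3741734      C   ...a3/envs/paddle/bin/python      393MiB |
|    1   N/A  N/A   3741734      C   ...a3/envs/paddle/bin/python     9865MiB |
"""

for pid, gpus in multi_gpu_pids(sample).items():
    print(pid, gpus)  # → 3741734 {0: 393, 1: 9865}
```

Applied to the full table, this would list every worker other than rank 0 as also holding roughly 393MiB on GPU 0.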

wangxinxin08 commented 2 years ago

What exactly do you mean by GPU0 being in an abnormal state?

LLsmile commented 2 years ago

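There are multiple threads running on gpu0, and the program hangs outright. Training works on a single card, but with multiple cards it gets stuck in the state shown above.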

LLsmile commented 2 years ago

What is also strange is that it hangs with 6 cards, but trains with 4 cards.

wangxinxin08 commented 2 years ago

This looks like a hang. Could you share your training environment so that we can track it down?

wangxinxin08 commented 2 years ago

For now, you can train with 4 cards.

LLsmile commented 2 years ago

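Training environment: cuda: 11.6 cudnn: cudnn-linux-x86_64-8.3.2.44_cuda11.5-archive.tar.xz nccl: because I have no sudo permission on the server, I cloned it from git yesterday and compiled it myself. The other Python packages were installed following the PaddleDetection instructions, mainly pyyaml + paddle + paddledetection, and the PaddleDetection test program passes normally. Single-card training dies after 120 iterations, and multi-card training will not start at all (including the 4-card setup mentioned above).

Latest training command and error output: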

python -m paddle.distributed.launch --gpus 0,1,2,3 tools/train.py -c configs/detr/detr_r50_1x_coco.yml --eval
-----------  Configuration Arguments -----------
backend: auto
elastic_server: None
force: False
gpus: 0,1,2,3
heter_devices: 
heter_worker_num: None
heter_workers: 
host: None
http_port: None
ips: 127.0.0.1
job_id: None
log_dir: log
np: None
nproc_per_node: None
run_mode: None
scale: 0
server_num: None
servers: 
training_script: tools/train.py
training_script_args: ['-c', 'configs/detr/detr_r50_1x_coco.yml', '--eval']
worker_num: None
workers: 
------------------------------------------------
WARNING 2022-03-12 09:48:41,703 launch.py:422] Not found distinct arguments and compiled with cuda or xpu. Default use collective mode
launch train in GPU mode!
INFO 2022-03-12 09:48:41,704 launch_utils.py:525] Local start 4 processes. First process distributed environment info (Only For Debug): 
    +=======================================================================================+
    |                        Distributed Envs                      Value                    |
    +---------------------------------------------------------------------------------------+
    |                       PADDLE_TRAINER_ID                        0                      |
    |                 PADDLE_CURRENT_ENDPOINT                 127.0.0.1:43025               |
    |                     PADDLE_TRAINERS_NUM                        4                      |
    |                PADDLE_TRAINER_ENDPOINTS  ... 0.1:56729,127.0.0.1:42147,127.0.0.1:57509|
    |                     PADDLE_RANK_IN_NODE                        0                      |
    |                 PADDLE_LOCAL_DEVICE_IDS                        0                      |
    |                 PADDLE_WORLD_DEVICE_IDS                     0,1,2,3                   |
    |                     FLAGS_selected_gpus                        0                      |
    |             FLAGS_selected_accelerators                        0                      |
    +=======================================================================================+

INFO 2022-03-12 09:48:41,704 launch_utils.py:530] details abouts PADDLE_TRAINER_ENDPOINTS can be found in log/endpoints.log, and detail running logs maybe found in log/workerlog.0
launch proc_id:2324247 idx:0
launch proc_id:2324252 idx:1
launch proc_id:2324257 idx:2
launch proc_id:2324262 idx:3
/home/xyz/anaconda3/envs/paddle/lib/python3.8/site-packages/paddle/tensor/creation.py:130: DeprecationWarning: `np.object` is a deprecated alias for the builtin `object`. To silence this warning, use `object` by itself. Doing this will not modify any behavior and is safe. 
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  if data.dtype == np.object:
server not ready, wait 3 sec to retry...
not ready endpoints:['127.0.0.1:56729', '127.0.0.1:42147', '127.0.0.1:57509']
INFO 2022-03-12 09:48:44,738 launch_utils.py:320] terminate process group gid:2324247
INFO 2022-03-12 09:48:48,742 launch_utils.py:341] terminate all the procs
ERROR 2022-03-12 09:48:48,742 launch_utils.py:602] ABORT!!! Out of all 4 trainers, the trainer process with rank=[1, 2, 3] was aborted. Please check its log.
INFO 2022-03-12 09:48:52,746 launch_utils.py:341] terminate all the procs
INFO 2022-03-12 09:48:52,746 launch.py:311] Local processes completed.
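The launcher output above only says that ranks 1, 2 and 3 were aborted and to check their logs. A minimal sketch for pulling the tail of each per-rank log follows. It assumes the other ranks write to `log/workerlog.<rank>`, by analogy with the `log/workerlog.0` path that the launcher prints; the helper name `tail_worker_logs` is hypothetical.

```python
from pathlib import Path


def tail_worker_logs(log_dir="log", ranks=(1, 2, 3), lines=20):
    """Return {rank: last `lines` lines} for every workerlog.<rank> that exists."""
    tails = {}
    for rank in ranks:
        # Assumed naming scheme: log/workerlog.1, log/workerlog.2, ...
        path = Path(log_dir) / f"workerlog.{rank}"
        if path.is_file():
            text = path.read_text(errors="replace")
            tails[rank] = text.splitlines()[-lines:]
    return tails


# Print whatever per-rank logs are present in ./log (prints nothing if none exist).
for rank, tail in tail_worker_logs().items():
    print(f"===== rank {rank} =====")
    print("\n".join(tail))
```

The last lines of each aborted rank's log usually carry the actual exception, which the launcher summary does not show.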

LLsmile commented 2 years ago

OOM log after 120 iterations on a single card:

python -m paddle.distributed.launch --gpus 0 tools/train.py -c configs/detr/detr_r50_1x_coco.yml --eval
-----------  Configuration Arguments -----------
backend: auto
elastic_server: None
force: False
gpus: 0
heter_devices: 
heter_worker_num: None
heter_workers: 
host: None
http_port: None
ips: 127.0.0.1
job_id: None
log_dir: log
np: None
nproc_per_node: None
run_mode: None
scale: 0
server_num: None
servers: 
training_script: tools/train.py
training_script_args: ['-c', 'configs/detr/detr_r50_1x_coco.yml', '--eval']
worker_num: None
workers: 
------------------------------------------------
WARNING 2022-03-12 09:32:47,422 launch.py:422] Not found distinct arguments and compiled with cuda or xpu. Default use collective mode
launch train in GPU mode!
INFO 2022-03-12 09:32:47,422 launch_utils.py:525] Local start 1 processes. First process distributed environment info (Only For Debug): 
    +=======================================================================================+
    |                        Distributed Envs                      Value                    |
    +---------------------------------------------------------------------------------------+
    |                       PADDLE_TRAINER_ID                        0                      |
    |                 PADDLE_CURRENT_ENDPOINT                 127.0.0.1:47835               |
    |                     PADDLE_TRAINERS_NUM                        1                      |
    |                PADDLE_TRAINER_ENDPOINTS                 127.0.0.1:47835               |
    |                     PADDLE_RANK_IN_NODE                        0                      |
    |                 PADDLE_LOCAL_DEVICE_IDS                        0                      |
    |                 PADDLE_WORLD_DEVICE_IDS                        0                      |
    |                     FLAGS_selected_gpus                        0                      |
    |             FLAGS_selected_accelerators                        0                      |
    +=======================================================================================+

INFO 2022-03-12 09:32:47,422 launch_utils.py:530] details abouts PADDLE_TRAINER_ENDPOINTS can be found in log/endpoints.log, and detail running logs maybe found in log/workerlog.0
launch proc_id:2321426 idx:0
/home/xyz/anaconda3/envs/paddle/lib/python3.8/site-packages/paddle/tensor/creation.py:130: DeprecationWarning: `np.object` is a deprecated alias for the builtin `object`. To silence this warning, use `object` by itself. Doing this will not modify any behavior and is safe. 
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  if data.dtype == np.object:
loading annotations into memory...
Done (t=0.15s)
creating index...
index created!
W0312 09:32:50.000304 2321426 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 11.4, Runtime API Version: 11.1
W0312 09:32:50.002843 2321426 device_context.cc:465] device: 0, cuDNN Version: 8.2.
[03/12 09:32:52] ppdet.utils.checkpoint INFO: Finish loading model weights: /home/xyz/.cache/paddle/weights/ResNet50_vb_normal_pretrained.pdparams
[03/12 09:32:53] ppdet.engine INFO: Epoch: [0] [ 0/1680] learning_rate: 0.000100 loss_class: 0.582332 loss_bbox: 4.075541 loss_giou: 2.120999 loss_class_aux: 3.271304 loss_bbox_aux: 20.612900 loss_giou_aux: 10.542332 loss: 41.205406 eta: 8 days, 16:13:16 batch_cost: 0.8924 data_cost: 0.0001 ips: 2.2412 images/s
[03/12 09:32:58] ppdet.engine INFO: Epoch: [0] [ 20/1680] learning_rate: 0.000100 loss_class: 0.638008 loss_bbox: 2.514911 loss_giou: 2.325258 loss_class_aux: 3.177923 loss_bbox_aux: 12.664510 loss_giou_aux: 11.671243 loss: 33.304543 eta: 2 days, 21:31:16 batch_cost: 0.2682 data_cost: 0.0001 ips: 7.4562 images/s
[03/12 09:33:14] ppdet.engine INFO: Epoch: [0] [ 40/1680] learning_rate: 0.000100 loss_class: 0.639883 loss_bbox: 2.334148 loss_giou: 2.305621 loss_class_aux: 3.143493 loss_bbox_aux: 11.770285 loss_giou_aux: 11.707081 loss: 31.909885 eta: 5 days, 3:09:09 batch_cost: 0.7692 data_cost: 0.0001 ips: 2.6002 images/s
[03/12 09:33:19] ppdet.engine INFO: Epoch: [0] [ 60/1680] learning_rate: 0.000100 loss_class: 0.646600 loss_bbox: 1.935449 loss_giou: 2.062631 loss_class_aux: 2.831934 loss_bbox_aux: 9.859499 loss_giou_aux: 10.329891 loss: 27.717958 eta: 4 days, 6:43:06 batch_cost: 0.2607 data_cost: 0.0001 ips: 7.6706 images/s
[03/12 09:33:24] ppdet.engine INFO: Epoch: [0] [ 80/1680] learning_rate: 0.000100 loss_class: 0.599850 loss_bbox: 1.185453 loss_giou: 1.307388 loss_class_aux: 2.824816 loss_bbox_aux: 6.330995 loss_giou_aux: 7.080967 loss: 18.742527 eta: 3 days, 21:13:12 batch_cost: 0.2754 data_cost: 0.0001 ips: 7.2617 images/s
[03/12 09:33:30] ppdet.engine INFO: Epoch: [0] [ 100/1680] learning_rate: 0.000100 loss_class: 0.642262 loss_bbox: 0.906488 loss_giou: 0.888945 loss_class_aux: 3.464796 loss_bbox_aux: 4.355124 loss_giou_aux: 4.804581 loss: 15.335629 eta: 3 days, 15:12:05 batch_cost: 0.2693 data_cost: 0.0001 ips: 7.4260 images/s
[03/12 09:33:35] ppdet.engine INFO: Epoch: [0] [ 120/1680] learning_rate: 0.000100 loss_class: 0.665200 loss_bbox: 0.827738 loss_giou: 0.843559 loss_class_aux: 3.410576 loss_bbox_aux: 4.296345 loss_giou_aux: 4.450683 loss: 14.734106 eta: 3 days, 11:14:01 batch_cost: 0.2709 data_cost: 0.0001 ips: 7.3821 images/s
Traceback (most recent call last):
  File "tools/train.py", line 177, in <module>
    main()
  File "tools/train.py", line 173, in main
    run(FLAGS, cfg)
  File "tools/train.py", line 127, in run
    trainer.train(FLAGS.eval)
  File "/home/xyz/workspace/06_study_materials/02_model/01_detection/PaddleDetection/ppdet/engine/trainer.py", line 401, in train
    outputs = model(data)
  File "/home/xyz/anaconda3/envs/paddle/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 917, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/home/xyz/anaconda3/envs/paddle/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 907, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/home/xyz/workspace/06_study_materials/02_model/01_detection/PaddleDetection/ppdet/modeling/architectures/meta_arch.py", line 54, in forward
    out = self.get_loss()
  File "/home/xyz/workspace/06_study_materials/02_model/01_detection/PaddleDetection/ppdet/modeling/architectures/detr.py", line 80, in get_loss
    losses = self._forward()
  File "/home/xyz/workspace/06_study_materials/02_model/01_detection/PaddleDetection/ppdet/modeling/architectures/detr.py", line 68, in _forward
    out_transformer = self.transformer(body_feats, self.inputs['pad_mask'])
  File "/home/xyz/anaconda3/envs/paddle/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 917, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/home/xyz/anaconda3/envs/paddle/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 907, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/home/xyz/workspace/06_study_materials/02_model/01_detection/PaddleDetection/ppdet/modeling/transformers/detr_transformer.py", line 339, in forward
    memory = self.encoder(
  File "/home/xyz/anaconda3/envs/paddle/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 917, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/home/xyz/anaconda3/envs/paddle/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 907, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/home/xyz/workspace/06_study_materials/02_model/01_detection/PaddleDetection/ppdet/modeling/transformers/detr_transformer.py", line 106, in forward
    output = layer(output, src_mask=src_mask, pos_embed=pos_embed)
  File "/home/xyz/anaconda3/envs/paddle/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 917, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/home/xyz/anaconda3/envs/paddle/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 907, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/home/xyz/workspace/06_study_materials/02_model/01_detection/PaddleDetection/ppdet/modeling/transformers/detr_transformer.py", line 78, in forward
    src = self.self_attn(q, k, value=src, attn_mask=src_mask)
  File "/home/xyz/anaconda3/envs/paddle/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 917, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/home/xyz/anaconda3/envs/paddle/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 907, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/home/xyz/workspace/06_study_materials/02_model/01_detection/PaddleDetection/ppdet/modeling/layers.py", line 1372, in forward
    weights = F.dropout(
  File "/home/xyz/anaconda3/envs/paddle/lib/python3.8/site-packages/paddle/nn/functional/common.py", line 893, in dropout
    out, mask = _C_ops.dropout(
SystemError: (Fatal) Operator dropout raises an paddle::memory::allocation::BadAlloc exception.
The exception content is :ResourceExhaustedError:

Out of memory error on GPU 0. Cannot allocate 43.066650MB memory on GPU 0, 5.743042GB memory has been allocated and available memory is only 51.187500MB.

Please check whether there is any other process using GPU 0.

If yes, please stop them, or start PaddlePaddle on another GPU.
If no, please decrease the batch size of your model.

(at /paddle/paddle/fluid/memory/allocation/cuda_allocator.cc:79) . (at /paddle/paddle/fluid/imperative/tracer.cc:221)

INFO 2022-03-12 09:33:57,476 launch_utils.py:341] terminate all the procs
ERROR 2022-03-12 09:33:57,476 launch_utils.py:602] ABORT!!! Out of all 1 trainers, the trainer process with rank=[0] was aborted. Please check its log.
INFO 2022-03-12 09:34:01,480 launch_utils.py:341] terminate all the procs
INFO 2022-03-12 09:34:01,481 launch.py:311] Local processes completed.

wangxinxin08 commented 2 years ago

After an out-of-memory error, you should kill the leftover processes with `ps aux | grep detr | awk '{print $2}' | xargs kill -9`, in case they did not exit after the OOM.
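The same cleanup can be sketched in Python with a dry-run step, so that the matched PIDs can be reviewed before anything is killed. This is an illustrative sketch, not part of PaddleDetection: `matching_pids` and `kill_matching` are hypothetical helper names, and the pattern `detr` is simply the one used in the command above.

```python
import os
import signal
import subprocess


def matching_pids(ps_output, pattern, exclude=()):
    """Parse `ps -eo pid,args` output; return PIDs whose command line contains pattern."""
    pids = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        parts = line.strip().split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        pid, args = int(parts[0]), parts[1]
        if pattern in args and pid not in exclude:
            pids.append(pid)
    return pids


def kill_matching(pattern, dry_run=True):
    """Return the matching PIDs; send SIGKILL to them only when dry_run is False."""
    out = subprocess.run(
        ["ps", "-eo", "pid,args"], capture_output=True, text=True, check=True
    ).stdout
    # Exclude this script itself in case its own command line contains the pattern.
    pids = matching_pids(out, pattern, exclude={os.getpid()})
    if not dry_run:
        for pid in pids:
            os.kill(pid, signal.SIGKILL)
    return pids


# Parsing demo on canned text; no real process is touched here.
sample_ps = """  PID COMMAND
  101 python tools/train.py -c configs/detr/detr_r50_1x_coco.yml --eval
  102 bash
  103 grep detr
"""
print(matching_pids(sample_ps, "detr", exclude={103}))  # → [101]
```

Calling `kill_matching("detr")` first and inspecting the returned list, then repeating with `dry_run=False`, avoids killing an unrelated process whose command line happens to contain the pattern.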

wangxinxin08 commented 2 years ago

For the training environment, please tell us your PaddlePaddle and PaddleDetection versions.

LLsmile commented 2 years ago

PaddlePaddle version: paddle version PaddlePaddle 2.2.2.post112, compiled with with_avx: ON with_gpu: ON with_mkl: ON with_mkldnn: ON with_python: ON. PaddleDetection is v2.3.0.

wangxinxin08 commented 2 years ago

Can you still train normally with 4 cards now?

LLsmile commented 2 years ago

I just tried it. Training works, but there are still multiple threads attached to gpu0.

wangxinxin08 commented 2 years ago

What do you mean by multiple threads being attached?

LLsmile commented 2 years ago

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   3824253      C   ...a3/envs/paddle/bin/python     6057MiB |
|    0   N/A  N/A   3824258      C   ...a3/envs/paddle/bin/python      393MiB |
|    0   N/A  N/A   3824263      C   ...a3/envs/paddle/bin/python      393MiB |
|    0   N/A  N/A   3824268      C   ...a3/envs/paddle/bin/python      393MiB |
|    1   N/A  N/A   3824258      C   ...a3/envs/paddle/bin/python     7723MiB |
|    2   N/A  N/A   3824263      C   ...a3/envs/paddle/bin/python     6463MiB |
|    3   N/A  N/A   3824268      C   ...a3/envs/paddle/bin/python     5815MiB |

LLsmile commented 2 years ago

Also, in single-card training, even when I specify gpu9, a small thread is still started on gpu0 by default, probably around 393M as well.

wangxinxin08 commented 2 years ago

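> Also, in single-card training, even when I specify gpu9, a small thread is still started on gpu0 by default, probably around 393M as well.

Are you using the distributed launch method here? If so, you can avoid this by setting CUDA_VISIBLE_DEVICES to the cards you want to use.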
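A minimal sketch of that suggestion, for the gpu9 case described above: it builds the launch command and an environment in which only the chosen cards are visible. The helper `launch_spec` is hypothetical, and `--gpus` is deliberately left out on the assumption that the launcher then uses every visible device; that assumption should be checked against the installed Paddle version.

```python
import os
import sys


def launch_spec(gpu_ids, config, base_env=None):
    """Build (command, environment) for a run restricted to the given GPU ids."""
    env = dict(os.environ if base_env is None else base_env)
    # Only the listed cards are visible to the process, so no CUDA context
    # can be created on gpu0 unless gpu0 is in the list.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpu_ids)
    cmd = [
        sys.executable, "-m", "paddle.distributed.launch",
        "tools/train.py", "-c", config, "--eval",
    ]
    return cmd, env


cmd, env = launch_spec([9], "configs/detr/detr_r50_1x_coco.yml", base_env={})
print("CUDA_VISIBLE_DEVICES=" + env["CUDA_VISIBLE_DEVICES"], " ".join(cmd[1:]))
# To actually start training:
#   import subprocess
#   subprocess.run(cmd, env=env, check=True)
```

Setting the variable in the shell before the usual command has the same effect; the Python form is only convenient when the run is started from another script.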

LLsmile commented 2 years ago

The GPU state shown above is the result of this command: python -m paddle.distributed.launch --gpus 0,1,2,3 tools/train.py -c configs/deformable_detr/deformable_detr_r50_1x_coco.yml --eval

wangxinxin08 commented 2 years ago

> The GPU state shown above is the result of this command: python -m paddle.distributed.launch --gpus 0,1,2,3 tools/train.py -c configs/deformable_detr/deformable_detr_r50_1x_coco.yml --eval

The small thread in the distributed case is probably introduced by distributed.launch.

LLsmile commented 2 years ago

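> > Also, in single-card training, even when I specify gpu9, a small thread is still started on gpu0 by default, probably around 393M as well.

> Are you using the distributed launch method here? If so, you can avoid this by setting CUDA_VISIBLE_DEVICES to the cards you want to use.

A more precise description of the problem: no matter which card I use, one extra thread shows up on gpu0. This happens with a single card and with multiple cards alike.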

LLsmile commented 2 years ago

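> > The GPU state shown above is the result of this command: python -m paddle.distributed.launch --gpus 0,1,2,3 tools/train.py -c configs/deformable_detr/deformable_detr_r50_1x_coco.yml --eval

> The small thread in the distributed case is probably introduced by distributed.launch.

So isn't this how normal training looks?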

wangxinxin08 commented 2 years ago

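> > > Also, in single-card training, even when I specify gpu9, a small thread is still started on gpu0 by default, probably around 393M as well.

> > Are you using the distributed launch method here? If so, you can avoid this by setting CUDA_VISIBLE_DEVICES to the cards you want to use.

> A more precise description of the problem: no matter which card I use, one extra thread shows up on gpu0. This happens with a single card and with multiple cards alike.

See the reply above: try setting device visibility and check the result.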

wangxinxin08 commented 2 years ago

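> > > The GPU state shown above is the result of this command: python -m paddle.distributed.launch --gpus 0,1,2,3 tools/train.py -c configs/deformable_detr/deformable_detr_r50_1x_coco.yml --eval

> > The small thread in the distributed case is probably introduced by distributed.launch.

> So isn't this how normal training looks?

So one extra small thread is normal. If you are not using gpu0 and an extra small thread still appears on gpu0, try setting device visibility with CUDA_VISIBLE_DEVICES.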

LLsmile commented 2 years ago

If this counts as normal, then there is no problem for now.

PaddlePaddle / PaddleDetection

Abnormal GPU memory state #5353