PaddlePaddle / Paddle

PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the core framework of 『飞桨』/PaddlePaddle: high-performance single-machine and distributed training and cross-platform deployment for deep learning & machine learning)
http://www.paddlepaddle.org/
Apache License 2.0

Paddle single-machine single-GPU training: the specified card does not take effect, GPU memory is always allocated on card 0 #37988

Closed: kismit closed this issue 2 years ago

kismit commented 2 years ago

To get your issue resolved quickly, please first search for similar problems before opening an issue: [search issue keywords] [filter by labels] [official docs].

If you did not find a similar issue, please provide the following details when opening one so that it can be resolved quickly:

4) System environment: please describe the OS type and version (e.g. Mac OS 10.14) and the Python version.
Ubuntu 16.04
Python 3.6.13

Note: you can obtain the above information by running summary_env.py.

Out of memory error on GPU 0. Cannot allocate 28.381592MB memory on GPU 0, 10.757812GB memory has been allocated and available memory is only 3.437500MB.

Please check whether there is any other process using GPU 0.

  1. If yes, please stop them, or start PaddlePaddle on another GPU.
  2. If no, please decrease the batch size of your model.

    (at /paddle/paddle/fluid/memory/allocation/cuda_allocator.cc:79) . (at /paddle/paddle/fluid/imperative/tracer.cc:221)

INFO 2021-12-09 01:36:32,293 launch_utils.py:340] terminate all the procs
ERROR 2021-12-09 01:36:32,293 launch_utils.py:603] ABORT!!! Out of all 1 trainers, the trainer process with rank=[0] was aborted. Please check its log.
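For readers of this issue: the report is that the specified card does not take effect and the job keeps allocating on card 0. Below is a minimal sketch of the two usual ways to pin a Paddle 2.x dygraph job to one physical card; the actual entry point used here is not shown in the thread, so treat the script and names as placeholders.

```python
# Minimal sketch (placeholder script, not the reporter's ptq.py): pin a dygraph job to physical GPU 1.
import os

# Option 1: mask all other cards *before* Paddle initializes CUDA.
# After masking, the only visible card is addressed as logical index 0.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import paddle

# Option 2: select the logical device explicitly (index is relative to the visible cards).
paddle.device.set_device("gpu:0")

print(paddle.device.get_device())  # expect "gpu:0", which is physical card 1 here
```

Since the log above comes from launch_utils.py, the job appears to be started through paddle.distributed.launch; in that case the card is normally chosen with its --gpus argument rather than inside the script.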



- Reproduction info: if this is an error, please give the environment and the steps to reproduce it
- Problem description: please describe your problem in detail, and include the error message, logs, and a reproducible code snippet

Thank you for contributing to PaddlePaddle.
Before submitting the issue, please search existing GitHub issues in case a similar issue was submitted or resolved before.
If there is no solution, please make sure that this is a training issue and include the following details:
**System information**
- PaddlePaddle version (e.g. 1.1) or CommitID
- CPU: including MKL/OpenBLAS/MKLDNN version
- GPU: including CUDA/cuDNN version
- OS platform (e.g. Mac OS 10.14)
- Other information: distributed training / operator information / graphics card memory
Note: You can get most of the information by running [summary_env.py](https://github.com/PaddlePaddle/Paddle/blob/develop/tools/summary_env.py). 
**To Reproduce**
Steps to reproduce the behavior
**Describe your current behavior**
**Code to reproduce the issue**
**Other info / logs**
paddle-bot-old[bot] commented 2 years ago

Hi! We have received your issue; please be patient while it is handled. Technicians will be arranged to answer your question as soon as possible. Please double-check that you have provided a clear problem description, reproduction code, environment & version, and error messages. You may also look for an answer in the official API docs, FAQ, issue history, and the AI community. Have a nice day!

kismit commented 2 years ago

W1209 01:48:00.638737 62789 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 11.2, Runtime API Version: 10.1
W1209 01:48:00.640688 62789 device_context.cc:465] device: 0, cuDNN Version: 8.0.
activation_quantizer KLQuantizer
2021-12-09 01:48:02,675-INFO: The layers to be fused:
src_fullname: /data/home/xuchunguang/work/AI/components/NLP/common/pptrtrans/corpus/WMT14ende/WMT14.en-de/wmt14_ende_data_bpe/train.tok.clean.bpe.33708.en
src_fullname: /data/home/xuchunguang/work/AI/components/NLP/common/pptrtrans/corpus/WMT14ende/WMT14.en-de/wmt14_ende_data_bpe/newstest2013.tok.bpe.33708.en
Traceback (most recent call last):
  File "ptq.py", line 158, in <module>
    main(args)
  File "ptq.py", line 133, in main
    quant_model(src_word)
  File "/data/home/xuchunguang/.conda/envs/paddle2/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 914, in __call__
    outputs = self.forward(*inputs, **kwargs)
  File "/data/home/xuchunguang/.conda/envs/paddle2/lib/python3.6/site-packages/paddlenlp/transformers/transformer/modeling.py", line 1009, in forward
    trg_length=trg_length)
  File "/data/home/xuchunguang/.conda/envs/paddle2/lib/python3.6/site-packages/paddle/fluid/layers/rnn.py", line 1668, in dynamic_decode
    is_test, return_length, **kwargs)
  File "/data/home/xuchunguang/.conda/envs/paddle2/lib/python3.6/site-packages/paddle/fluid/layers/rnn.py", line 1374, in _dynamic_decode_imperative
    step_idx_tensor, inputs, states, **kwargs)
  File "/data/home/xuchunguang/.conda/envs/paddle2/lib/python3.6/site-packages/paddlenlp/transformers/transformer/modeling.py", line 544, in step
    beam_state=states)
  File "/data/home/xuchunguang/.conda/envs/paddle2/lib/python3.6/site-packages/paddle/fluid/layers/rnn.py", line 1207, in _beam_search_step
    step_log_probs = nn.log(nn.softmax(logits))
  File "/data/home/xuchunguang/.conda/envs/paddle2/lib/python3.6/site-packages/paddle/fluid/layers/nn.py", line 8800, in log
    return _C_ops.log(x)
SystemError: (Fatal) Operator log raises an paddle::memory::allocation::BadAlloc exception. The exception content is :ResourceExhaustedError:

Out of memory error on GPU 0. Cannot allocate 28.381592MB memory on GPU 0, 10.757812GB memory has been allocated and available memory is only 3.437500MB.

Please check whether there is any other process using GPU 0.

  1. If yes, please stop them, or start PaddlePaddle on another GPU.
  2. If no, please decrease the batch size of your model.

    (at /paddle/paddle/fluid/memory/allocation/cuda_allocator.cc:79) . (at /paddle/paddle/fluid/imperative/tracer.cc:221)
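One mitigation, not mentioned in the thread but sometimes useful when a card is shared with other processes, is to shrink the fraction of GPU memory Paddle claims for its allocator pool via the documented FLAGS_fraction_of_gpu_memory_to_use flag. A rough sketch, assuming the flag is honored when set as an environment variable before Paddle starts:

```python
# Rough sketch (assumption, not taken from the thread): reserve a smaller initial
# GPU memory pool so other processes on the same card keep some headroom.
import os
os.environ["FLAGS_fraction_of_gpu_memory_to_use"] = "0.5"  # claim ~50% of the card

import paddle

paddle.device.set_device("gpu:0")
x = paddle.randn([1024, 1024])  # allocations are now served from the smaller pool
print(x.shape)
```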

haozech commented 2 years ago

Hi, your issue has been received; the relevant colleagues will answer it as soon as possible.

kismit commented 2 years ago

The problem was that the batch size was too large; after I set it smaller the issue went away. This case can probably be closed.
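For anyone landing here with the same error: the fix reported above is simply a smaller batch size. A minimal sketch with paddle.io.DataLoader follows; the dataset and the numbers are placeholders, since the real ptq.py configuration is not shown in the thread.

```python
# Minimal sketch (placeholder data, not the reporter's WMT14 pipeline):
# a smaller batch_size directly reduces the activations that beam-search
# decoding has to keep resident on the card.
import paddle
from paddle.io import DataLoader, TensorDataset

token_ids = paddle.randint(0, 33708, [128, 64])            # dummy token ids
dataset = TensorDataset([token_ids])
loader = DataLoader(dataset, batch_size=8, shuffle=False)  # was large enough to OOM before

for (src_word,) in loader:
    print(src_word.shape)  # [8, 64]
    break
```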

haozech commented 2 years ago

> The problem was that the batch size was too large; after I set it smaller the issue went away. This case can probably be closed.

Nice 👍🏻