PaddlePaddle / Paddle

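PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (core framework of 飞桨 (PaddlePaddle): high-performance single-machine and distributed training and cross-platform deployment for deep learning and machine learning)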
http://www.paddlepaddle.org/
Apache License 2.0

Out of memory reported when calling empty_cache() on a single GPU #65787

Open ninscious opened 3 months ago

ninscious commented 3 months ago

Please ask your question

We have one machine with a 4090 running a semantic segmentation task. On every run we clear GPU memory by calling empty_cache(), but occasionally an out of memory error is reported. The full message is as follows:
`Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/paddleseg/cvlibs/builder.py", line 66, in build_component
    obj = self.build_component_impl(com_class, *params)
  File "/usr/local/lib/python3.10/dist-packages/paddleseg/cvlibs/builder.py", line 80, in build_component_impl
    return component_class(args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/paddleseg/models/backbones/hrnet.py", line 802, in HRNet_W48
    model = HRNet(
  File "/usr/local/lib/python3.10/dist-packages/paddleseg/models/backbones/hrnet.py", line 97, in __init__
    self.conv_layer1_1 = layers.ConvBNReLU(
  File "/usr/local/lib/python3.10/dist-packages/paddleseg/models/layers/layer_libs.py", line 44, in __init__
    self._conv = nn.Conv2D(
  File "/usr/local/lib/python3.10/dist-packages/paddle/nn/layer/conv.py", line 690, in __init__
    super().__init__(
  File "/usr/local/lib/python3.10/dist-packages/paddle/nn/layer/conv.py", line 156, in __init__
    self.weight = self.create_parameter(
  File "/usr/local/lib/python3.10/dist-packages/paddle/nn/layer/layers.py", line 781, in create_parameter
    return self._helper.create_parameter(
  File "/usr/local/lib/python3.10/dist-packages/paddle/base/layer_helper_base.py", line 430, in create_parameter
    return self.main_program.global_block().create_parameter(
  File "/usr/local/lib/python3.10/dist-packages/paddle/base/framework.py", line 4381, in create_parameter
    initializer(param, self)
  File "/usr/local/lib/python3.10/dist-packages/paddle/nn/initializer/initializer.py", line 40, in __call__
    return self.forward(param, block)
  File "/usr/local/lib/python3.10/dist-packages/paddle/nn/initializer/normal.py", line 75, in forward
    out_var = _C_ops.gaussian(
OSError: (External) CUDA error(2), out of memory. [Hint: 'cudaErrorMemoryAllocation'. The API call failed because it was unable to allocate enough memory to perform the requested operation. ] (at ../paddle/phi/backends/gpu/cuda/cuda_info.cc:209)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/soft/paddle_out/api/seg_utils/model_predict.py", line 175, in use_cuda_device
    yield
  File "/home/soft/paddle_out/api/seg_utils/model_predict.py", line 191, in model_predict
    model, transforms = load_model_new(args_dict)
  File "/home/soft/paddle_out/api/seg_utils/model_predict.py", line 51, in load_model_new
    model = builder.model
  File "/usr/local/lib/python3.10/dist-packages/paddleseg/utils/utils.py", line 275, in __get__
    val = self.func(obj)
  File "/usr/local/lib/python3.10/dist-packages/paddleseg/cvlibs/builder.py", line 153, in model
    return self.build_component(model_cfg)
  File "/usr/local/lib/python3.10/dist-packages/paddleseg/cvlibs/builder.py", line 56, in build_component
    params[key] = self.build_component(val)
  File "/usr/local/lib/python3.10/dist-packages/paddleseg/cvlibs/builder.py", line 72, in build_component
    raise RuntimeError(
RuntimeError: Tried to create a HRNet_W48 object, but the operation has failed. Please double check the arguments used to create the object. The error message is: (External) CUDA error(2), out of memory. [Hint: 'cudaErrorMemoryAllocation'. The API call failed because it was unable to allocate enough memory to perform the requested operation. ] (at ../paddle/phi/backends/gpu/cuda/cuda_info.cc:209)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/tenacity/__init__.py", line 472, in __call__
    result = fn(*args, **kwargs)
  File "/home/soft/paddle_out/api/seg_utils/split_predict.py", line 372, in seg_split_predict
    img_result = use_paddleseg_model_to_predict_semantic_segmentation(paddleseg_args_dict,
  File "/home/soft/paddle_out/api/seg_utils/split_predict.py", line 200, in use_paddleseg_model_to_predict_semantic_segmentation
    return_dict = model_predict(paddleseg_args_dict, progress_bar=progress_bar, time_scale=time_scale)
  File "/home/soft/paddle_out/api/seg_utils/model_predict.py", line 189, in model_predict
    with use_cuda_device(0):
  File "/usr/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/home/soft/paddle_out/api/seg_utils/model_predict.py", line 178, in use_cuda_device
    paddle.device.cuda.empty_cache()
  File "/usr/local/lib/python3.10/dist-packages/paddle/device/cuda/__init__.py", line 173, in empty_cache
    core.cuda_empty_cache()
OSError: (External) CUDA error(2), out of memory. [Hint: 'cudaErrorMemoryAllocation'. The API call failed because it was unable to allocate enough memory to perform the requested operation. ] (at ../paddle/phi/backends/gpu/cuda/cuda_info.cc:209)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/soft/paddle_out/api/predict/utils.py", line 357, in seg_predict_out
    seg_split_predict(image_path_list[0],
  File "/usr/local/lib/python3.10/dist-packages/tenacity/__init__.py", line 332, in wrapped_f
    return self(f, *args, **kw)
  File "/usr/local/lib/python3.10/dist-packages/tenacity/__init__.py", line 469, in __call__
    do = self.iter(retry_state=retry_state)
  File "/usr/local/lib/python3.10/dist-packages/tenacity/__init__.py", line 370, in iter
    result = action(retry_state)
  File "/usr/local/lib/python3.10/dist-packages/tenacity/__init__.py", line 413, in exc_check
    raise retry_exc from fut.exception()
tenacity.RetryError: RetryError[<Future at 0x7ff99741cc40 state=finished raised OSError>]`
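
The chained tracebacks show that the first failure happened while building the model, and that the `empty_cache()` call made during cleanup in `use_cuda_device` then raised a second `OSError` that replaced it. A minimal sketch of a cleanup wrapper that keeps the original exception visible might look like the following. The name `guarded_device_scope` and the `release_cache` parameter are hypothetical and not taken from the reporter's code; this is an illustration under those assumptions, not a fix for the underlying out of memory condition.

```python
import contextlib
import logging

logger = logging.getLogger(__name__)


@contextlib.contextmanager
def guarded_device_scope(release_cache):
    """Run the with-body, then call release_cache() without masking errors.

    release_cache is a hypothetical zero-argument callable supplied by the
    caller, for example paddle.device.cuda.empty_cache.
    """
    try:
        yield
    finally:
        try:
            release_cache()
        except OSError as cleanup_error:
            # Log the cleanup failure instead of raising it, so that an
            # exception from the with-body (if any) propagates unchanged.
            logger.warning("cache release failed: %s", cleanup_error)
```

With Paddle installed, this would be used as `with guarded_device_scope(paddle.device.cuda.empty_cache): ...`. Swallowing the cleanup error is a design choice made here for diagnosis: the outer retry logic then sees the model-building error rather than the one raised during cleanup.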

ninscious commented 3 months ago

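This service is deployed on three clusters, and the problem always occurs on the same node. Looking at the backend monitoring, overall GPU memory usage peaked at only 25%.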
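
To compare the 25% figure with what the process itself holds at the moment of failure, one option is to log memory counters right before the model is built and before `empty_cache()` is called. The helper below is a small sketch; `format_memory_snapshot`, `readers`, and `total_bytes` are hypothetical names introduced for illustration, and the counters are passed in as callables so the helper does not depend on any particular framework.

```python
def format_memory_snapshot(readers, total_bytes=None):
    """Return a one-line summary of memory counters in MiB.

    readers maps a label to a zero-argument callable that returns bytes.
    total_bytes, if given, is the device capacity used to compute a percentage.
    """
    parts = []
    for label, read in readers.items():
        value = read()
        text = f"{label}={value / 2**20:.1f}MiB"
        if total_bytes:
            text += f" ({100 * value / total_bytes:.1f}%)"
        parts.append(text)
    return ", ".join(parts)
```

With Paddle, the readers could be `paddle.device.cuda.memory_allocated` and `paddle.device.cuda.memory_reserved`. These report only what the current process has allocated through Paddle, so memory held by other processes on the same GPU would not show up in them.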