Linux OOM killer terminates the process during human segmentation inference on a large batch of data

enemy1205 commented 1 year ago

Search before asking

[X] I have searched the question and found no related answer.

Please ask your question

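Following Readme.md and the human segmentation tutorial, I am using PaddleSeg/contrib/PP-HumanSeg/src/seg_demo.py. Because I need to segment a large batch of videos, I simplified seg_demo.py. Since I only need the binary mask, I removed many parts that are not needed: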

# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserve.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import os
import sys

import cv2
import numpy as np

__dir__ = os.path.dirname(os.path.abspath(__file__))
sys.path.append(os.path.abspath(os.path.join(__dir__, '../../../')))
from paddleseg.utils import get_sys_env, logger, get_image_list
from infer import Predictor

def parse_args():
    parser = argparse.ArgumentParser(
        description='PP-HumanSeg inference for video')
    parser.add_argument(
        "--config",
        help="The config file of the inference model.",
        type=str,
        required=True)
    parser.add_argument(
        '--video_path', help='Video path for inference', type=str)

    parser.add_argument(
        '--vertical_screen',
        help='The input image is generated by vertical screen, i.e. height is bigger than width. '
        'For the input image, we assume the width is bigger than the height by default.',
        action='store_true')
    parser.add_argument(
        '--use_post_process', help='Use post process.', action='store_true')
    parser.add_argument(
        '--use_optic_flow', help='Use optical flow.', action='store_true')
    parser.add_argument(
        '--test_speed',
        help='Whether to test inference speed',
        action='store_true')

    return parser.parse_args()

def makedirs(save_dir):
    dirname = save_dir if os.path.isdir(save_dir) else \
        os.path.dirname(save_dir)
    if not os.path.exists(dirname):
        os.makedirs(dirname)

def seg_video(predictor,video_name,video_path,save_folder):
    assert os.path.exists(video_path), \
        'The --video_path is not existed: {}'.format(video_path)
    folder_layers=video_name.split('-')
    save_folder = os.path.join(save_folder,folder_layers[0])
    save_folder = os.path.join(save_folder,folder_layers[1]+'-'+folder_layers[2])
    save_folder = os.path.join(save_folder,folder_layers[3])[:-4]
    os.makedirs(save_folder, exist_ok=True)
    cap_img = cv2.VideoCapture(video_path)
    assert cap_img.isOpened(), "Fail to open video:{}".format(video_path)
    frame_count=1
    while cap_img.isOpened():
        ret_img, img = cap_img.read()
        if not ret_img:
            break
        out = predictor.run(img)
        frame_filename = os.path.join(save_folder, f'{video_name[:-4]}-{frame_count:03d}.png')
        cv2.imwrite(frame_filename, out)
        frame_count +=1
    cap_img.release()

if __name__ == "__main__":
    args = parse_args()
    env_info = get_sys_env()
    args.use_gpu = True if env_info['Paddle compiled with cuda'] \
        and env_info['GPUs used'] else False
    save_folder = 'mysave_path'
    # my_video_path contains a large number of video files (>10000)
    video_folder = 'my_video_path'
    if not os.path.exists(save_folder):
        os.makedirs(save_folder)
    predictor = Predictor(args)
    video_names = os.listdir(video_folder)
    video_paths = [os.path.join(video_folder,name) for name in video_names]
    for name , path in zip(video_names,video_paths):
        seg_video(predictor,name,path,save_folder)
        print(f'{name} seg complete!')

No other files were modified. The inference model is human_pp_humansegv1_server_512x512_inference_model_with_softmax. Launch script:

#! /bin/bash
export CUDA_VISIBLE_DEVICES=7
python src/seg_demo.py --config inference_models/human_pp_humansegv1_server_512x512_inference_model_with_softmax/deploy.yaml

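After sending it to the background with tmux (running directly from the command line behaves the same), it runs normally for a while, using 2~3 GB of the 24 GB of GPU memory (3090 Ti). GPU memory usage and GPU utilization both look normal.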

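However, as time passes, host memory usage grows by about 1 GB every half minute and keeps accumulating, until the Linux OOM killer finally kills the process.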

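I tried this several times and also debugged it. After each video is processed, some memory is indeed released, but each video adds more memory than is released, so eventually even 250 GB+ of RAM fills up.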
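To put numbers on the per-video growth described above, resident memory can be logged around each call. The following is a minimal sketch using only the standard library; `rss_mb` and `run_with_rss_log` are hypothetical helper names that are not part of PaddleSeg, and the `/proc` read assumes Linux.

```python
import gc
import os
import resource
import sys


def rss_mb():
    """Return the resident set size of this process in MiB.

    On Linux this reads the current value from /proc/self/statm. Elsewhere it
    falls back to the *peak* RSS from getrusage, which is only an upper bound.
    """
    try:
        with open('/proc/self/statm') as f:
            resident_pages = int(f.read().split()[1])
        return resident_pages * os.sysconf('SC_PAGE_SIZE') / 2**20
    except (OSError, ValueError, IndexError):
        peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
        return peak / 2**20 if sys.platform == 'darwin' else peak / 1024


def run_with_rss_log(items, process):
    """Call process(item) for each item and record the RSS change per item."""
    deltas = []
    for item in items:
        before = rss_mb()
        process(item)
        gc.collect()  # rule out garbage Python has simply not collected yet
        after = rss_mb()
        deltas.append((item, after - before))
        print(f'{item}: RSS {after:.1f} MiB ({after - before:+.1f} MiB)')
    return deltas
```

In the script above, the loop body could be wrapped as `run_with_rss_log(video_names, lambda name: seg_video(predictor, name, os.path.join(video_folder, name), save_folder))`. If the deltas stay positive even after `gc.collect()`, the growth is probably in native allocations rather than in Python objects.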

enemy1205 commented 1 year ago

human_pp_humansegv1_server_512x512_inference_model_with_softmax comes from the link in readme.md. The model and its parameters have not been modified in any way.

enemy1205 commented 1 year ago

As shown in the figure, memory grows steadily until OOM.

shiyutang commented 1 year ago

This may be a memory leak bug. We will fix it as soon as possible.
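Until the leak is located, one possible stopgap is to process the videos in fixed-size chunks, each in its own child process, so that the OS reclaims all memory when a child exits. This is only a sketch under assumptions: `chunked` and `run_in_chunks` are hypothetical helpers, and the `--video_list` flag in the comment does not exist in seg_demo.py and would have to be added.

```python
import subprocess
import sys


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def run_in_chunks(items, size, build_cmd):
    """Run one child process per chunk and return the chunks that failed.

    build_cmd(chunk) must return the argv list for the child process. Memory
    held by a child, leaked or not, goes back to the OS when the child exits.
    """
    failed = []
    for chunk in chunked(items, size):
        result = subprocess.run(build_cmd(chunk))
        if result.returncode != 0:
            failed.append(chunk)
    return failed


# Hypothetical command builder; `--video_list` is NOT an existing option:
# build_cmd = lambda chunk: [sys.executable, 'src/seg_demo.py',
#                            '--config', 'deploy.yaml', '--video_list', *chunk]
```

The trade-off is that the model is reloaded once per chunk, so a larger chunk size amortizes the load time while a smaller one caps peak memory more tightly.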

shiyutang commented 10 months ago

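@enemy1205 We do not seem to be able to reproduce this problem on our side. Could you provide more information about your environment, such as the paddle version?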
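A small script along these lines could gather the basics in one pass. It is a sketch using only the standard library; `collect_env` is a hypothetical helper, and the package names passed to it are only examples of what might be relevant.

```python
import platform
import sys
from importlib import metadata


def collect_env(packages):
    """Return a dict describing the interpreter, OS, and package versions."""
    info = {
        'python': sys.version.split()[0],
        'platform': platform.platform(),
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = 'not installed'
    return info


if __name__ == '__main__':
    # Example package names; adjust to whatever is installed locally.
    for key, value in collect_env(
            ['paddlepaddle-gpu', 'opencv-python', 'numpy']).items():
        print(f'{key}: {value}')
```

Driver, CUDA, and cuDNN versions are not visible to this script and still need to come from `nvidia-smi`, `nvcc -V`, and `paddle.utils.run_check()`.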

enemy1205 commented 10 months ago

The figure shows a record taken during testing: 36% of the 256 GB of RAM is already in use and still rising.

$ nvidia-smi
Wed Nov 29 23:43:48 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0

Python 3.10.12 (main, Jul  5 2023, 18:54:27) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

import paddle
print(paddle.__version__)
2.5.1
paddle.fluid.is_compiled_with_cuda()
True

paddle.utils.run_check()
Running verify PaddlePaddle program ...
I1129 23:33:14.719758 2525999 interpretercore.cc:237] New Executor is Running.
W1129 23:33:14.720786 2525999 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 12.0, Runtime API Version: 11.1
W1129 23:33:14.724933 2525999 gpu_resources.cc:149] device: 0, cuDNN Version: 8.0.

$ pip list
Package            Version
------------------ -------------
anyio              4.0.0
astor              0.8.1
Babel              2.12.1
bce-python-sdk     0.8.90
blinker            1.6.2
cachetools         5.3.1
certifi            2023.7.22
charset-normalizer 3.2.0
click              8.1.7
contourpy          1.1.0
cycler             0.11.0
decorator          5.1.1
exceptiongroup     1.1.3
filelock           3.12.3
Flask              2.3.3
flask-babel        3.1.0
fonttools          4.42.1
future             0.18.3
h11                0.14.0
httpcore           0.17.3
httpx              0.24.1
idna               3.4
itsdangerous       2.1.2
Jinja2             3.1.2
joblib             1.3.2
kiwisolver         1.4.5
MarkupSafe         2.1.3
matplotlib         3.7.2
numpy              1.25.2
nvidia-ml-py       12.535.108
nvitop             1.3.0
opencv-python      4.5.5.64
opt-einsum         3.3.0
packaging          23.1
paddle-bfloat      0.1.7
paddlepaddle-gpu   2.5.1.post112
pandas             2.1.0
Pillow             10.0.0
pip                23.2.1
prettytable        3.8.0
protobuf           4.24.2
psutil             5.9.5
pycryptodome       3.18.0
pyparsing          3.0.9
python-dateutil    2.8.2
pytz               2023.3
PyYAML             6.0.1
rarfile            4.0
requests           2.31.0
scikit-learn       1.3.0
scipy              1.11.2
setuptools         68.0.0
six                1.16.0
sniffio            1.3.0
termcolor          2.3.0
threadpoolctl      3.2.0
tqdm               4.66.1
typing_extensions  4.7.1
tzdata             2023.3
urllib3            2.0.4
visualdl           2.5.3
wcwidth            0.2.6
Werkzeug           2.3.7
wheel              0.38.4

Ubuntu 22.04.2

5.19.0-42-generic
